jhj0517 / Whisper-WebUI

A Web UI for easy subtitle generation using the Whisper model.
Apache License 2.0
1.42k stars 200 forks

Precise Use of Actual Subtitles #323

Open iodides opened 1 month ago

iodides commented 1 month ago

First of all, thank you for this tool. I've been getting a lot of use out of it.

Recorded videos, movies, and music often come with a fully accurate script. However, when Whisper WebUI converts the speech to text, it often misrecognizes certain words and sentences, so manual correction is required.

It's difficult, for AI and humans alike, to catch every word of dialogue just by listening. So when an original script exists, it would be great if we could upload the script file (without timestamps) alongside the audio and have the AI synchronize the original text with the correct timing.
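To make the request concrete, here is a minimal sketch of one way to do it, not something Whisper-WebUI currently provides. It assumes the recognizer can emit word-level timestamps as `(word, start, end)` tuples, matches those words against the script with Python's `difflib`, and copies the timings onto the script text. The function names `normalize` and `transfer_timestamps` and the sample timings are hypothetical. A dedicated forced-alignment tool would likely do this more accurately.

```python
import re
from difflib import SequenceMatcher


def normalize(word):
    """Lowercase a word and strip punctuation so 'streets,' matches 'streets'."""
    word = word.lower().replace("\u2019", "'")  # treat curly apostrophes as straight ones
    return re.sub(r"[^\w']", "", word)


def transfer_timestamps(recognized, script_text):
    """Copy timings from recognized words onto the words of the original script.

    recognized: list of (word, start_seconds, end_seconds) tuples (assumed input format).
    Returns a list of (script_word, start, end); start/end stay None for script
    words that have no recognized counterpart at all.
    """
    script_words = script_text.split()
    heard = [normalize(w) for w, _, _ in recognized]
    wanted = [normalize(w) for w in script_words]
    matcher = SequenceMatcher(a=heard, b=wanted, autojunk=False)

    timed = [(w, None, None) for w in script_words]
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Words were recognized correctly: reuse their timings directly.
            for k in range(i2 - i1):
                _, start, end = recognized[i1 + k]
                timed[j1 + k] = (script_words[j1 + k], start, end)
        elif tag == "replace":
            # Words were misheard: spread the misheard span evenly over the script words.
            span_start = recognized[i1][1]
            span_end = recognized[i2 - 1][2]
            step = (span_end - span_start) / (j2 - j1)
            for k in range(j2 - j1):
                timed[j1 + k] = (
                    script_words[j1 + k],
                    span_start + k * step,
                    span_start + (k + 1) * step,
                )
    return timed


# Illustrative timings only; real values would come from the recognizer.
recognized = [
    ("Lost", 0.0, 0.5), ("in", 0.5, 0.7), ("a", 0.7, 0.8), ("maze", 0.8, 1.3),
    ("of", 1.3, 1.5), ("broken", 1.5, 2.0), ("strings", 2.0, 2.8),
]
for word, start, end in transfer_timestamps(recognized, "Lost in the maze of broken streets,"):
    print(word, start, end)
```

The even split for misheard spans is a deliberate simplification: it keeps every script word inside the right time window without claiming per-word precision.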

jhj0517 commented 1 month ago

Hi. If I understand correctly, you want the WebUI to keep your script text as-is and use the transcription only to generate the timestamps? I'm considering whether to implement this and, if so, how.

If hallucination is the problem, you can try the VAD (Voice Activity Detection) and BGM Separation filters in the WebUI.

They feed cleaner audio to Whisper, and most hallucinations disappear once the noise is removed from the audio.

iodides commented 1 month ago

Yes, I have an original script for the video. Of course, the recognition result from Whisper is excellent, but the results are not 100% accurate.

For example, if the original script is:

Lost in the maze of broken streets,
Twelve paths ahead, where will I meet,

the result from WebUI comes out as:

🎵 Lost in a maze of broken strings 🎵
🎵 To a path ahead, where will I meet? 🎵

So, I have to compare line by line and correct the text.
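Until such a feature exists, the manual comparison can at least be narrowed down to the words that differ. The sketch below uses Python's standard `difflib` for this. The helper name `word_differences` is hypothetical and the code is not part of Whisper-WebUI.

```python
from difflib import SequenceMatcher


def word_differences(script_line, recognized_line):
    """Return (recognized, script) pairs for each stretch of words that differs."""
    heard = recognized_line.split()
    wanted = script_line.split()
    matcher = SequenceMatcher(
        a=[w.lower() for w in heard],
        b=[w.lower() for w in wanted],
        autojunk=False,
    )
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep the original spelling and punctuation in the report.
            differences.append((" ".join(heard[i1:i2]), " ".join(wanted[j1:j2])))
    return differences


for heard, wanted in word_differences(
    "Lost in the maze of broken streets,",
    "Lost in a maze of broken strings",
):
    print(f"recognized {heard!r} -> script {wanted!r}")
```

Punctuation is compared as part of the word here, so a trailing comma alone is enough to flag a difference. Stripping punctuation first would make the report less noisy.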

iodides commented 1 month ago

Another sample.

Original script:

Even if fate decides to blind,
I’ll walk the path, leave doubt behind.
With every turn, I feel you near,
Summer’s light will reappear.

WebUI result:

Even if fate decides to bind
I'll walk the path leaped out behind
With every turn, I fear you're near
Sunrise light will reappear

In my case, it's music.

jhj0517 commented 1 month ago

Transcribing music is a really good use case for the Background Music Remover filter in the WebUI. If you haven't tried it yet, I recommend doing so.

Original script: leave doubt behind.
WebUI result: leaped out behind

This kind of case looks like a difficult one. You might try a higher beam_size (found in the "Advanced Parameters" tab), maybe 10. A higher beam_size slows down the transcription but can make it more accurate.

As for the feature itself, I see it as a very specific one. I will implement it if others want it as well!

iodides commented 1 month ago

Sample Music.zip