Subtitles start too early

jhj0517 / Whisper-WebUI

A Web UI for easy subtitle using whisper model.

Apache License 2.0

Subtitles start too early #249

Open runkzi opened 2 months ago

runkzi commented 2 months ago

Which OS are you using? Windows 10 64-bit, running the latest version, not in Docker.

The issue is this: if a line is actually spoken at, say, the 15th second of the video, its subtitle is displayed from the beginning until the next subtitle is supposed to play, instead of starting at the 15th second.

Here's an example:

836
00:33:57,240 --> 00:34:00,240
Welcome to the first Flame of Love eviction ceremony.

837
00:34:00,240 --> 00:35:12,159
It's why we're all here.

838
00:35:12,159 --> 00:35:14,550
But tonight, one of you will have to let go
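To make the symptom easier to spot across a long subtitle file, here is a minimal Python sketch that parses SRT cues and flags those that stay on screen far longer than their text plausibly needs. It is not part of Whisper-WebUI; the function names `parse_cues` and `suspicious_cues` and the thresholds (2 seconds per word, with a 3-second floor) are assumptions chosen for illustration only.

```python
import re
from datetime import timedelta

# Matches an SRT timing line such as "00:34:00,240 --> 00:35:12,159".
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)

SAMPLE = """836
00:33:57,240 --> 00:34:00,240
Welcome to the first Flame of Love eviction ceremony.

837
00:34:00,240 --> 00:35:12,159
It's why we're all here.

838
00:35:12,159 --> 00:35:14,550
But tonight, one of you will have to let go
"""


def parse_cues(srt_text):
    """Yield (index, start, end, text) for each well-formed cue in an SRT string."""
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue  # skip blocks without index, timing, and text
        match = TIMING.search(lines[1])
        if not match:
            continue
        h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
        start = timedelta(hours=h1, minutes=m1, seconds=s1, milliseconds=ms1)
        end = timedelta(hours=h2, minutes=m2, seconds=s2, milliseconds=ms2)
        yield int(lines[0]), start, end, " ".join(lines[2:])


def suspicious_cues(srt_text, max_seconds_per_word=2.0, min_allowance=3.0):
    """Return (index, duration_seconds) for cues shown much longer than expected.

    The allowance is a rough heuristic (an assumption, not a standard):
    each word gets max_seconds_per_word, with a floor of min_allowance.
    """
    flagged = []
    for index, start, end, text in parse_cues(srt_text):
        duration = (end - start).total_seconds()
        allowance = max(min_allowance, max_seconds_per_word * len(text.split()))
        if duration > allowance:
            flagged.append((index, round(duration, 3)))
    return flagged


print(suspicious_cues(SAMPLE))
```

On the three cues above, only cue 837 is flagged: a five-word line is displayed for about 72 seconds, which matches the behaviour described in this report.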

I'm running large-v3 with VAD turned on. The only setting I actually changed is Minimum Silence Duration (ms), which I set to 1000 from its original setting of 2000, to avoid subtitles repeating throughout.

I'm not sure if it's a bug or just a matter of configuration, but I would appreciate your assistance here.

jhj0517 commented 2 months ago

Hi @runkzi can you upload a sample for testing?

runkzi commented 2 months ago

Hi, a sample of what exactly? The video? The subtitles?

jhj0517 commented 2 months ago

@runkzi A sample of the video or audio file, so I can reproduce the issue. Note that large-v3 hallucinates more often than large-v2; see https://github.com/jhj0517/Whisper-WebUI/issues/152#issuecomment-2123817314

runkzi commented 2 months ago

Hi, I uploaded a cropped version of the episode in question along with the full one, in case you might need it.

Cropped: https://easyupload.io/9jozji

Full episode: https://easyupload.io/hnf3mh

Let me know if any additional information is required.

jhj0517 commented 2 months ago

Thanks for uploading. This is a Whisper hallucination, which often occurs when the audio contains not only human speech but also background music or other noise.

It was probably caused by the background music in the sample. As far as I know, the best way to reduce such hallucinations for now is to use VAD so that only the human speech parts are transcribed. (Also, large-v2 is recommended over large-v3.)

But VAD is sometimes not enough, because it occasionally detects music as speech. So I think adding background music separation as a pre-processing step would help. It's on the TODO list for now.

jhj0517 commented 1 month ago

@runkzi I just added BGM separation pre-processing in #267 to reduce such hallucinations. It gave noticeably better results in my test with your sample.

You can try it with these settings:

https://i.imgur.com/8LFyhpb.png