jhj0517 / Whisper-WebUI

A Web UI for easy subtitle generation using the Whisper model.
Apache License 2.0

Subtitle generation is not working properly. #152

Open lgs777 opened 6 months ago

lgs777 commented 6 months ago

Which OS are you using?

windows 11

After the long-awaited update, I tried to generate Chinese subtitles. Partway through the audio, the subtitles start coming out as numbers only:


1286
00:25:55,660 --> 00:25:56,660
90

1287
00:25:56,660 --> 00:25:57,660
90

1288
00:25:57,660 --> 00:25:58,660
90

1289
00:25:58,660 --> 00:25:59,660
90

1290
00:25:59,660 --> 00:26:00,660
90

jhj0517 commented 6 months ago

Hi, it seems like whisper hallucination.
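To check how far such a loop extends, one option is to scan the generated SRT for runs of identical cue text. The following is a minimal sketch, not part of the WebUI: `parse_srt` and `find_repeated_runs` are hypothetical helper names, and the sample text is made up apart from the repeated "90" cues reported above.

```python
import re
from itertools import groupby

# Matches an SRT timing line such as "00:25:55,660 --> 00:25:56,660".
TIMING = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def parse_srt(srt_text):
    """Return a list of (index, text) tuples from SRT-formatted text."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        # A well-formed cue has an index line, a timing line, and text.
        if len(lines) >= 3 and lines[0].isdigit() and TIMING.fullmatch(lines[1]):
            cues.append((int(lines[0]), " ".join(lines[2:])))
    return cues


def find_repeated_runs(cues, min_run=3):
    """Return (first_index, last_index, text) for runs of identical cue text."""
    runs = []
    for text, group in groupby(cues, key=lambda cue: cue[1]):
        group = list(group)
        if len(group) >= min_run:
            runs.append((group[0][0], group[-1][0], text))
    return runs


# Made-up sample: one normal cue followed by the repeated "90" cues.
sample = """\
1285
00:25:54,660 --> 00:25:55,660
example line

1286
00:25:55,660 --> 00:25:56,660
90

1287
00:25:56,660 --> 00:25:57,660
90

1288
00:25:57,660 --> 00:25:58,660
90
"""

print(find_repeated_runs(parse_srt(sample)))
```

The `min_run` cutoff is an arbitrary choice; legitimate subtitles can repeat a short line two or three times, so a higher value may be more appropriate for real files.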

Many possible solutions are discussed here.

You can try adjusting condition_on_previous_text, no_speech_threshold, and log_probability_threshold in the "Advanced Parameters" tab of the WebUI.

Setting condition_on_previous_text to False makes the text less consistent with the preceding context, but it helps whisper escape the "loop of failures" that you experienced.

no_speech_threshold and log_probability_threshold define how "sensitive" whisper is to small sounds. In your case, this might be happening because whisper is too sensitive to small sounds.

Increasing both no_speech_threshold and log_probability_threshold would make whisper less sensitive to small sounds.
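For anyone calling the backend directly rather than through the WebUI, the same settings can be sketched as keyword arguments to faster-whisper's `WhisperModel.transcribe()`. This assumes the WebUI fields map onto these arguments; note that faster-whisper spells the second threshold `log_prob_threshold`. The helper names and the values 0.7 and -0.8 are illustrative only (faster-whisper's defaults are 0.6 and -1.0).

```python
def build_transcribe_options(
    condition_on_previous_text=False,  # do not feed previous text back in
    no_speech_threshold=0.7,           # illustrative; library default is 0.6
    log_prob_threshold=-0.8,           # illustrative; library default is -1.0
    vad_filter=False,                  # True enables the Silero VAD filter
):
    """Collect keyword arguments for WhisperModel.transcribe()."""
    return {
        "condition_on_previous_text": condition_on_previous_text,
        "no_speech_threshold": no_speech_threshold,
        "log_prob_threshold": log_prob_threshold,
        "vad_filter": vad_filter,
    }


def transcribe(audio_path, model_size="large-v2", language="zh", **overrides):
    """Transcribe a file and return (start, end, text) tuples."""
    # Imported lazily so the option builder works without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)
    segments, _info = model.transcribe(
        audio_path, language=language, **build_transcribe_options(**overrides)
    )
    return [(seg.start, seg.end, seg.text) for seg in segments]


if __name__ == "__main__":
    print(build_transcribe_options(vad_filter=True))
```

In the reference Whisper decoder, a segment is skipped as silence only when its no-speech probability is above no_speech_threshold and its average log probability is not above the log-probability threshold, so the two values interact and it may be worth experimenting with each one separately.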

*Instead of tweaking these parameters, I'll just add a vad_filter parameter that enables the Silero VAD filter for easy use.

jhj0517 commented 6 months ago

Silero VAD Filter is added in #153.

Open the "Advanced Parameters" tab in the WebUI, and check "Enable Silero VAD Filter". If the hallucination still occurs, uncheck "Condition On Previous Text".

If the hallucination still exists with the above methods, please let me know.

RYG81 commented 6 months ago

Increasing the temperature also solves this.

windo-developer commented 6 months ago

I have also recently encountered the same hallucination issue in Korean. Even when using the vad_filter and adjusting the Advanced Parameters comprehensively, the same hallucination occurs after a certain point.

In my case, I found that changing the Model to large-v2 prevents hallucinations, although the text generation quality decreases.

Previously, there were no issues even when using large-v3, so I believe there is definitely a problem with whisper.

lgs777 commented 6 months ago

@jhj0517 Your efforts are always appreciated. Thank you for your feedback.

lgs777 commented 5 months ago

> Silero VAD Filter is added in #153.
>
> Open the "Advanced Parameters" tab in the WebUI, and check "Enable Silero VAD Filter". If the hallucination still occurs, uncheck "Condition On Previous Text".
>
> If the hallucination still exists with the above methods, please let me know.

@jhj0517

The above method still causes problems. I have no problem with large-v2, but I do with large-v3. I'm extracting Chinese subtitles.

jhj0517 commented 5 months ago

@lgs777 Thanks for pointing this out, I think this is a pretty notable issue. I'll just update the default model to large-v2 for now.

cookiexND commented 5 months ago

Thank you for all your help. I was having problems with hallucination when exporting Japanese conversations, but changing to large-v2 greatly improved things. I still had a little hallucination, but raising Temperature to 0.2 eliminated it.
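One plausible reason a non-zero temperature helps: the reference Whisper decoder treats output as suspect when its zlib compression ratio exceeds a threshold (2.4 by default) and retries at a higher temperature. The sketch below illustrates that idea only; `fake_decode` is a hypothetical stand-in for the model, and the WebUI's actual behavior may differ.

```python
import zlib


def compression_ratio(text):
    """Ratio of raw to zlib-compressed size; repetitive text scores high."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(decode, temperatures=(0.0, 0.2, 0.4), max_ratio=2.4):
    """Try each temperature in order and stop at the first non-repetitive result.

    Returns (temperature, text); if every attempt is repetitive, the last
    attempt is returned.
    """
    result = None
    for temperature in temperatures:
        text = decode(temperature)
        result = (temperature, text)
        if compression_ratio(text) <= max_ratio:
            break
    return result


def fake_decode(temperature):
    # Hypothetical model: loops at temperature 0, recovers once sampling is on.
    if temperature == 0.0:
        return "90 " * 200
    return "The meeting starts at nine and ends at noon."


print(decode_with_fallback(fake_decode))
```

With the made-up decoder above, the greedy attempt is rejected as repetitive and the attempt at 0.2 is accepted, which mirrors the experience of fixing the loop by raising the temperature.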

jhj0517 commented 2 months ago

I just added BGM separation preprocessing in #267 to reduce such hallucinations.

It gave me noticeably better results in my tests when the audio includes BGM. Please feel free to share your results.

mark-wd commented 1 month ago

I was getting unusable translation results before turning on BGM separation and Silero VAD. It should be clarified that they are meant for this. A hint in the UI next to the translation toggle would go a VERY long way towards users understanding what these actually do.

jhj0517 commented 1 month ago

@mark-wd Thanks for pointing that out. I updated some labels for clearer use of submodels in #308.

If anyone has suggestions for better clarification, I'd appreciate it.