Open lgs777 opened 6 months ago
Hi, this looks like Whisper hallucination.
Many possible solutions are discussed here.
You can try setting condition_on_previous_text to False and adjusting the no_speech_threshold and log_probability_threshold values. You can adjust these parameters in the "Advanced Parameters" tab of the WebUI.
Setting condition_on_previous_text to False makes the text less consistent with its context, but it helps Whisper escape the "loop of failures" that you experienced.
no_speech_threshold and log_probability_threshold are the parameters that define how "sensitive" Whisper is to small sounds. In your case, this might be happening because Whisper is too sensitive to small sounds.
Increasing both no_speech_threshold and log_probability_threshold would make Whisper less sensitive to small sounds.
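As a rough illustration of the settings described above, the sketch below gathers them in one place. The helper name build_transcribe_options is hypothetical, the key names simply follow this thread, and the default values 0.6 and -1.0 are assumed to match common Whisper defaults. The library behind the WebUI may spell these options differently (openai-whisper, for example, calls the log-probability setting logprob_threshold), so this is not the WebUI's actual code.

```python
def build_transcribe_options(
    condition_on_previous_text=False,
    no_speech_threshold=0.6,
    log_probability_threshold=-1.0,
):
    """Collect the three settings from this thread into a dict (hypothetical helper)."""
    # The no-speech threshold is compared against a probability, so it must lie in [0, 1].
    if not 0.0 <= no_speech_threshold <= 1.0:
        raise ValueError("no_speech_threshold must be between 0 and 1")
    # Average log-probabilities are never positive, so a positive threshold is a mistake.
    if log_probability_threshold > 0.0:
        raise ValueError("log_probability_threshold must be <= 0")
    return {
        "condition_on_previous_text": condition_on_previous_text,
        "no_speech_threshold": no_speech_threshold,
        "log_probability_threshold": log_probability_threshold,
    }


# Example: stop conditioning on previous text and raise both thresholds slightly.
options = build_transcribe_options(
    no_speech_threshold=0.7,
    log_probability_threshold=-0.8,
)
print(options)
```

The range checks only catch obviously invalid values; which values actually suppress the hallucination has to be found by experiment on the affected audio.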
Instead of tweaking these parameters, I'll just add a vad_filter parameter that enables the Silero VAD filter for easy use.
Silero VAD Filter is added in #153.
Open the "Advanced Parameters" tab in the WebUI, and check "Enable Silero VAD Filter". If the hallucination still occurs, uncheck "Condition On Previous Text".
If the hallucination still exists with the above methods, please let me know.
Increasing the temperature also solves this.
I have also recently encountered the same hallucination issue in Korean. Even when using the vad_filter and adjusting the Advanced Parameters comprehensively, the same hallucination occurs after a certain point.
In my case, I found that changing the model to large-v2 prevents hallucinations, although the text generation quality decreases.
Previously, there were no issues even when using large-v3, so I believe there is definitely a problem with Whisper.
@jhj0517 Your efforts are always appreciated. Thank you for your feedback.
@jhj0517
The above method still causes problems. I don't have a problem with V2, but I have a problem with V3. I'm extracting Chinese subtitles.
@lgs777 Thanks for pointing this out, I think this is a pretty notable issue.
I'll just update the default model to large-v2 for now.
Thank you for all your help. I was having problems with hallucination when exporting Japanese conversations, but changing to large-v2 greatly improved the problem. I still had a little hallucination, but raising Temperature to 0.2 eliminated it.
I just added BGM separation preprocessing to reduce such hallucinations in #267.
It gave me much better results in my tests when the audio includes BGM. Please feel free to share your results.
I was getting unusable translation results before turning on BGM separation and Silero VAD. It should be clarified that they are meant for this. A hint in the UI next to the translation toggle would go a VERY long way towards users understanding what these actually do.
@mark-wd Thanks for pointing that out. I updated some labels for clearer use of submodels in #308.
If anyone has suggestions for better clarification, I'd appreciate it.
Which OS are you using?
Windows 11
After a long-awaited update, I attempted to generate Chinese subtitles. From a certain point onward, the subtitles are generated as numbers only:
1286
00:25:55,660 --> 00:25:56,660
90

1287
00:25:56,660 --> 00:25:57,660
90

1288
00:25:57,660 --> 00:25:58,660
90

1289
00:25:58,660 --> 00:25:59,660
90

1290
00:25:59,660 --> 00:26:00,660
90
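
One way to find where such a loop begins in a long subtitle file is to scan for runs of consecutive cues with identical text. The sketch below is only an illustration and not part of the WebUI: the names parse_cues and find_repeated_runs are hypothetical, the threshold of three repeats is an arbitrary assumption, and the parser is simplified (it reads only the first text line of each cue and assumes no cue is empty).

```python
import re
from itertools import groupby

# Matches a cue index, its two timestamps, and the first line of cue text.
# \s+ lets the parts sit on separate lines (standard SRT) or on a single line.
CUE_RE = re.compile(
    r"(\d+)\s+(\d{2}:\d{2}:\d{2},\d{3}) --> (\d{2}:\d{2}:\d{2},\d{3})\s+([^\n]*)"
)


def parse_cues(srt_text):
    """Return a list of (index, start, end, text) tuples found in srt_text."""
    return [
        (int(m.group(1)), m.group(2), m.group(3), m.group(4).strip())
        for m in CUE_RE.finditer(srt_text)
    ]


def find_repeated_runs(cues, min_repeats=3):
    """Return (first_index, last_index, text, count) for each run of identical text."""
    runs = []
    # groupby only merges neighbours, so each group is one consecutive run.
    for text, group in groupby(cues, key=lambda cue: cue[3]):
        group = list(group)
        if len(group) >= min_repeats:
            runs.append((group[0][0], group[-1][0], text, len(group)))
    return runs


sample = """\
1286 00:25:55,660 --> 00:25:56,660 90
1287 00:25:56,660 --> 00:25:57,660 90
1288 00:25:57,660 --> 00:25:58,660 90
1289 00:25:58,660 --> 00:25:59,660 90
1290 00:25:59,660 --> 00:26:00,660 90
"""

print(find_repeated_runs(parse_cues(sample)))
# → [(1286, 1290, '90', 5)]
```

The start timestamp of the first cue in a reported run shows roughly where in the audio the output degrades, which helps when comparing models or settings on the same file.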