linto-ai / whisper-timestamped

Multilingual Automatic Speech Recognition with word-level timestamps and confidence
GNU Affero General Public License v3.0

Repetitive Phrase Looping #171

Open noahuser opened 7 months ago

noahuser commented 7 months ago

I've been using Whisper-timestamped for some time and it worked flawlessly. However, after a few months away, during which I updated my Mac to Sonoma, I've run into a recurring issue on returning to the tool. The transcription process appears to proceed normally, with the loading bar reaching 100% as expected. Yet at a certain point the transcription gets stuck and begins looping the same sentence over and over until the end of the audio file. For instance, at 00:38:03 it transcribes a sentence and then repeats it in a loop until 01:30:03, when the audio ends.

Initially I suspected an issue with the audio file itself, but the problem persists across different audio files, including one that was transcribed perfectly a few months ago. Interestingly, the exact moment the loop starts varies with each attempt. I am at a loss on how to resolve this. Does anyone have any suggestions or insights?

I have already tried enabling VAD; nothing changed. I also tried uninstalling and reinstalling whisper-timestamped; nothing changed.
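
For reference, enabling VAD in whisper-timestamped looks roughly like the sketch below, following the project README; the model size, device and audio file name are placeholders.

```python
import whisper_timestamped as whisper

# Placeholders: adjust model size, device and audio path to your setup.
model = whisper.load_model("large", device="cpu")
audio = whisper.load_audio("audio.wav")

# vad=True runs voice activity detection before decoding, which can help
# avoid hallucinated repetitions on long silent or noisy segments.
result = whisper.transcribe(model, audio, vad=True)
```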

Jeronymous commented 6 months ago

This seems to be a duplicate of https://github.com/linto-ai/whisper-timestamped/issues/94. You will find some suggestions there.

Repetitions are due to model hallucination. Which model are you using?

noahuser commented 6 months ago

I am using the large model, and I have already tried everything in #94. I have two MacBooks, one with an Intel i7 and the other with an M2 Pro. I tried the same audio on the Intel one and it works perfectly, without any issue. The M2 Pro one produces these hallucinations every time. On the M2 machine I "solved" the problem by calling transcribe with these settings: result = transcribe_timestamped(model, audio_file, beam_size=5, best_of=5, temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)). It usually takes about 1 hour to transcribe the audio; with these settings it works very well, but it takes something like 6 hours per audio file. That's not a big problem, but when it used to finish in an hour it was really beautiful :)
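
Spelled out as a self-contained script, that workaround looks roughly like the sketch below; the model size, device and file name are placeholders.

```python
import whisper_timestamped as whisper

# Placeholders: adjust model size, device and audio path to your setup.
model = whisper.load_model("large", device="cpu")
audio = whisper.load_audio("audio.wav")

# The temperature tuple is a fallback schedule: segments that fail the
# decoder's quality checks are retried at higher temperatures, which
# suppresses repetition loops but makes transcription much slower.
result = whisper.transcribe_timestamped(
    model,
    audio,
    beam_size=5,
    best_of=5,
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
)
```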

Jeronymous commented 6 months ago

OK, when you say "I use the large model", you should know that there are several versions of the large model (there are now 3). So if you use model = whisper.load_model("large") in your code, without specifying the version, that might load the latest version, which could explain why the behaviour suddenly changed. You can specify the exact version to use with, e.g., model = whisper.load_model("large-v1") (or "large-v2", whichever was the latest when "it worked flawlessly").
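
For example, pinning the model version could look like the sketch below; pick whichever version behaved well for you.

```python
import whisper_timestamped as whisper

# "large" is a floating alias that may resolve to the newest release;
# pinning e.g. "large-v2" (or "large-v1") keeps the behaviour stable.
model = whisper.load_model("large-v2", device="cpu")
audio = whisper.load_audio("audio.wav")
result = whisper.transcribe(model, audio)
```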