The audio is only loaded 30 seconds at a time by default, except for Faster-Whisper and HF models and `transcribe_minimal()`. So as long as the audio file is at least 30 seconds long, it should take up about the same amount of memory as a 30-second clip, unless that particular audio file or format cannot be loaded in chunks by ffmpeg.

The audio is also transcribed in 30-second chunks by default. This means it transcribes the same chunks for a 20-minute clip and for the first 20 minutes of an 11-hour file, so the transcription should be the same for those 20 minutes. If the audio wasn't loaded in chunks, the preprocessing could have caused the differing results. If that's the case, using `--beam_size 5` should make the results more consistent.
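For reference, roughly what that looks like through the Python API (a sketch only; it assumes the usual `stable_whisper.load_model()` / `transcribe()` entry points, and `--beam_size 5` on the CLI maps to the same `beam_size` decoding option):

```python
import stable_whisper

# 'turbo' here is just whatever model you are already using.
model = stable_whisper.load_model('turbo')

# beam_size=5 switches decoding from greedy search to beam search,
# which tends to make the per-chunk results more consistent.
result = model.transcribe('11hours.mp3', beam_size=5)
result.save_as_json('11hours.json')
```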
I noticed you're also running `time whisper`. If you're running stable-ts within a script, we might be able to find the root issue if you share the full script or all the exact arguments passed to stable-ts.
Thank you for your suggestion. I tried `transcribe_minimal` and Faster-Whisper, but both had the same issue with an 11-hour audio file: a sentence was missed. There was only a slight pause on either side of the sentence, so it shouldn't have been skipped. When I switched to a 20-minute audio segment, everything worked fine.

I suspect the issue is related to loading a large audio file into memory or GPU memory, causing unexpected behavior. My GPU has 8 GB of memory, which should be sufficient.
After setting `condition_on_previous_text=False`, the missing sentences came back.
By default, when `condition_on_previous_text=True`, Whisper may accumulate errors on large files, potentially leading to missing lines.
However, using `condition_on_previous_text=False` resulted in issues with overly long sentences and punctuation.
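For clarity, this is the setting I mean (a sketch assuming the standard stable-ts Python API; the CLI flag has the same name):

```python
import stable_whisper

model = stable_whisper.load_model('turbo')

# condition_on_previous_text=False stops each 30-second window from being
# conditioned on the previous window's output, which avoids accumulated
# context errors on very long files, at the cost of weaker sentence breaks
# and punctuation.
result = model.transcribe('11hours.mp3', condition_on_previous_text=False)
```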
Generally, the prefix/prompt does not cause the model to skip lines. It might be an edge case where the model is sensitive to certain audio content, such that slight changes (e.g. the prefix/prompt) cause transcription errors. Try beam search and see if it helps (e.g. `beam_size=5`).
> However, using `condition_on_previous_text=False` resulted in issues with overly long sentences and punctuation.
If beam search does not work, or `condition_on_previous_text=False` gives you the best results, you can try the following:
https://github.com/jianfch/stable-ts/blob/4c6e138922f94a48797cc9d82e4a54e2cc9b57d3/stable_whisper/whisper_word_level/cli.py#L236-L239
To do more complex splits, you can use a custom regrouping algorithm.
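For example, something along these lines (a sketch only; the method names follow the stable-ts README, and the thresholds are placeholders you would tune for your audio):

```python
import stable_whisper

model = stable_whisper.load_model('turbo')

# regroup=False keeps the raw word-level result so the custom rules below
# fully control how words are grouped into segments.
result = model.transcribe('11hours.mp3', regroup=False)

(
    result
    .clamp_max()                                          # clamp unusually long word durations
    .split_by_punctuation([('.', ' '), '。', '?', '？'])   # split at sentence-ending punctuation
    .split_by_gap(0.5)                                    # split at pauses longer than 0.5 s
    .merge_by_gap(0.15, max_words=3)                      # re-merge tiny fragments
    .split_by_length(max_words=20)                        # keep segments from getting too long
)

result.save_as_json('custom_regroup.json')
```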
After repeated attempts, I found that if I want to generate a JSON file with timestamps, adding an `initial_prompt` brings back the missing sentence.
initial_prompt='They internalize the cultural message of “It’s your fault! You should exercise more, but you aren’t doing it. Shame on you!” I am here to say: It isn’t your fault.'
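For reference, the same workaround through the Python API (a sketch; `initial_prompt` is a standard Whisper decoding option that stable-ts passes through):

```python
import stable_whisper

model = stable_whisper.load_model('turbo')

# The prompt primes the decoder with the expected style/content of the audio.
result = model.transcribe(
    '11hours.mp3',
    initial_prompt=(
        'They internalize the cultural message of “It’s your fault! '
        'You should exercise more, but you aren’t doing it. Shame on you!” '
        'I am here to say: It isn’t your fault.'
    ),
)
result.save_as_json('11hours.json')
```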
Additionally, if I only generate plain text without timestamps, the result is also correct.
time whisper '11hours.mp3' --model turbo --output_format txt
I found that directly inputting very long audio files significantly increases the error rate. If the audio is split, the pieces are processed separately in stable-ts, and the outputs are merged afterward, the results are much better.
Additionally, GPU memory usage drops noticeably, which allows more than two audio files to be processed simultaneously and improves efficiency.
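Roughly what I mean, as a sketch (the chunk length, file names, and the manual timestamp offsetting are assumptions for illustration; chunks cut mid-sentence can still cause errors at the boundaries):

```python
# Pre-cut the file into 20-minute chunks first, e.g.:
#   ffmpeg -i 11hours.mp3 -f segment -segment_time 1200 -acodec copy chunk_%03d.mp3
import glob
import json

import stable_whisper

# Assumes every chunk is exactly this long; with stream copy the cut points
# land on frame boundaries, so offsets may drift slightly.
CHUNK_SECONDS = 1200
model = stable_whisper.load_model('turbo')

merged_segments = []
for i, path in enumerate(sorted(glob.glob('chunk_*.mp3'))):
    result = model.transcribe(path)
    offset = i * CHUNK_SECONDS
    # Shift each chunk's timestamps back onto the full-file timeline.
    for seg in result.segments:
        merged_segments.append({
            'start': seg.start + offset,
            'end': seg.end + offset,
            'text': seg.text,
        })

with open('merged.json', 'w', encoding='utf-8') as f:
    json.dump(merged_segments, f, ensure_ascii=False, indent=2)
```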
======

When inputting an 11-hour MP3 file, the error rate increases significantly, and several lines were missed in the transcription of the first 20 minutes.
time whisper '11hours.mp3' --model turbo --output_format json --word_timestamps True
However, when I extracted the first 20 minutes of audio and transcribed it again, everything worked fine.
ffmpeg -i 11hours.mp3 -t 1200 -acodec copy 20_minutes.mp3
time whisper 20_minutes.mp3 --model turbo --output_format json --word_timestamps True