jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

Processing Strategy for 11-Hour MP3 #407

Closed zxl777 closed 3 weeks ago

zxl777 commented 4 weeks ago

I found that directly inputting very long audio files significantly increases the error rate. If the audio is split and processed separately in stable-ts, and the outputs are merged afterward, much better results can be achieved.

Additionally, GPU memory usage has noticeably decreased, allowing for the simultaneous processing of more than two audio files, which improves efficiency.
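
The split-then-merge approach described above needs the per-chunk timestamps shifted back onto the original timeline. The following is a minimal sketch of that merge step, not part of stable-ts. The dictionary layout (`offset`, `segments`, `start`, `end`, `text`) and the function name `merge_chunk_segments` are assumptions chosen for illustration.

```python
from typing import Dict, List


def merge_chunk_segments(chunks: List[Dict]) -> List[Dict]:
    """Merge per-chunk segments into a single timeline.

    Each chunk is assumed to look like
    {"offset": seconds_from_start_of_original, "segments": [{"start", "end", "text"}, ...]}
    with segment times relative to the start of that chunk.
    """
    merged = []
    # Process chunks in timeline order, regardless of the order they finished in.
    for chunk in sorted(chunks, key=lambda c: c["offset"]):
        for seg in chunk["segments"]:
            merged.append({
                # Shift chunk-relative times onto the original file's timeline.
                "start": round(seg["start"] + chunk["offset"], 3),
                "end": round(seg["end"] + chunk["offset"], 3),
                "text": seg["text"],
            })
    return merged
```

Because the chunks are sorted by offset before merging, they can be transcribed in parallel and handed to this function in any order.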

When inputting an 11-hour MP3 file, the error rate increases significantly, and several lines were missed in the transcription of the first 20 minutes.

time whisper '11hours.mp3' --model turbo --output_format json --word_timestamps True

However, when I extracted the first 20 minutes of audio and transcribed it again, everything worked fine.

ffmpeg -i 11hours.mp3 -t 1200 -acodec copy 20_minutes.mp3

time whisper 20_minutes.mp3 --model turbo --output_format json --word_timestamps True
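
The single 20-minute extraction above can be generalized to cover the whole file. This is a rough sketch that only builds the ffmpeg argument lists and does not run them. The output naming (`chunk_000.mp3`, ...), the function name, and the assumption that the total duration is known in whole seconds are all illustrative choices.

```python
from typing import List


def ffmpeg_split_commands(src: str, total_seconds: int,
                          chunk_seconds: int = 1200) -> List[List[str]]:
    """Build one ffmpeg command per fixed-length chunk of `src`."""
    commands = []
    for index, start in enumerate(range(0, total_seconds, chunk_seconds)):
        # The last chunk may be shorter than chunk_seconds.
        length = min(chunk_seconds, total_seconds - start)
        out_name = f"chunk_{index:03d}.mp3"  # hypothetical naming scheme
        commands.append([
            "ffmpeg", "-ss", str(start), "-i", src,
            "-t", str(length), "-acodec", "copy", out_name,
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Fixed-length cuts can land in the middle of a word, so splitting at silences would likely be safer than this sketch.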

jianfch commented 4 weeks ago

By default, the audio is loaded only 30 seconds at a time, except for Faster-Whisper and HF models and transcribe_minimal(). So as long as the audio file is at least 30 seconds long, it should take up about the same amount of memory as a 30-second clip, unless that particular audio file or format cannot be loaded in chunks by ffmpeg. The audio is also transcribed in 30-second chunks by default. This means it transcribes the same chunks for a 20-minute clip and for the first 20 minutes of an 11-hour file, so the transcription should be the same for those 20 minutes. If the file wasn't loaded in chunks, the preprocessing could have caused the different results. If that's the case, using --beam_size 5 should make the results more consistent.

I noticed you're also running time whisper. If you're running stable-ts within a script, we might be able to find the root issue if you share the full script or all the exact arguments passed to stable-ts.

zxl777 commented 4 weeks ago

Thank you for your suggestion. I tried transcribe_minimal and Faster-Whisper, but both had the same issue with the 11-hour audio file: a sentence was missed. There was only a slight pause on either side of the sentence, so it shouldn’t have been skipped. When I switched to a 20-minute audio segment, everything worked fine.
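
Spotting a dropped sentence by eye is tedious on long transcripts. Below is a small sketch of one way to automate the comparison between the 20-minute transcript (used as the reference) and the matching portion of the full-file transcript. The function names and the naive sentence splitter are assumptions for illustration. It reports only sentences that are entirely absent, not ones that are merely worded differently.

```python
import difflib
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def missing_sentences(reference: str, candidate: str) -> List[str]:
    """Return sentences present in `reference` but dropped from `candidate`."""
    ref = split_sentences(reference)
    cand = split_sentences(candidate)
    matcher = difflib.SequenceMatcher(a=ref, b=cand, autojunk=False)
    missing = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete" marks reference sentences with no counterpart in the candidate.
        if tag == "delete":
            missing.extend(ref[i1:i2])
    return missing
```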

I suspect that the issue is related to loading a large audio file into memory or GPU memory, causing unexpected behavior. My GPU has 8GB of RAM, which should be sufficient.

zxl777 commented 4 weeks ago

After setting condition_on_previous_text = False, the missing sentences came back.

By default, when condition_on_previous_text = True, Whisper may accumulate errors on large files, potentially leading to missing lines.

However, using condition_on_previous_text = False resulted in issues with overly long sentences and punctuation.
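
For anyone running these experiments from a script instead of the CLI, the options discussed so far can be collected in one place. This is a hedged sketch: `build_decode_options` and `transcribe` are hypothetical helper names, and it assumes that stable-ts is installed and that `model.transcribe` accepts `beam_size`, `condition_on_previous_text`, and `word_timestamps` as keyword arguments in the same way Whisper does.

```python
from typing import Any, Dict


def build_decode_options(beam_size: int = 5,
                         condition_on_previous_text: bool = True) -> Dict[str, Any]:
    """Collect the options tried in this thread into one dict."""
    return {
        "beam_size": beam_size,
        "condition_on_previous_text": condition_on_previous_text,
        "word_timestamps": True,
    }


def transcribe(audio_path: str, model_name: str = "turbo", **overrides: Any):
    """Transcribe with stable-ts; imported lazily so the helper above works without it."""
    import stable_whisper  # assumption: the stable-ts package is installed

    model = stable_whisper.load_model(model_name)
    options = build_decode_options()
    options.update(overrides)  # e.g. condition_on_previous_text=False
    return model.transcribe(audio_path, **options)
```

Calling `transcribe("11hours.mp3", condition_on_previous_text=False)` would reproduce the setting above while keeping beam search enabled, which makes it easier to change one option at a time.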

jianfch commented 4 weeks ago

Generally, the prefix/prompt does not cause the model to skip lines. This might be an edge case where the model is sensitive to certain audio content, such that slight changes (e.g. to the prefix/prompt) cause transcription errors. Try beam search and see if it helps (e.g. beam_size=5).

Regarding "However, using condition_on_previous_text = False resulted in issues with overly long sentences and punctuation": if beam search does not work, or condition_on_previous_text=False gives you the best results, you can try the following: https://github.com/jianfch/stable-ts/blob/4c6e138922f94a48797cc9d82e4a54e2cc9b57d3/stable_whisper/whisper_word_level/cli.py#L236-L239

To do more complex splits, you can use a custom regrouping algorithm.
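
As an illustration of what a custom regrouping pass can look like, here is a plain-Python sketch that breaks word-level output into shorter segments at sentence-ending punctuation or a length cap. It is not stable-ts's own implementation or API. The word dictionary layout (`word`, `start`, `end`), the function names, and the 80-character default are assumptions.

```python
from typing import Dict, List

SENTENCE_END = (".", "?", "!")


def _flush(words: List[Dict]) -> Dict:
    """Turn a run of words into one segment spanning first start to last end."""
    text = "".join(w["word"] for w in words).strip()
    return {"start": words[0]["start"], "end": words[-1]["end"], "text": text}


def regroup_words(words: List[Dict], max_chars: int = 80) -> List[Dict]:
    """Split words into segments at sentence ends or once max_chars is reached."""
    segments = []
    current: List[Dict] = []
    for word in words:
        current.append(word)
        text = "".join(w["word"] for w in current).strip()
        if text.endswith(SENTENCE_END) or len(text) >= max_chars:
            segments.append(_flush(current))
            current = []
    if current:  # trailing words with no closing punctuation
        segments.append(_flush(current))
    return segments
```

The length cap matters most when punctuation is sparse, which is the failure mode reported with condition_on_previous_text=False.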

zxl777 commented 4 weeks ago

After continuous attempts, I found that if I want to generate a JSON file with timestamps, adding an initial_prompt brings back the missing sentence.

initial_prompt='They internalize the cultural message of “It’s your fault! You should exercise more, but you aren’t doing it. Shame on you!” I am here to say: It isn’t your fault.'

Additionally, if I only generate plain text without timestamps, the result is also correct.

time whisper '11hours.mp3' --model turbo --output_format txt