jianfch / stable-ts

Transcription, forced alignment, and audio indexing with OpenAI's Whisper
MIT License

Out of Memory Errors with ~13GB of RAM free #79

Closed. kanjieater closed this issue 1 year ago.

kanjieater commented 1 year ago

A 19-hour file, around 1 GB in size, gets killed with an OOM error. I'm running with 13 GB of RAM available.

It happens when I run the command below. It works fine for a smaller input mp3, and both whisper and whisperX manage to run this file without OOM errors.

stable-ts "$FOLDER/audio.mp3" --language Japanese --output_dir "$FOLDER/" --model large-v2 -o "$FOLDER/captions.ass"

Are there any fixes or workarounds available? I'm guessing I could use a less accurate model (though I was hoping not to).

Update: I also tried it with 20 GB available and --model medium set. The result was the same.

jianfch commented 1 year ago

You can try lowering --refine_ts_num (default: 100), or just disable refinement with --refine_ts_num 0.
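
For example, appended to the command from this issue:

```
stable-ts "$FOLDER/audio.mp3" --language Japanese --output_dir "$FOLDER/" --model large-v2 -o "$FOLDER/captions.ass" --refine_ts_num 0
```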

kanjieater commented 1 year ago

> You can try lowering --refine_ts_num (default: 100), or just disable refinement with --refine_ts_num 0.

Thanks - I'll give it a try. Could you explain more about how that parameter affects the model so I can tune it accurately? If I disable it with 0, what will be the impact?

jianfch commented 1 year ago

It seems refine_ts_num doesn't have a significant effect on memory usage. However, there does appear to be a surge in memory usage when loading the model with the default whisper function, which elevates the baseline memory usage. This should be fixed as of 0b423391e115abcb8b8fdbb581b75f5b1fc746d3. I also added --sync_empty, which can reduce memory usage during inference.
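
If you want to test that commit directly, something like this should work (a sketch: the pip-from-git syntax assumes this project's GitHub path, and the last line is just your original command plus the new flag):

```
# install stable-ts at the commit containing the fix (repo path assumed from this project's GitHub page)
pip install -U git+https://github.com/jianfch/stable-ts.git@0b423391e115abcb8b8fdbb581b75f5b1fc746d3

# rerun with --sync_empty to reduce memory usage during inference
stable-ts "$FOLDER/audio.mp3" --language Japanese --output_dir "$FOLDER/" --model large-v2 -o "$FOLDER/captions.ass" --sync_empty
```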

kanjieater commented 1 year ago

Thank you for the quick response. I tried your suggestion with the latest version. Unfortunately, there was no change; the memory still filled up quickly:

8737 Killed    stable-ts "$FOLDER/audio.mp3" --language Japanese --output_dir "$FOLDER/" --model large-v2 -o "$FOLDER/captions.ass" --sync_empty

Memory usage starts lower for a while, then climbs to that peak and crashes. It's not an immediate crash, but it happens within about 3 minutes.

jianfch commented 1 year ago

My apologies, I misread the issue; I was assuming we were talking about GPU memory. The previous solution only helps with GPU memory. It is expected that stable-ts has higher CPU memory usage than official whisper and other implementations because it stores significantly more data (in RAM) for stabilizing the timestamps. The spike and crash you're seeing might be due to stable-ts trying to generate a timestamp mask for your entire audio track at once, which would mean the spike happens before inference (--verbose should tell you whether any text is output to the console before it crashes). If this is the case, --suppress_silence False should drastically lower the RAM usage.
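
For a rough sense of scale, here is a back-of-envelope sketch assuming Whisper's standard 16 kHz float32 decoding; it only illustrates the order of magnitude, not the exact buffers stable-ts allocates:

```
# Back-of-envelope: size of a 19-hour track decoded to Whisper's 16 kHz float32 waveform
# (assumes standard Whisper preprocessing; not the exact arrays stable-ts builds for the mask)
hours = 19
samples = hours * 3600 * 16_000     # ~1.09 billion samples
waveform_gb = samples * 4 / 1e9     # float32 = 4 bytes per sample -> ~4.4 GB
print(f"{waveform_gb:.1f} GB for the raw waveform alone")
# any per-sample mask or intermediate array of similar shape adds a comparable amount
```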

kanjieater commented 1 year ago

I didn't see any output when running with the --verbose flag.

19625 Killed    stable-ts "$FOLDER/audio.mp3" --language Japanese --output_dir "$FOLDER/" --model large-v2 -o "$FOLDER/captions.ass" --sync_empty --verbose

I will try removing the --sync_empty flag (I accidentally left it in) and run again to see if --verbose shows anything. I'll try running with --suppress_silence False as well.

Update: --verbose didn't output anything, unfortunately.

20378 Killed    stable-ts "$FOLDER/audio.mp3" --language Japanese --output_dir "$FOLDER/" --model large-v2 -o "$FOLDER/captions.ass" --verbose

I also ran it with --suppress_silence false and got the same result.

22053 Killed    stable-ts "$FOLDER/audio.mp3" --language Japanese --output_dir "$FOLDER/" --model large-v2 -o "$FOLDER/captions.ass" --suppress_silence false --overwrite

Memory usage and CPU usage spike at the same time when the Out of Memory error occurs.

Just to be clear, my specs are: i9-13900KS, RTX 4070 Ti, 32 GB DDR5 RAM.

All of this is stable and working well. It runs inside WSL2 on Win11, which has access to the CPU, GPU, and RAM; resource-wise it works fine for whisper and whisperX. I've allocated additional memory to WSL2 as well.
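
For reference, a typical way to raise WSL2's memory limit is a .wslconfig file in the Windows user profile; the values below are only an illustration of the format, not my actual settings:

```
# %UserProfile%\.wslconfig (example values only)
[wsl2]
memory=28GB
swap=8GB
```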

Would you like me to send you the 1 GB file somewhere so you can see whether you can reproduce it as well? I can run smaller files successfully.

kanjieater commented 1 year ago

I just started a run on a 6-hour wav file that is 700 MB. The progress bar appeared very quickly. The progress bar never showed for my 19-hour 1 GB file, which always crashed.

Update: The 6-hour wav completed without issue.

jianfch commented 1 year ago

If you still see a spike even with --suppress_silence false, then the spike is likely from whisper.log_mel_spectrogram, which is part of Whisper's default audio loading. Passing a 19-hour-long array into whisper.log_mel_spectrogram causes a ~23 GB spike on my end. I suggest splitting that audio track into shorter tracks (see the sketch after the snippet below).

import whisper

# the spike can be reproduced by computing the log-Mel spectrogram of the full 19-hour file at once
mel = whisper.log_mel_spectrogram('audio.mp3')
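
If splitting is the route you take, one option is ffmpeg's segment muxer, which cuts the file without re-encoding (a sketch, not part of stable-ts; the one-hour chunk length and output names are only examples):

```
# split into roughly 1-hour chunks without re-encoding
ffmpeg -i audio.mp3 -f segment -segment_time 3600 -c copy chunk_%03d.mp3
```

Each chunk can then be transcribed separately and the resulting subtitle files merged afterwards.
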
kanjieater commented 1 year ago

You are correct. The input file is too large for Whisper's initial audio loading step, so I either need more RAM or an upstream fix in Whisper. Thank you for your help with this.