m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License

Any recommended lines for better results? #534

Open ghost opened 1 year ago

ghost commented 1 year ago

Hello, this is not an issue.

I am currently using this line to run whisperx: whisperx --model large-v2 --language en --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --batch_size 4 --output_format srt

I want to know if I can make it more accurate. Any recommended additions to the command, with explanations, please?

I have a 3080 10GB. I think I am using the GPU, but I am not sure. If there is an option to make it run on the GPU, please let me know.

davidlandais commented 1 year ago

Install nvidia-smi, since you are using an Nvidia RTX graphics card; it will help you check the memory load. Check that you have it with which nvidia-smi. If you get a path back, you can then use watch -n 1 nvidia-smi, which refreshes the output every second.
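
If you'd rather verify from Python than from nvidia-smi, whisperX runs on top of PyTorch, so a quick sanity check is possible (a minimal sketch, assuming torch is importable in the same environment):

```python
import torch

# False here means whisperx is silently falling back to CPU.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    # e.g. "NVIDIA GeForce RTX 3080"
    print(torch.cuda.get_device_name(0))
```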

Then start your command in another terminal. The model should load into memory, and the card should exhale air like an elephant, with a load near 12GB. Also, passing --align_model WAV2VEC2_ASR_LARGE_LV60K_960H is useless, as it is the default model; check whisperx/alignment.py line 25.

I'm also using WhisperX with CUDA. Here are my parameters: --model=large-v2 --device=cuda --device_index=0 --batch_size=16 --compute_type=float32

device and device_index are important. Reduce batch_size if memory is overloaded. Since you are using a GPU, don't hesitate: try the best precision possible with float32.
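
For reference, the same settings look roughly like this through the Python API (a sketch following the load_model/transcribe interface from the whisperX README; "audio.mp3" is a placeholder path):

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.mp3")  # placeholder input file

# Mirrors --model=large-v2 --device=cuda --device_index=0 --compute_type=float32
model = whisperx.load_model("large-v2", device, device_index=0,
                            compute_type="float32")

# Mirrors --batch_size=16; lower it if nvidia-smi shows memory near the limit
result = model.transcribe(audio, batch_size=16)
print(result["segments"])
```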

If you want to check all of the parameters, look at whisperx/transcribe.py, but try whisperx --help first.

ghost commented 1 year ago

Thank you so much for your excellent explanation.

ghost commented 1 year ago

@davidlandais Is there a fix for when the speech ends but the subtitles keep going until the next speech?

davidlandais commented 1 year ago

I am not certain that I fully understand your request. Based on what you're telling me, here's what I imagine: you have a 2-minute video. During the first 10 seconds, a man (or woman) speaks for 8 seconds. At the 10th second, someone responds. And you're wondering why, for 2 seconds after the first person has finished speaking, the subtitle continues to display.

Honestly, I don't know. I don't have this problem; on the contrary, the subtitles disappear too quickly for my taste. If you can send me the audio file you're using (https://filebin.net/), I can try to run it through my system.

Otherwise, I don't think it's much of a problem. It helps the viewer carry context into the next subtitle line: even if it stays for 2 seconds, at least you can make the mental connection between the previous subtitle and the new one. Looking forward to hearing from you.
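
If the lingering end times do bother you, one possible post-processing workaround (a sketch, not a built-in whisperX option; it assumes the aligned segments carry a "words" list with "end" timestamps, which is the shape whisperx.align returns) is to clamp each subtitle's end to its last aligned word:

```python
def clamp_segment_ends(segments, pad=0.2):
    """Pull each subtitle's end back to its last aligned word, plus a small pad."""
    for seg in segments:
        # Words that failed alignment may lack an "end" key, so filter them out.
        timed = [w for w in seg.get("words", []) if "end" in w]
        if timed:
            seg["end"] = min(seg["end"], timed[-1]["end"] + pad)
    return segments
```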

ghost commented 1 year ago

I think what you've explained is the reason for this. It connects two sentences, and it usually happens when two speakers are talking simultaneously. It's a really easy fix, and sometimes it doesn't even need editing. Another easy-to-fix issue is that, occasionally, the subtitle shows up before the sound: for example, a subtitle spans 5 seconds but the speech only starts at the 4th second.

However, sometimes a subtitle appears at a point in the video where nobody is speaking, and this causes 4 to 5 subtitles to be placed wrongly, so I have to find the actual speech to set them right. After finding the first speech, this is easy to fix too.

I am guessing there is no way to prevent these, because they do not happen when speakers are speaking clear English and there is no background noise to alter the words.

APISeeker commented 9 months ago

Hello to both of you, do you know of or have a method to transform the JSON and dictionary outputs into an understandable format such as SRT? I find the raw output too verbose.
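
For anyone with the same question: the CLI already writes SRT directly via --output_format srt (as in the command at the top of this thread). If you only have the JSON output, here is a minimal conversion sketch (assuming the usual whisperX JSON shape: a "segments" list whose items have "start"/"end" in seconds and a "text" string):

```python
import json

def fmt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def json_to_srt(json_path: str, srt_path: str) -> None:
    with open(json_path, encoding="utf-8") as f:
        result = json.load(f)
    lines = []
    for i, seg in enumerate(result["segments"], start=1):
        lines.append(str(i))
        lines.append(f"{fmt_time(seg['start'])} --> {fmt_time(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")  # blank line terminates each cue
    with open(srt_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))

json_to_srt("audio.json", "audio.srt")  # placeholder paths
```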