m-bain / whisperX

WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
BSD 2-Clause "Simplified" License
11.95k stars 1.26k forks

Somewhat unclear instructions (Readme.md) regarding alignment model size #256

Open 7k50 opened 1 year ago

7k50 commented 1 year ago

My aim is to get relatively good timestamp accuracy (good/adequate but doesn't have to be "perfect"), but the instructions are somewhat unclear to me. Readme.md says:

For increased timestamp accuracy, at the cost of higher gpu mem, use bigger models (bigger alignment model not found to be that helpful, see paper) e.g.

whisperx examples/sample01.wav --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --batch_size 4

I am assuming that the above means to suggest these settings for good timestamp accuracy, so in other words, WAV2VEC2_ASR_LARGE_LV60K_960H is a good choice? Or does it mean to say that the addition of this align_model is not really that useful, but that the addition of large-v2 is?

Furthermore, what may be appropriate settings for batch_size (and possibly beam_size) if the goal is to have relatively good timestamp accuracy?

sorgfresser commented 1 year ago

large-v2 increases the accuracy of the transcript itself, while a larger alignment model increases the accuracy of the timestamps. A larger batch size only affects the transcript, and so does beam size. As long as the transcript quality is not too bad, batch size, beam size and large-v2 should have no effect on the alignment. Personally, I have not experienced any improvement from using a bigger align model for forced alignment. Still, if you want the best accuracy, use a larger align model (for example WAV2VEC2_ASR_LARGE_LV60K_960H). I'd say if you only want good timestamps, the default option is good enough. But that is up to you to decide.
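Concretely, following the README's own example command, the two options look like this (model names taken from the README quoted above; the default English align model is, to my understanding, WAV2VEC2_ASR_BASE_960H, but check `whisperx --help` for your installed version):

```shell
# Default align model: good enough for most timestamp needs.
whisperx examples/sample01.wav --model large-v2 --batch_size 4

# Larger align model, for the (usually marginal) extra timestamp accuracy,
# at the cost of more GPU memory:
whisperx examples/sample01.wav --model large-v2 \
    --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --batch_size 4
```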

skartekko commented 1 year ago

@sorgfresser Thanks for the info! I understand that increasing batch size and beam size may speed up processing. But does increasing batch size or beam size affect the quality of the Whisper transcription?

sorgfresser commented 1 year ago

TLDR: Beam size yes, Batch size no.

Beam size surely does. The batch size is a bit more tricky - if we use batching, we can't utilize the prompt parameter in the same way OpenAI does. According to the author of this repo it does not affect accuracy negatively. You can read the whole discussion on this in #234 Since beam size is simply a bit less greedy, it will affect it in a positive way but will require additional computation, so transcribing will take longer. Still you should note that which beam size is best depends a bit on the beam size used for training (not too much, but it can have an impact).