huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.

What are the settings used for WER calculation in the paper? #65

Open hidoba opened 6 months ago

hidoba commented 6 months ago

Did you compare Whisper-large-v2 and Distil-Whisper using the Transformers default settings (beam size = 1, temperature = 1, do_sample = False)?

What would the difference be if you had used the OpenAI settings (beam size = 5)?
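For reference, a minimal sketch of how the two decoding configurations could be compared with the Transformers ASR pipeline. The model ids and the audio file are illustrative assumptions, not taken from this thread; adjust them to your setup.

```python
from transformers import pipeline

# Placeholder model id; "openai/whisper-large-v2" can be swapped in for the teacher.
asr = pipeline(
    "automatic-speech-recognition",
    model="distil-whisper/distil-large-v2",
    device="cuda:0",
)

# Transformers default decoding: greedy search, no sampling.
greedy = asr("sample.wav", generate_kwargs={"num_beams": 1, "do_sample": False})

# OpenAI-style decoding: beam search with 5 beams.
beam = asr("sample.wav", generate_kwargs={"num_beams": 5})

print(greedy["text"])
print(beam["text"])
```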

sanchit-gandhi commented 5 months ago

Yes, we evaluated using greedy search with no sampling. For beam size = 5, we see the following (with the absolute WER reduction vs. greedy):

- Whisper-Large-v2 with num_beams=5
- Distil-Whisper with num_beams=5
- Relative speed-up of Distil-Whisper over Whisper for increasing batch size (bsz)

=> the speed-ups are very similar to what we achieved without beam search
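If it helps, here is a minimal sketch of how such a WER comparison can be scored. It assumes a recent transformers version where WhisperTokenizer exposes a public normalize() method (the Whisper English text normalizer); the reference and prediction strings are placeholders.

```python
import evaluate
from transformers import WhisperTokenizer

wer_metric = evaluate.load("wer")
tokenizer = WhisperTokenizer.from_pretrained("distil-whisper/distil-large-v2")

references = ["the cat sat on the mat"]    # ground-truth transcripts (placeholder)
predictions = ["The cat sat on the mat."]  # model output, greedy or beam (placeholder)

# Normalize both sides (casing, punctuation, etc.) before scoring,
# otherwise formatting differences inflate the WER.
norm_refs = [tokenizer.normalize(r) for r in references]
norm_preds = [tokenizer.normalize(p) for p in predictions]

wer = 100 * wer_metric.compute(references=norm_refs, predictions=norm_preds)
print(f"WER: {wer:.2f}%")
```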