huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
MIT License

Short form evaluation WER % for Librispeech clean test #88

Closed guynich closed 4 months ago

guynich commented 4 months ago

Hi, I'm enjoying working with this fascinating repo.

Looking at the Stage 4 short-form evaluation, I modified the short-form evaluation bash script for the LibriSpeech clean dataset (test split) for the OpenAI Large-v2 model here and the Small model here.

The generated WER % results are higher than the Hugging Face model card evaluation WER results, which is unexpected.

E.g.:

| model | script eval/wer | HF model card WER |
|---|---|---|
| OpenAI Large-v2 | 3.1683 | 3.0004 |
| OpenAI Small | 4.0682 | 3.4322 |

Any suggestions as to what might be causing these WER differences (perhaps my short-form eval bash scripts)?
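For reference, here is a minimal sketch (not the repo's actual run_eval script) of how I would compute a comparable short-form WER with transformers + evaluate. The model id, dataset id, and normalizer choice are my assumptions, not necessarily what the bash script does:

```python
import torch
import evaluate
from datasets import load_dataset
from transformers import pipeline
from transformers.models.whisper.english_normalizer import EnglishTextNormalizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",  # or "openai/whisper-large-v2"
    torch_dtype=torch.float16 if device != "cpu" else torch.float32,
    device=device,
)

# LibriSpeech clean, test split (short-form: each utterance fits in one 30 s window)
dataset = load_dataset("librispeech_asr", "clean", split="test")

# Whisper-style English normalization before scoring; model-card numbers are
# usually reported on normalized text, so skipping this inflates WER.
normalizer = EnglishTextNormalizer(english_spelling_mapping={})
wer_metric = evaluate.load("wer")

predictions, references = [], []
for sample in dataset:
    out = asr(sample["audio"])
    predictions.append(normalizer(out["text"]))
    references.append(normalizer(sample["text"]))

wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER %: {wer:.4f}")
```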

guynich commented 4 months ago

The above table is with --language "en" in the short-form bash scripts. By removing this flag and rerunning the evaluation, the eval/wer values are lower.

E.g.:

| model | eval/wer with --language "en" | eval/wer without --language | HF model card WER |
|---|---|---|---|
| OpenAI Large-v2 | 3.1683 | 2.5685 | 3.0004 |
| OpenAI Small | 4.0682 | 3.44541 | 3.4322 |

Without the --language flag, the model is left to predict the language token itself rather than being forced to English.
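My understanding of what the flag changes, sketched against the pipeline from the snippet above (how the repo's bash script actually wires this up is an assumption on my part):

```python
# Reusing the `asr` pipeline and a `sample` from the earlier sketch.
# Forcing language="en" pins the <|en|> prompt token; without it, Whisper
# detects the language per utterance, which can change the decoded text (and WER).
out_forced = asr(
    sample["audio"],
    generate_kwargs={"language": "en", "task": "transcribe"},  # roughly --language "en"
)
out_detected = asr(sample["audio"])  # no language forced

print(out_forced["text"])
print(out_detected["text"])
```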

guynich commented 4 months ago

Added Tiny model script and result here: https://github.com/guynich/distil-whisper/tree/main/training/scripts#summary.

guynich commented 4 months ago

I'm closing this issue: the Small and Tiny model results for the HF model card and eval/wer without the --language option are sufficiently aligned for me.

(I don't understand the discrepancy in values for Large-v2, but I can leave that for now.)