huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
MIT License

Short form evaluation WER % for Librispeech clean test #88

Closed guynich closed 4 months ago

guynich commented 4 months ago

Hi, I'm enjoying working with this fascinating repo.

Looking at the Stage 4 short-form evaluation, I modified the short-form evaluation bash script for the LibriSpeech clean dataset (test split) for the OpenAI Large-v2 model here and the Small model here.

The generated WER % results are higher than the Hugging Face model card evaluation WER results, which is unexpected.

E.g.:

| model | script eval/wer | HF model card WER |
|---|---|---|
| OpenAI Large-v2 | 3.1683 | 3.0004 |
| OpenAI Small | 4.0682 | 3.4322 |

Any suggestions as to what might be causing these WER differences (perhaps my short-form eval bash scripts)?
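For reference, here is a minimal sketch (not the repo's actual run_eval script) of how I would compute a comparable short-form WER with transformers + evaluate. The model id, dataset id, and normalizer choice are my assumptions, not necessarily what the bash script does:

```python
import torch
import evaluate
from datasets import load_dataset
from transformers import pipeline
from transformers.models.whisper.english_normalizer import EnglishTextNormalizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",  # or "openai/whisper-large-v2"
    torch_dtype=torch.float16 if device != "cpu" else torch.float32,
    device=device,
)

# LibriSpeech clean, test split (short-form: each utterance fits in one 30 s window)
dataset = load_dataset("librispeech_asr", "clean", split="test")

# Whisper-style English normalization before scoring; model-card numbers are
# usually reported on normalized text, so skipping this inflates WER.
normalizer = EnglishTextNormalizer(english_spelling_mapping={})
wer_metric = evaluate.load("wer")

predictions, references = [], []
for sample in dataset:
    out = asr(sample["audio"])
    predictions.append(normalizer(out["text"]))
    references.append(normalizer(sample["text"]))

wer = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER %: {wer:.4f}")
```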

guynich commented 4 months ago

The above table is with --language "en" in the short-form bash scripts. By removing this flag and rerunning the evaluation, the eval/wer values are lower.

E.g.:

| model | eval/wer with --language "en" | eval/wer without --language | HF model card WER |
|---|---|---|---|
| OpenAI Large-v2 | 3.1683 | 2.5685 | 3.0004 |
| OpenAI Small | 4.0682 | 3.44541 | 3.4322 |

Without the --language flag, the model is left to predict the language token itself rather than being forced to English.
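My understanding of what the flag changes, sketched against the pipeline from the snippet above (how the repo's bash script actually wires this up is an assumption on my part):

```python
# Reusing the `asr` pipeline and a `sample` from the earlier sketch.
# Forcing language="en" pins the <|en|> prompt token; without it, Whisper
# detects the language per utterance, which can change the decoded text (and WER).
out_forced = asr(
    sample["audio"],
    generate_kwargs={"language": "en", "task": "transcribe"},  # roughly --language "en"
)
out_detected = asr(sample["audio"])  # no language forced

print(out_forced["text"])
print(out_detected["text"])
```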

guynich commented 4 months ago

Added Tiny model script and result here: https://github.com/guynich/distil-whisper/tree/main/training/scripts#summary.

guynich commented 4 months ago

I'm closing this issue: the Small and Tiny model results for the HF model card and eval/wer without the --language option are sufficiently aligned for me.

(I don't understand the discrepancy in values for Large-v2, but I can leave that for now.)