huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
MIT License

Discrepancy on WER benchmark result in Tedlium dataset #135

Open MLMonkATGY opened 3 weeks ago

MLMonkATGY commented 3 weeks ago

Hi.

I am unable to reproduce the benchmark results from the paper for the test split of distil-whisper/tedlium with the distil-whisper/distil-large-v2 model when using run_eval.py. However, I can reproduce all the other dataset benchmarks reported in the paper to within 1% WER. Any idea what could be causing this discrepancy?

I followed the suggestion in issue 131 to use EnglishTextNormalizer instead of BasicTextNormalizer.

Reported WER from paper: 9.6%
Achieved WER: 12.69%
Difference: 3.09%

Command:

python run_eval.py \
  --model_name_or_path "distil-whisper/distil-large-v2" \
  --dataset_name "distil-whisper/tedlium" \
  --dataset_config_name "release3" \
  --dataset_split_name "test" \
  --text_column_name "text" \
  --batch_size 64 \
  --dtype "bfloat16" \
  --generation_max_length 256 \
  --language "en" \
  --attn_implementation "flash_attention_2" 

Modification: used EnglishTextNormalizer as the text normalizer.
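
For reference, a minimal sketch of how I apply the normalizer before scoring (this is not run_eval.py itself; it assumes WhisperTokenizer.normalize applies the English normalizer and uses the evaluate library, and the strings are placeholders):

import evaluate
from transformers import WhisperTokenizer

# Load the tokenizer only for its text normalizer; WhisperTokenizer.normalize
# applies the English normalizer (as opposed to basic_normalize).
tokenizer = WhisperTokenizer.from_pretrained("distil-whisper/distil-large-v2")
wer_metric = evaluate.load("wer")

# Placeholder transcriptions and references; in practice these come from the
# model outputs and the TEDLIUM "text" column.
predictions = ["hello world this is a test"]
references = ["Hello world, this is a test."]

norm_preds = [tokenizer.normalize(p) for p in predictions]
norm_refs = [tokenizer.normalize(r) for r in references]

wer = 100 * wer_metric.compute(predictions=norm_preds, references=norm_refs)
print(f"WER: {wer:.2f}%")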

Thanks in advance.

bryanyzhu commented 3 weeks ago

I'm facing the same issue; only TEDLIUM has this discrepancy.