huggingface / distil-whisper

Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
MIT License

Discrepancy on WER benchmark result in Tedlium dataset #135

Open MLMonkATGY opened 3 weeks ago

MLMonkATGY commented 3 weeks ago

Hi.

I am unable to reproduce the benchmark results from the paper for the test split of distil-whisper/tedlium with the distil-whisper/distil-large-v2 model when using run_eval.py. However, I can reproduce all the other dataset benchmarks reported in the paper to within 1% WER. Any idea what could be causing this discrepancy?

I followed the suggestion in issue 131 to use EnglishTextNormalizer instead of BasicTextNormalizer.

Reported WER from paper: 9.6%
Achieved WER: 12.69%
Difference: 3.09%

Command:

python run_eval.py \
  --model_name_or_path "distil-whisper/distil-large-v2" \
  --dataset_name "distil-whisper/tedlium" \
  --dataset_config_name "release3" \
  --dataset_split_name "test" \
  --text_column_name "text" \
  --batch_size 64 \
  --dtype "bfloat16" \
  --generation_max_length 256 \
  --language "en" \
  --attn_implementation "flash_attention_2" 

Modification: used EnglishTextNormalizer as the text normalizer.
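
For reference, a minimal sketch of how I apply the normalizer before scoring (this is not run_eval.py itself; it assumes WhisperTokenizer.normalize applies the English normalizer and uses the evaluate library, and the strings are placeholders):

import evaluate
from transformers import WhisperTokenizer

# Load the tokenizer only for its text normalizer; WhisperTokenizer.normalize
# applies the English normalizer (as opposed to basic_normalize).
tokenizer = WhisperTokenizer.from_pretrained("distil-whisper/distil-large-v2")
wer_metric = evaluate.load("wer")

# Placeholder transcriptions and references; in practice these come from the
# model outputs and the TEDLIUM "text" column.
predictions = ["hello world this is a test"]
references = ["Hello world, this is a test."]

norm_preds = [tokenizer.normalize(p) for p in predictions]
norm_refs = [tokenizer.normalize(r) for r in references]

wer = 100 * wer_metric.compute(predictions=norm_preds, references=norm_refs)
print(f"WER: {wer:.2f}%")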

Thanks in advance.

bryanyzhu commented 3 weeks ago

I'm facing the same issue; only TEDLIUM has this discrepancy.