Hi,

I am unable to reproduce the benchmark result from the paper for the test split of `distil-whisper/tedlium` with the `distil-whisper/distil-large-v2` model when using `run_eval.py`. However, I can reproduce all the other dataset benchmarks reported in the paper to within a 1% WER difference. Any idea what could have caused this discrepancy?

I followed the suggestion in issue 131 to use `EnglishTextNormalizer` instead of `BasicTextNormalizer`.

Reported WER from paper: 9.6%
Achieved WER: 12.69%
Difference: 3.09%
Command:

Modification: used `EnglishTextNormalizer` as the text normalizer, as in the sketch below.
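For reference, this is roughly how I apply the changed normalizer when scoring — a minimal standalone sketch, not the actual `run_eval.py` code, assuming `jiwer` for the WER computation and the Whisper tokenizer's built-in `normalize()` from `transformers`; the reference/prediction strings are made up:

```python
# Minimal standalone sketch (not the actual run_eval.py code): score one
# made-up reference/prediction pair with EnglishTextNormalizer.
from jiwer import wer
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("distil-whisper/distil-large-v2")

reference = "it 's the best talk i 've seen at ted"  # made-up example strings
prediction = "It's the best talk I've seen at TED."

# normalize() applies EnglishTextNormalizer (lowercasing, punctuation removal,
# English spelling/contraction normalization) before WER is computed.
norm_ref = tokenizer.normalize(reference)
norm_pred = tokenizer.normalize(prediction)

# What I used previously: basic_normalize() applies BasicTextNormalizer instead.
# norm_ref = tokenizer.basic_normalize(reference)
# norm_pred = tokenizer.basic_normalize(prediction)

print(f"WER: {100 * wer(norm_ref, norm_pred):.2f}%")
```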
Thanks in advance.