Closed sdrobert closed 3 years ago
Indeed. And in addition to that, they perform chunk normalization. That by itself is not a problem, but they treat each evaluation set as a single chunk, normalizing the dev and test sets with means and variances computed over the entire respective set.
As far as I know, that is another substantial departure from the standard evaluation.
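A minimal sketch of the distinction, using made-up feature values (this is an illustration of the two normalization schemes, not pytorch-kaldi's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend the test set is split into 4 chunks of 25 feature frames each,
# with deliberately different per-chunk means (values are made up).
chunks = [rng.normal(loc=i, scale=1.0, size=(25, 3)) for i in range(4)]

def normalize(x, mean, var):
    """Mean/variance normalization with a small epsilon for stability."""
    return (x - mean) / np.sqrt(var + 1e-8)

# Per-chunk normalization: each chunk uses only its own statistics.
per_chunk = [normalize(c, c.mean(axis=0), c.var(axis=0)) for c in chunks]

# Whole-set normalization: statistics are computed over the entire test
# set, so set-level information leaks into every chunk's features.
whole = np.concatenate(chunks)
whole_set = [normalize(c, whole.mean(axis=0), whole.var(axis=0))
             for c in chunks]

# The two schemes produce different features for the same chunk:
print(np.abs(per_chunk[0] - whole_set[0]).max() > 0.1)  # prints True
```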
Hi there,
While the top TIMIT scores of 13.8% and 14.9% are reproducible, they come from a non-standard evaluation in which silence phones are removed from both the reference and hypothesis transcripts before scoring (https://github.com/mravanelli/pytorch-kaldi/blob/6234b86df5ea65fe61091519d27358177b04a198/kaldi_decoding_scripts/local/score.sh). This removal yields a non-negligible decrease in PER. For reference, when Kaldi went back to including silences in its evaluation, these were its results: https://github.com/kaldi-asr/kaldi/commit/bdd752b4bf95079851bac53a22b987d1487a8899.
Best, Sean
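To make the scoring difference concrete, here is a minimal sketch (not the pytorch-kaldi scoring script) of PER computed with and without silence symbols stripped from both transcripts; the phone sequences and the "sil" label are made-up examples:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (rolling row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # deletion
                                   d[j - 1] + 1,      # insertion
                                   prev + (r != h))   # substitution
    return d[-1]

def per(ref, hyp, strip=()):
    """Phone error rate; optionally strip symbols (e.g. silence) first."""
    ref = [p for p in ref if p not in strip]
    hyp = [p for p in hyp if p not in strip]
    return edit_distance(ref, hyp) / len(ref)

# Hypothetical decoding: the only errors involve a misplaced silence.
ref = "sil hh ah l ow sil w er l d sil".split()
hyp = "sil hh ah l ow w er l sil d sil".split()

print(f"{per(ref, hyp):.3f}")                   # → 0.182
print(f"{per(ref, hyp, strip={'sil'}):.3f}")    # → 0.000
```

When the hypothesis's only mistakes involve silence placement, stripping silence hides them entirely, which is why this evaluation reports lower PERs than the standard one.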