Closed sdrobert closed 3 years ago
Indeed. And in addition to that, they perform chunk normalization. That by itself is not a problem, but they treat each evaluation set as a single chunk, normalizing the dev and test sets with means and variances computed over the entire respective set.
As far as I know, that is another substantial departure from the standard evaluation.
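A minimal sketch of the distinction, using made-up feature values (this is an illustration of the two normalization schemes, not pytorch-kaldi's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend the test set is split into 4 chunks of 25 feature frames each,
# with deliberately different per-chunk means (values are made up).
chunks = [rng.normal(loc=i, scale=1.0, size=(25, 3)) for i in range(4)]

def normalize(x, mean, var):
    """Mean/variance normalization with a small epsilon for stability."""
    return (x - mean) / np.sqrt(var + 1e-8)

# Per-chunk normalization: each chunk uses only its own statistics.
per_chunk = [normalize(c, c.mean(axis=0), c.var(axis=0)) for c in chunks]

# Whole-set normalization: statistics are computed over the entire test
# set, so set-level information leaks into every chunk's features.
whole = np.concatenate(chunks)
whole_set = [normalize(c, whole.mean(axis=0), whole.var(axis=0))
             for c in chunks]

# The two schemes produce different features for the same chunk:
print(np.abs(per_chunk[0] - whole_set[0]).max() > 0.1)  # prints True
```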
Hi there,
While the top TIMIT scores of 13.8% and 14.9% are reproducible, they come from a non-standard evaluation in which silence phones are removed from both the reference and hypothesis transcripts before scoring (https://github.com/mravanelli/pytorch-kaldi/blob/6234b86df5ea65fe61091519d27358177b04a198/kaldi_decoding_scripts/local/score.sh). This removal yields a non-negligible decrease in PER. For reference, when Kaldi went back to including silences in its evaluation, these were its results: https://github.com/kaldi-asr/kaldi/commit/bdd752b4bf95079851bac53a22b987d1487a8899.
Best, Sean
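To make the scoring difference concrete, here is a minimal sketch (not the pytorch-kaldi scoring script) of PER computed with and without silence symbols stripped from both transcripts; the phone sequences and the "sil" label are made-up examples:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (rolling row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # deletion
                                   d[j - 1] + 1,      # insertion
                                   prev + (r != h))   # substitution
    return d[-1]

def per(ref, hyp, strip=()):
    """Phone error rate; optionally strip symbols (e.g. silence) first."""
    ref = [p for p in ref if p not in strip]
    hyp = [p for p in hyp if p not in strip]
    return edit_distance(ref, hyp) / len(ref)

# Hypothetical decoding: the only errors involve a misplaced silence.
ref = "sil hh ah l ow sil w er l d sil".split()
hyp = "sil hh ah l ow w er l sil d sil".split()

print(f"{per(ref, hyp):.3f}")                   # → 0.182
print(f"{per(ref, hyp, strip={'sil'}):.3f}")    # → 0.000
```

When the hypothesis's only mistakes involve silence placement, stripping silence hides them entirely, which is why this evaluation reports lower PERs than the standard one.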