k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Comparing oracle WER of CTC vs. transducer models for gigaspeech #930

Open huangruizhe opened 1 year ago

huangruizhe commented 1 year ago

Hello, we are running experiments on GigaSpeech with the pretrained models. One experiment inspects and compares the oracle WER of CTC vs. transducer models. Here is what we get:

CTC (decoding method: attention-decoder/nbest_oracle, num-paths: 1000)

dev ngram_lm_scale_0.5_attention_scale_0.7 10.41
test ngram_lm_scale_0.5_attention_scale_0.7 10.56
dev oracle_1000_nbest_scale_0.6 5.48
test oracle_1000_nbest_scale_0.6 5.89

Note that we got slightly better CTC results (10.41 & 10.56) than here, because we tuned the HLG scale and found the best value to be hlg_scale=0.52, whereas the default recipe uses 1.0.
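For context, tuning such a scale is just a one-dimensional grid search over held-out WER. A minimal sketch, assuming a hypothetical `decode_and_score(scale)` helper (not a real icefall API) that runs decoding with a given HLG weight and returns the dev WER:

```python
# Hedged sketch: pick the HLG scale that gives the lowest dev WER.
# `decode_and_score` is a stand-in for running the decode script with
# a given HLG graph weight; it is illustrative, not icefall's API.

def tune_scale(decode_and_score, scales):
    """Return (best_scale, best_wer) over a list of candidate scales."""
    results = [(decode_and_score(s), s) for s in scales]
    best_wer, best_scale = min(results)  # lowest WER wins
    return best_scale, best_wer

# Usage with a toy stand-in scoring function:
candidate_scales = [round(0.1 * i, 2) for i in range(1, 11)]  # 0.1 .. 1.0
```

A finer second pass around the best coarse value (e.g. 0.45–0.60 in steps of 0.01) is how a value like 0.52 would typically be found.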

Transducer (decoding method: modified_beam_search/fast_beam_search_nbest_oracle, num-paths: 1000)

dev beam_size_4 10.41
test beam_size_4 10.53
dev oracle_1000_nbest_scale_0.4 6.75
test oracle_1000_nbest_scale_0.4 7.26

We got slightly worse WERs (10.41 & 10.53) than here, but notice that the oracle WERs of the transducer are much worse than those of CTC. They don't even seem comparable.

I was following the recipe here to get the oracle WER for the GigaSpeech transducer. I was wondering (1) whether it is normal for transducers to have a worse oracle WER than CTC models (e.g., due to different implementations/mechanisms, or whether the same holds for LibriSpeech or other corpora); (2) whether I forgot to tune some hyperparameters (I have tuned num-paths and nbest-scale); or (3) whether I have done something wrong.
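For readers unfamiliar with the metric: the n-best oracle WER keeps, for each utterance, the hypothesis in the n-best list with the lowest word-level edit distance to the reference, then aggregates errors over the corpus. A minimal self-contained sketch (the function names are illustrative, not icefall's API):

```python
# Hedged sketch of n-best oracle WER: per utterance, take the best
# hypothesis by edit distance to the reference, then aggregate.

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two word lists
    (single-row dynamic programming)."""
    n = len(hyp)
    dp = list(range(n + 1))  # distances against the empty prefix of ref
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # deletion
                        dp[j - 1] + 1,                  # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def oracle_wer(refs, nbest_lists):
    """refs: list of reference word lists; nbest_lists: one list of
    hypothesis word lists per utterance. Returns corpus-level oracle WER."""
    errors = words = 0
    for ref, hyps in zip(refs, nbest_lists):
        errors += min(edit_distance(ref, h) for h in hyps)
        words += len(ref)
    return errors / words
```

Note that with this definition the oracle WER measures the quality of the n-best list itself, so differences between CTC and transducer oracles can reflect how diverse the paths extracted from each model's lattice are, not only model accuracy.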

csukuangfj commented 1 year ago

Are you using the latest master of k2?

huangruizhe commented 1 year ago

Maybe not the latest. I will check and get back to you.