k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Comparing oracle WER of CTC vs. transducer models for gigaspeech #930

Open huangruizhe opened 1 year ago

huangruizhe commented 1 year ago

Hello, we are running experiments on GigaSpeech with the pretrained models. One experiment inspects and compares the oracle WER of CTC vs. transducer models. Here is what we get:

CTC (decoding method: attention-decoder/nbest_oracle, num-paths: 1000)

dev ngram_lm_scale_0.5_attention_scale_0.7 10.41
test ngram_lm_scale_0.5_attention_scale_0.7 10.56
dev oracle_1000_nbest_scale_0.6 5.48
test oracle_1000_nbest_scale_0.6 5.89

Note that we got slightly better CTC results (10.41 & 10.56) than here, because we tuned the HLG scale and found the best value to be hlg_scale=0.52, whereas the default recipe uses 1.0.
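For context, tuning such a scale is just a one-dimensional grid search over held-out WER. A minimal sketch, assuming a hypothetical `decode_and_score(scale)` helper (not a real icefall API) that runs decoding with a given HLG weight and returns the dev WER:

```python
# Hedged sketch: pick the HLG scale that gives the lowest dev WER.
# `decode_and_score` is a stand-in for running the decode script with
# a given HLG graph weight; it is illustrative, not icefall's API.

def tune_scale(decode_and_score, scales):
    """Return (best_scale, best_wer) over a list of candidate scales."""
    results = [(decode_and_score(s), s) for s in scales]
    best_wer, best_scale = min(results)  # lowest WER wins
    return best_scale, best_wer

# Usage with a toy stand-in scoring function:
candidate_scales = [round(0.1 * i, 2) for i in range(1, 11)]  # 0.1 .. 1.0
```

A finer second pass around the best coarse value (e.g. 0.45–0.60 in steps of 0.01) is how a value like 0.52 would typically be found.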

Transducer (decoding method: modified_beam_search/fast_beam_search_nbest_oracle, num-paths: 1000)

dev beam_size_4 10.41
test beam_size_4 10.53
dev oracle_1000_nbest_scale_0.4 6.75
test oracle_1000_nbest_scale_0.4 7.26

We got slightly worse WERs (10.41 & 10.53) than here, but notice that the oracle WERs of the transducer are much worse than those of CTC. They don't even seem comparable.

I was following the recipe here to get the oracle WER for the GigaSpeech transducer. I was wondering (1) whether it is normal for transducers to have a worse oracle WER than CTC models (e.g., due to different implementations/mechanisms, or whether the same holds for LibriSpeech or other corpora); (2) whether I forgot to tune some hyperparameters (I have tuned num-paths and nbest-scale); or (3) whether I have done something wrong.
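For readers unfamiliar with the metric: the n-best oracle WER keeps, for each utterance, the hypothesis in the n-best list with the lowest word-level edit distance to the reference, then aggregates errors over the corpus. A minimal self-contained sketch (the function names are illustrative, not icefall's API):

```python
# Hedged sketch of n-best oracle WER: per utterance, take the best
# hypothesis by edit distance to the reference, then aggregate.

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two word lists
    (single-row dynamic programming)."""
    n = len(hyp)
    dp = list(range(n + 1))  # distances against the empty prefix of ref
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # deletion
                        dp[j - 1] + 1,                  # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def oracle_wer(refs, nbest_lists):
    """refs: list of reference word lists; nbest_lists: one list of
    hypothesis word lists per utterance. Returns corpus-level oracle WER."""
    errors = words = 0
    for ref, hyps in zip(refs, nbest_lists):
        errors += min(edit_distance(ref, h) for h in hyps)
        words += len(ref)
    return errors / words
```

Note that with this definition the oracle WER measures the quality of the n-best list itself, so differences between CTC and transducer oracles can reflect how diverse the paths extracted from each model's lattice are, not only model accuracy.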

csukuangfj commented 1 year ago

Are you using the latest master of k2?

huangruizhe commented 1 year ago

Maybe not the latest. I will check and get back to you.