k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Empty or incomplete hypotheses #667

Open ncakhoa opened 2 years ago

ncakhoa commented 2 years ago

When I trained a streaming stateless Conformer transducer (transducer_stateless2), I ran into the situation described in https://github.com/k2-fsa/icefall/issues/403 during the decoding phase: decoding with fast_beam_search_nbest_LG and an LG graph gives a lot of empty hypotheses.

I tried to fix it by following the solution in https://github.com/k2-fsa/icefall/issues/403, but I couldn't find any use-max argument.
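
To quantify the problem, here is a minimal sketch that counts empty hypotheses, assuming the recogs-*.txt format that the decode scripts write (one `id: ref=[...]` line and one `id: hyp=[...]` line per utterance, as in the examples pasted later in this thread); adjust the parsing if your files differ:

```python
# Hedged sketch: count empty hypotheses in a recogs-*.txt file produced by an
# icefall decode script. The "id: hyp=[...]" line format is an assumption taken
# from the examples pasted later in this thread.
import ast
import sys


def count_empty_hyps(path: str) -> None:
    total = 0
    empty = 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if ": hyp=" not in line:
                continue
            total += 1
            # The part after "hyp=" is a Python list literal of tokens/words.
            hyp = ast.literal_eval(line.split(": hyp=", 1)[1])
            if len(hyp) == 0:
                empty += 1
    print(f"{empty}/{total} hypotheses are empty in {path}")


if __name__ == "__main__":
    count_empty_hyps(sys.argv[1])
```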

csukuangfj commented 2 years ago

Are you using the latest k2 (i.e., the master branch of k2)?

ncakhoa commented 2 years ago

I use k2 version 1.19.dev20220922

csukuangfj commented 2 years ago

> I use k2 version 1.19.dev20220922

Could you try the latest one from the master? https://k2-fsa.github.io/k2/installation/from_source.html
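
After rebuilding, a quick way to double-check which k2 build Python actually picks up is shown below; the documented command for full build details is `python3 -m k2.version`, the snippet only prints the import path:

```python
# Minimal check that the rebuilt k2 is the one being imported.
# If this still points at the old 1.19.dev20220922 wheel, the build
# from master is not the one being used.
import k2

print("k2 imported from:", k2.__file__)
# Full build details: run `python3 -m k2.version` from the shell.
```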

ncakhoa commented 2 years ago

> > I use k2 version 1.19.dev20220922
>
> Could you try the latest one from the master? https://k2-fsa.github.io/k2/installation/from_source.html

I have tried it, but it didn't reduce the number of empty hypotheses.

ncakhoa commented 2 years ago

I also tried greedy search, and it outputs the exact tokens, so I think the problem is in fast_beam_search_nbest_LG.
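
One way to make that comparison concrete (again assuming the recogs-*.txt format sketched above; the file paths below are placeholders) is to list the utterances that come back empty with fast_beam_search_nbest_LG but not with greedy search:

```python
# Hedged sketch: find utterances that are empty with fast_beam_search_nbest_LG
# but non-empty with greedy_search, by comparing two recogs-*.txt files.
# The paths and the "id: hyp=[...]" format are assumptions, not exact names.
import ast


def load_hyps(path: str) -> dict:
    hyps = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if ": hyp=" not in line:
                continue
            utt_id, rest = line.split(": hyp=", 1)
            hyps[utt_id] = ast.literal_eval(rest)
    return hyps


greedy = load_hyps("greedy_search/recogs-test-....txt")            # placeholder path
lg = load_hyps("fast_beam_search_nbest_LG/recogs-test-....txt")    # placeholder path

for utt_id, hyp in lg.items():
    if len(hyp) == 0 and len(greedy.get(utt_id, [])) > 0:
        print(utt_id, "greedy:", " ".join(greedy[utt_id]))
```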

armusc commented 1 year ago

Hi

Has anyone else experienced something like this? I'm getting similar results when the LG graph is used in decoding, that is:

head -2 beam_search/errs-test-beam_size_4-epoch-50-avg-25-beam_search-beam-size-4.txt
%WER = 17.32
Errors: 494 insertions, 738 deletions, 3337 substitutions, over 26379 reference words (22304 correct)

head -2 fast_beam_search/errs-test-beam_15.0_max_contexts_8_max_states_64-epoch-50-avg-25-beam-15.0-max-contexts-8-max-states-64.txt
%WER = 18.22
Errors: 464 insertions, 1042 deletions, 3299 substitutions, over 26379 reference words (22038 correct)

head -2 greedy_search/errs-test-greedy_search-epoch-50-avg-25-context-2-max-sym-per-frame-1.txt
%WER = 18.06
Errors: 465 insertions, 899 deletions, 3399 substitutions, over 26379 reference words (22081 correct)

head -2 modified_beam_search/errs-test-beam_size_4-epoch-50-avg-25-modified_beam_search-beam-size-4.txt
%WER = 17.45
Errors: 484 insertions, 755 deletions, 3364 substitutions, over 26379 reference words (22260 correct)

head -2 fast_beam_search_nbest/errs-test-beam_15.0_max_contexts_8_max_states_64_num_paths_100_nbest_scale_0.5-epoch-50-avg-25-beam-15.0-max-contexts-8-max-states-64-nbest-scale-0.5-num-paths-100.txt
%WER = 17.62
Errors: 485 insertions, 795 deletions, 3369 substitutions, over 26379 reference words (22215 correct)

head -2 fast_beam_search_nbest_LG/errs-test-beam_20.0_max_contexts_8_max_states_64_num_paths_200_nbest_scale_0.5_ngram_lm_scale_0.01-epoch-50-avg-25-beam-20.0-max-contexts-8-max-states-64-nbest-scale-0.5-num-paths-200-ngram-lm-scale-0.01.txt
%WER = 21.19
Errors: 1131 insertions, 859 deletions, 3600 substitutions, over 26379 reference words (21920 correct)
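
For reference, the %WER figures above are just (insertions + deletions + substitutions) / reference words; a quick check against the first and last entries:

```python
# Sanity check of the %WER numbers quoted above:
# WER = (insertions + deletions + substitutions) / reference_words.
def wer(ins: int, dels: int, subs: int, ref_words: int) -> float:
    return 100.0 * (ins + dels + subs) / ref_words


print(f"beam_search:               {wer(494, 738, 3337, 26379):.2f}%")   # 17.32%
print(f"fast_beam_search_nbest_LG: {wer(1131, 859, 3600, 26379):.2f}%")  # 21.19%
# Note the much larger insertion count in the LG run (1131 vs. 464-494 elsewhere).
```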

There's a big drop in WER with fast_beam_search_nbest_LG, and no difference whether a 2-gram or a 3-gram G is used. I stress that LG- and HLG-based decoding methods are especially useful in situations where the model needs to be adapted on text-only data, or where a word lexicon and arbitrary word pronunciations must be imposed, which is a very common scenario in industrial applications.

csukuangfj commented 1 year ago

Could you please check your errs-xxx file and see how many errors are caused by OOV words when LG is used?

armusc commented 1 year ago

Out of the 26379 words of the eval corpus, there are 438 OOV word occurrences w.r.t. the word list in L, i.e. a 1.66% OOV ratio. A rule of thumb I've been told in the past is that in closed-vocabulary ASR every OOV causes about 1.5 word errors because of side effects in recognition, so a 1.66% OOV ratio might (empirically) translate into roughly a 2.5% absolute WER degradation. Then again, there are errors among those same OOVs in the non-LG methods as well, so not everything comes from that: in the 17.62% WER from fast_beam_search_nbest decoding, there are 272 errors from those same words (I just grepped the OOV word list in the sub/del/ins errors and summed).
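
Spelling out that arithmetic (the 1.5-errors-per-OOV factor is only the quoted rule of thumb, not something measured here):

```python
# Arithmetic behind the OOV estimate above. The 1.5 errors-per-OOV factor is
# the quoted rule of thumb, not a measurement.
oov = 438
ref_words = 26379

oov_ratio = 100.0 * oov / ref_words
print(f"OOV ratio: {oov_ratio:.2f}%")                    # ~1.66%
print(f"Estimated WER impact: {1.5 * oov_ratio:.1f}%")   # ~2.5% absolute
```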

- When I use HLG-based decoding, let's say the 1best or nbest method, in conformer_ctc, I get a more reasonable 18.7-18.8% WER with the same L and G.
- I am also surprised that using a bigram or trigram G does not really change the result.
- I have a few cases where the ending part of an utterance is not decoded, which reminded me of this thread; that does not seem to happen with the other methods, but it has happened only occasionally, so I cannot really generalize this observation.

ex:

1e15a26c-6a37-45c4-abd5-c62eba481801: ref=['de', 'nieuwe', 'programmatieregeling', 'om', 'dit', 'mogelijk', 'te', 'maken']
1e15a26c-6a37-45c4-abd5-c62eba481801: hyp=['de', 'nieuwe', "programma's", 'om', 'dit', 'hoofd']

293b82f6-2407-4d08-8a27-93dc690c2313: ref=['dat', 'zou', 'niet', 'nodig', 'zijn', 'als', 'hij', 'in', 'deze', 'cockpit', 'zou', 'vliegen']
293b82f6-2407-4d08-8a27-93dc690c2313: hyp=['dat', 'zou', 'niet', 'nodig', 'zijn', 'als', 'die', 'in', 'deze', 'kop']

10681284-d597-4ca4-9ae6-e9b5d633231c: ref=['in', 'de', 'toekomst', 'willen', 'wij', 'absoluut', 'die', 'domeinscholen']
10681284-d597-4ca4-9ae6-e9b5d633231c: hyp=['in', 'de', 'toekomst', 'willen', 'wij', 'absolute', 'doet']

I do not see this in fast_beam_search_nbest (not LG-based).

Now, maybe what I could do is train your latest model, where a CTC output is combined with the transducer so that HLG decoding can be done on the CTC output, and see what I obtain.

csukuangfj commented 1 year ago

> - When I use HLG-based decoding, let's say the 1best or nbest method, in conformer_ctc, I get a more reasonable 18.7-18.8% WER with the same L and G.
> - I am also surprised that using a bigram or trigram G does not really change the result.

Do you mean it is not helpful for HLG decoding?

armusc commented 1 year ago

That remark referred to fast_beam_search_nbest_LG: using a bigram or trigram G has not changed the results there.

When I use G in first-pass HLG decoding with conformer_ctc and then rescore with a 4-gram (for example, whole-lattice rescoring), results are improved (let's say a 7-8% relative improvement in this specific case).
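
To put that relative figure in absolute terms against the 18.7-18.8% HLG baseline mentioned above (just arithmetic on the numbers quoted in this thread):

```python
# What a 7-8% relative improvement over the 18.7-18.8% HLG baseline means
# in absolute WER terms.
for baseline in (18.7, 18.8):
    for rel in (0.07, 0.08):
        print(f"{baseline:.1f}% with {rel:.0%} relative gain -> {baseline * (1 - rel):.1f}%")
```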