k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Decode vs Streaming Decode #807

Open · teowenshen opened this issue 1 year ago

teowenshen commented 1 year ago

I have a Japanese model using the streaming Zipformer, trained with the command below, and I noticed a difference of roughly 1.0% (absolute) between my decoding results from decode.py and streaming_decode.py.

./pruned_transducer_stateless7_streaming/train.py \
  --context-size 3 \
  --world-size 8 \
  --num-epochs 30 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_3 \
  --max-duration 250 \
  --transcript-mode disfluent \
  --word-table lang_char_disfluent/words.txt \
  --manifest-dir data/manifests_dis \
  --musan-dir /mnt/host/corpus/musan/musan/fbank

I just noticed that the default decoding parameters in decode.py and streaming_decode.py are different, so the gap is probably due to those defaults rather than to the additions in the new recipe (?).

| test set | decode.py | streaming_decode.py |
|----------|-----------|---------------------|
| eval1    | 6.47      | 5.7                 |
| eval2    | 5.04      | 4.31                |
| eval3    | 5.59      | 4.63                |
| excluded | 7.33      | 6.15                |

Below are my decoding commands. The default values are the same as in the original scripts; I have only exposed params.res_dir and added params.gpu.

python pruned_transducer_stateless7_streaming/decode.py \
  --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_3 \
  --context-size 3 \
  --epoch 30 \
  --avg 10 \
  --max-duration 150 \
  --decoding-method fast_beam_search \
  --manifest-dir data/manifests_dis \
  --word-table lang_char_disfluent/words.txt \
  --transcript-mode disfluent \
  --res-dir pruned_transducer_stateless7_streaming/exp_disfluent_3/fast_chunk32_beam4_states32_contexts16 \
  --decode-chunk-len 32 \
  --beam 4 \
  --max-states 32 \
  --max-contexts 16 \
  --gpu 0

python pruned_transducer_stateless7_streaming/streaming_decode.py \
  --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_3 \
  --context-size 3 \
  --epoch 30 \
  --avg 10 \
  --max-duration 150 \
  --decoding-method fast_beam_search \
  --manifest-dir data/manifests_dis \
  --word-table lang_char_disfluent/words.txt \
  --transcript-mode disfluent \
  --res-dir pruned_transducer_stateless7_streaming/exp_disfluent_3/fasts_chunk32_beam4_states32_contexts16 \
  --decode-chunk-len 32 \
  --beam 4 \
  --max-states 32 \
  --max-contexts 16 \
  --gpu 0

Also, what do beam and max-states mean in fast_beam_search? If I'm not mistaken, max-contexts limits the number of contexts returned by k2.RnntDecodingStreams.get_contexts() per utterance?
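
For reference, here is a minimal sketch of where these flags end up in k2 (assuming a trivial decoding graph and an illustrative vocab_size, not the values from the experiment above): roughly speaking, beam is a log-probability margin kept around the best path, while max-states and max-contexts cap how many decoding-graph states and decoder contexts survive pruning per stream per frame.

```python
# Minimal sketch: how --beam / --max-contexts / --max-states map onto
# k2.RnntDecodingConfig. The trivial graph and vocab_size here are
# placeholder assumptions for illustration.
import k2

vocab_size = 500      # assumption: size of the model's output vocabulary
context_size = 3      # matches --context-size above

decoding_graph = k2.trivial_graph(vocab_size - 1, device="cpu")

config = k2.RnntDecodingConfig(
    vocab_size=vocab_size,
    decoder_history_len=context_size,
    beam=4.0,          # --beam: log-prob margin kept around the best path
    max_contexts=16,   # --max-contexts: decoder contexts kept per stream/frame
    max_states=32,     # --max-states: graph states kept per stream/frame
)

# One k2.RnntDecodingStream per utterance; get_contexts() then returns the
# surviving contexts across all streams at the current frame.
decoding_streams = k2.RnntDecodingStreams(
    [k2.RnntDecodingStream(decoding_graph)], config
)
```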

csukuangfj commented 1 year ago

Could you compare the errs-* files and see if there are any error patterns, e.g., are there many deletion errors at the end of utterances?

teowenshen commented 1 year ago

Just for the record, I am comparing using the default settings of streaming_decode and decode. Please let me know if this comparison doesn't make sense.

  1. streaming_decode/errs-eval1-txt-beam_4_max_contexts_4_max_states_32-epoch-30-avg-10-streaming-chunk-size-32-beam-4-max-contexts-4-max-states-32-use-averaged-model.txt
  2. decode/errs-eval1-txt-beam_20.0_max_contexts_8_max_states_64-epoch-30-avg-10-streaming-chunk-size-32-beam-20.0-max-contexts-8-max-states-64-use-averaged-model.txt
  3. (truncate) decode/errs-eval1-txt-beam_20.0_max_contexts_8_max_states_64-epoch-30-avg-10-streaming-chunk-size-32-beam-20.0-max-contexts-8-max-states-64-use-averaged-model.txt

where in 3 I removed the padding in decode.py.

| testset  | streaming_decode | decode | (truncate) decode |
|----------|------------------|--------|-------------------|
| eval1    | 5.91             | 6.35   | 7.04              |
| eval2    | 4.29             | 4.73   | 5.69              |
| eval3    | 4.55             | 5.16   | 6.04              |
| excluded | 6.11             | 7.24   | 7.86              |

There are noticeably more insertions at the end of utterances in the decode case, but (this time) when the padding is removed, the CER worsens due to deletions at the end and empty hypotheses (the padding in question is sketched below).
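
For context, the tail padding removed in the "(truncate)" runs is roughly the following (paraphrased from the upstream librispeech pruned_transducer_stateless7_streaming recipe; the exact pad length may differ):

```python
# Sketch of the tail padding applied in decode.py (paraphrased; constants
# follow the upstream librispeech recipe and may differ here). Extra
# LOG_EPS frames are appended so the streaming encoder has enough right
# context to emit tokens for the last real frames; the "(truncate)" rows
# above correspond to removing this padding.
import math
import torch

LOG_EPS = math.log(1e-10)

feature = torch.randn(2, 100, 80)        # dummy fbank: (batch, time, bins)
feature_lens = torch.tensor([100, 80])

feature_lens += 30
feature = torch.nn.functional.pad(
    feature, pad=(0, 0, 0, 30), value=LOG_EPS  # pad 30 frames at the end
)
```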

When compared using the same decoding arguments:

  1. streaming_decode/errs-eval1-txt-beam_4.0_max_contexts_16_max_states_32-epoch-30-avg-10-streaming-chunk-size-32-beam-4.0-max-contexts-16-max-states-32-use-averaged-model.txt
  2. decode/errs-eval1-txt-beam_4.0_max_contexts_16_max_states_32-epoch-30-avg-10-streaming-chunk-size-32-beam-4.0-max-contexts-16-max-states-32-use-averaged-model.txt
  3. (truncate) decode/errs-eval1-txt-beam_4.0_max_contexts_16_max_states_32-epoch-30-avg-10-streaming-chunk-size-32-beam-4.0-max-contexts-16-max-states-32-use-averaged-model.txt

| testset  | streaming_decode | decode | (truncate) decode |
|----------|------------------|--------|-------------------|
| eval1    | 5.7              | 6.47   | 30.24             |
| eval2    | 4.31             | 5.04   | 28.48             |
| eval3    | 4.63             | 5.59   | 25.97             |
| excluded | 6.15             | 7.33   | 25.58             |

The terrible CER of (truncate) decode is due to many empty hypotheses, and there are again noticeably more insertions at the end of utterances for decode.

Just to confirm our previous finding, I retried decoding in modified_beam_search too, since that was my decoding method in pruned_transducer_stateless5.

| testset  | streaming_decode | decode | (truncate) decode |
|----------|------------------|--------|-------------------|
| eval1    | 5.89             | 6.88   | 5.73              |
| eval2    | 4.6              | 5.3    | 4.13              |
| eval3    | 4.58             | 5.56   | 4.48              |
| excluded | 6.03             | 8.74   | 5.77              |

Indeed, similar to pruned_transducer_stateless5, truncated decoding yielded the best results when modified_beam_search was used.

Edit: Sorry, to summarise - am I using fast_beam_search wrongly? Do you have any suggestions to improve the results of fast_beam_search?

csukuangfj commented 1 year ago

> Sorry, to summarise - am I using fast_beam_search wrongly? Do you have any suggestions to improve the results of fast_beam_search?

Could you change https://github.com/k2-fsa/icefall/blob/8642dbc0bd4174acb6612b6510f971f98a16f7d3/egs/librispeech/ASR/pruned_transducer_stateless2/beam_search.py#L503 to

 lattice = decoding_streams.format_output(encoder_out_lens.tolist(), allow_partial=True) 

and re-try?

See https://github.com/k2-fsa/k2/blob/master/k2/python/k2/rnnt_decode.py#L170

Note: You need to install the latest version of k2.
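
For anyone following along, here is a minimal self-contained sketch of the k2 decoding loop with the suggested change applied; random log-probs stand in for a real encoder/decoder/joiner, and a trivial graph for the real decoding graph:

```python
# Sketch of the k2 RNN-T decoding loop. allow_partial=True (requires a
# recent k2) makes format_output emit partial hypotheses for streams whose
# best path never reaches a final state, instead of an empty result.
import k2
import torch

vocab_size = 10
num_frames = 5

graph = k2.trivial_graph(vocab_size - 1, device="cpu")
config = k2.RnntDecodingConfig(
    vocab_size=vocab_size,
    decoder_history_len=3,
    beam=4.0,
    max_contexts=8,
    max_states=32,
)
streams = k2.RnntDecodingStreams([k2.RnntDecodingStream(graph)], config)

for _ in range(num_frames):
    shape, contexts = streams.get_contexts()
    # In icefall this would be joiner(encoder_out[t], decoder(contexts)).
    log_probs = torch.randn(contexts.shape[0], vocab_size).log_softmax(-1)
    streams.advance(log_probs)

streams.terminate_and_flush_to_streams()
lattice = streams.format_output([num_frames], allow_partial=True)
print(lattice.shape)  # an FsaVec with one lattice per stream
```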

teowenshen commented 1 year ago

I tried your advice on both padded and non-padded cases.

| padding | allow_partial | eval1 | eval2 | eval3 | excluded |
|---------|---------------|-------|-------|-------|----------|
|         |               | 6.51  | 4.79  | 5.1   | 7.33     |
|         |               | 6.57  | 4.79  | 5.1   | 7.33     |
|         |               | 7.15  | 5.11  | 5.63  | 7.7      |
|         |               | 7.4   | 5.79  | 5.86  | 7.92     |

Under the normal (padded) case, there is not much difference. Also, there is still a noticeable gap from streaming_decode.py's 5.83/4.34/4.51/6.05.

For reference, my decoding command is as below.

python pruned_transducer_stateless7_streaming/decode.py \
  --feedforward-dims "1024,1024,2048,2048,1024" \
  --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_3 \
  --context-size 3 \
  --epoch 30 \
  --avg 8 \
  --max-duration 250 \
  --decoding-method fast_beam_search \
  --manifest-dir data/manifests_dis \
  --word-table lang_char_disfluent/words.txt \
  --transcript-mode disfluent \
  --res-dir pruned_transducer_stateless7_streaming/exp_disfluent_3/issue807 \
  --decode-chunk-len 32 \
  --max-states 32 \
  --max-contexts 8 \
  --gpu 0

Below is my k2 version.

k2 version: 1.23.2
Build type: Release
Git SHA1: a34171ed85605b0926eebbd0463d059431f4f74a
Git date: Wed Dec 14 01:06:38 2022
Cuda used to build k2: 11.3
cuDNN used to build k2: 8.2.0
Python version used to build k2: 3.7
OS used to build k2:
CMake version: 3.18.0
GCC version: 7.5.0