teowenshen opened 1 year ago
Could you compare the errs-* file and see if there are any error patterns, e.g., are there many deletion errors at the end of an utterance?
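A quick way to tally such patterns is sketched below. It assumes icefall-style errs-* files in which the per-utterance alignment marks substitutions as (ref->hyp), deletions as (ref->*), and insertions as (*->hyp); the file name and the 3-token "tail" window are just illustrative choices.

```python
import re
from collections import Counter

# Matches aligned error tokens such as (ref->hyp), (ref->*), (*->hyp).
ERR = re.compile(r"\((?P<ref>[^->]*)->(?P<hyp>[^)]*)\)")

def edge_errors(path: str, tail: int = 3) -> Counter:
    """Count errors by kind and by position (utterance body vs. tail)."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if ":" not in line:
                continue  # not a per-utterance alignment line
            _, ali = line.split(":", 1)
            tokens = ali.split()
            for i, tok in enumerate(tokens):
                m = ERR.fullmatch(tok)
                if m is None:
                    continue  # correctly recognized token
                kind = ("del" if m["hyp"] == "*"
                        else "ins" if m["ref"] == "*"
                        else "sub")
                pos = "tail" if i >= len(tokens) - tail else "body"
                counts[f"{kind}/{pos}"] += 1
    return counts

# e.g. many "del/tail" entries would indicate deletions at utterance ends.
print(edge_errors("errs-eval1-fast_beam_search.txt"))
```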
Just for the record, I am comparing using the default settings of streaming_decode and decode. Please let me know if this comparison doesn't make sense. All numbers are CERs (%); in the third column, (truncate) decode, I removed the padding in decode.py (a sketch of what I mean by that follows the first set of results below).
testset | streaming_decode | decode | (truncate) decode |
---|---|---|---|
eval1 | 5.91 | 6.35 | 7.04 |
eval2 | 4.29 | 4.73 | 5.69 |
eval3 | 4.55 | 5.16 | 6.04 |
excluded | 6.11 | 7.24 | 7.86 |
There are noticeably more insertions at the end of utterances in the decode case. This time, however, removing the padding made the CER worse, due to deletions at the end of utterances and to empty hypotheses.
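To make the "(truncate)" column concrete: decode.py pads the tail of each utterance before running the encoder, and the truncated variant skips that step. A minimal sketch of the difference; LOG_EPS and pad_len are stand-ins, not the recipe's exact names or values.

```python
import math
import torch

LOG_EPS = math.log(1e-10)  # assumed pad value for log-mel features
pad_len = 32               # assumed tail padding, e.g. one decode chunk

feature = torch.randn(2, 100, 80)       # (batch, time, feat) dummy batch
feature_lens = torch.tensor([100, 80])

# What decode.py normally does: pad extra frames at the end so the
# encoder sees a complete final chunk.
padded = torch.nn.functional.pad(
    feature, pad=(0, 0, 0, pad_len), value=LOG_EPS
)
padded_lens = feature_lens + pad_len

# The "(truncate) decode" variant skips the two steps above and feeds
# the original feature / feature_lens straight to the encoder.
```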
When compared using the same decoding arguments:
testset | streaming_decode | decode | (truncate) decode |
---|---|---|---|
eval1 | 5.7 | 6.47 | 30.24 |
eval2 | 4.31 | 5.04 | 28.48 |
eval3 | 4.63 | 5.59 | 25.97 |
excluded | 6.15 | 7.33 | 25.58 |
The terrible CER of (truncate) decode is due to many empty hypotheses. Again, there are noticeably more insertions at the end of utterances for decode.
Just to confirm our previous finding, I retried decoding with modified_beam_search too, since that was my decoding method in pruned_transducer_stateless5.
testset | streaming_decode | decode | (truncate) decode |
---|---|---|---|
eval1 | 5.89 | 6.88 | 5.73 |
eval2 | 4.6 | 5.3 | 4.13 |
eval3 | 4.58 | 5.56 | 4.48 |
excluded | 6.03 | 8.74 | 5.77 |
Indeed, similar to pruned_transducer_stateless5, truncated decoding yielded the best results when modified_beam_search is used.
Edit:
Sorry, to summarise - am I using fast_beam_search wrongly? Do you have any suggestions to improve the results of fast_beam_search?
Could you change https://github.com/k2-fsa/icefall/blob/8642dbc0bd4174acb6612b6510f971f98a16f7d3/egs/librispeech/ASR/pruned_transducer_stateless2/beam_search.py#L503 to

```python
lattice = decoding_streams.format_output(encoder_out_lens.tolist(), allow_partial=True)
```

and re-try?
See https://github.com/k2-fsa/k2/blob/master/k2/python/k2/rnnt_decode.py#L170
Note: You need to install the latest version of k2.
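For context, that call sits at the very end of the fast_beam_search loop. A condensed sketch of the tail of that function (simplified, not the verbatim icefall code):

```python
import k2
from icefall.utils import get_texts  # helper: best path -> token IDs

def finish_decoding(decoding_streams: "k2.RnntDecodingStreams",
                    encoder_out_lens: "torch.Tensor") -> list:
    """Tail of the fast_beam_search loop (condensed sketch)."""
    decoding_streams.terminate_and_flush_to_streams()
    # allow_partial=True: a stream that never reaches a final state of the
    # decoding graph treats the states on its last frame as final, so the
    # lattice keeps a partial hypothesis instead of coming back empty.
    lattice = decoding_streams.format_output(
        encoder_out_lens.tolist(), allow_partial=True
    )
    best_path = k2.shortest_path(lattice, use_double_scores=True)
    return get_texts(best_path)
```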
I tried your advice on both padded and non-padded cases.
padding | allow_partial | eval1 | eval2 | eval3 | excluded |
---|---|---|---|---|---|
✅ | ✅ | 6.51 | 4.79 | 5.1 | 7.33 |
✅ | ❌ | 6.57 | 4.79 | 5.1 | 7.33 |
❌ | ✅ | 7.15 | 5.11 | 5.63 | 7.7 |
❌ | ❌ | 7.4 | 5.79 | 5.86 | 7.92 |
In the normal (padded) case, there is not much difference. Also, there is still a noticeable gap from streaming_decode.py's 5.83/4.34/4.51/6.05.
For reference, my decoding command is as below.
```bash
python pruned_transducer_stateless7_streaming/decode.py \
  --feedforward-dims "1024,1024,2048,2048,1024" \
  --exp-dir pruned_transducer_stateless7_streaming/exp_disfluent_3 \
  --context-size 3 \
  --epoch 30 \
  --avg 8 \
  --max-duration 250 \
  --decoding-method fast_beam_search \
  --manifest-dir data/manifests_dis \
  --word-table lang_char_disfluent/words.txt \
  --transcript-mode disfluent \
  --res-dir pruned_transducer_stateless7_streaming/exp_disfluent_3/issue807 \
  --decode-chunk-len 32 \
  --max-states 32 \
  --max-contexts 8 \
  --gpu 0
```
Below is my k2 version.
```
k2 version: 1.23.2
Build type: Release
Git SHA1: a34171ed85605b0926eebbd0463d059431f4f74a
Git date: Wed Dec 14 01:06:38 2022
Cuda used to build k2: 11.3
cuDNN used to build k2: 8.2.0
Python version used to build k2: 3.7
OS used to build k2:
CMake version: 3.18.0
GCC version: 7.5.0
```
I have a Japanese model using the streaming Zipformer trained with the command below, and I noticed a (~1.0%) difference between my decoding results with decode.py and streaming_decode.py.
I just noticed that the default values in decode.py and streaming_decode.py are different, so the gap is probably not related to the additions in the new recipe (?).
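One quick way to see exactly which defaults differ is to diff the two argument parsers. A sketch, assuming both scripts expose get_parser() as icefall recipe scripts typically do (reading the private _actions attribute is an introspection shortcut, and running this requires the icefall environment since the scripts import torch, k2, etc.):

```python
import importlib.util
import sys

RECIPE = "pruned_transducer_stateless7_streaming"
sys.path.insert(0, RECIPE)  # the scripts import sibling modules by name

def parser_defaults(filename: str) -> dict:
    """Load a recipe script and return {arg_name: default_value}."""
    spec = importlib.util.spec_from_file_location(
        filename[:-3], f"{RECIPE}/{filename}"
    )
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return {a.dest: a.default for a in mod.get_parser()._actions}

d = parser_defaults("decode.py")
s = parser_defaults("streaming_decode.py")
for key in sorted(d.keys() & s.keys()):
    if d[key] != s[key]:
        print(f"{key}: decode.py={d[key]!r}  streaming_decode.py={s[key]!r}")
```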
Below are my decoding commands. The default values are the same as the original; I have only exposed params.res_dir and added params.gpu.

Also, what do beam and max-states mean in fast_beam_search? If I'm not mistaken, max-contexts limits the number of contexts returned by k2.RnntDecodingStreams.get_contexts() per utterance?
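For reference, those flags map directly onto k2.RnntDecodingConfig. A sketch with best-effort annotations based on my reading of k2/python/k2/rnnt_decode.py (the vocab_size and flag values are just example numbers, not recommendations):

```python
import k2

# How fast_beam_search's flags feed into k2's decoding configuration.
config = k2.RnntDecodingConfig(
    vocab_size=500,          # size of the output token vocabulary (example)
    decoder_history_len=2,   # context size of the stateless decoder
    beam=20.0,               # score beam: per frame, states scoring more
                             # than `beam` below the best state are pruned
    max_states=32,           # cap on active decoding-graph states kept
                             # per stream per frame after pruning
    max_contexts=8,          # cap on distinct decoder contexts kept per
                             # stream per frame, i.e. what
                             # RnntDecodingStreams.get_contexts() returns
)
```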