k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Lost end of audio with jit_pretrained_streaming.py #1779

Open kfmn opened 3 weeks ago

kfmn commented 3 weeks ago

Hi,

I trained a streaming zipformer transducer on my data and exported it to JIT with export.py, using specific values of chunk_length and left_context_frames. Then I ran streaming decoding with jit_pretrained_streaming.py, and it seems this script does not decode the final part of the audio.

First of all, there is a misprint in https://github.com/k2-fsa/icefall/blob/f84270c93528f4b77b99ada9ac0c9f7fb231d6a4/egs/librispeech/ASR/zipformer/jit_pretrained_streaming.py#L218: the comment should say 0.25 second, not 0.2.

Next, features are generated chunk-by-chunk and are decoded whenever the condition in https://github.com/k2-fsa/icefall/blob/f84270c93528f4b77b99ada9ac0c9f7fb231d6a4/egs/librispeech/ASR/zipformer/jit_pretrained_streaming.py#L234 is satisfied.

But if, after the last call to greedy_search, this condition is no longer satisfied, all remaining computed features stay unprocessed. As a result, the decoding hypotheses are sometimes truncated.
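To make the failure mode concrete, here is a minimal sketch (not the real script; `T` and `shift` stand in for the script's window size and chunk advance) of how a fixed-size decoding window can leave a tail of feature frames unprocessed:

```python
# Minimal sketch of the streaming decode loop: the encoder is only called
# while a full window of T frames is available, so a tail shorter than T
# is never decoded.

def simulate(num_feature_frames: int, T: int, shift: int) -> int:
    """Return the number of trailing feature frames never passed
    to the encoder.

    T     - frames consumed per encoder call (chunk frames + pad_length)
    shift - frames the window advances per call
    """
    processed = 0
    while num_feature_frames - processed >= T:
        # the real script would run the encoder on
        # features[processed : processed + T] here
        processed += shift
    return num_feature_frames - processed

# With a large chunk (T=141, shift=128), up to 140 trailing frames can be
# left over, far more than a 30-frame tail padding covers:
leftover = simulate(num_feature_frames=500, T=141, shift=128)
```

With these illustrative numbers, 116 frames remain undecoded, of which only 30 would be padding.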

csukuangfj commented 2 weeks ago

First of all there is a misprint in

No, it is correct. You can select an arbitrary positive value for it. Its sole purpose is to simulate how fast the data samples arrive.
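A small sketch of this point (dummy data, no real feature extraction): however the waveform is split when simulating the stream, the feature extractor receives exactly the same samples, so the per-iteration chunk size cannot affect the result.

```python
# Sketch: the per-iteration chunk size only controls how the waveform is
# split when simulating a live stream; the accumulated input is identical.

def feed_in_chunks(samples, chunk):
    received = []
    for i in range(0, len(samples), chunk):
        received.extend(samples[i : i + chunk])  # what the extractor sees
    return received

samples = list(range(16000))  # stand-in for 1 second at 16 kHz
a = feed_in_chunks(samples, int(0.25 * 16000))
b = feed_in_chunks(samples, int(0.2 * 16000))
assert a == b == samples      # same data either way
```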

This value has nothing to do with your model parameters.


all computed features remain unprocessed. As a result, the decoding hypotheses are sometimes truncated

https://github.com/k2-fsa/icefall/blob/f84270c93528f4b77b99ada9ac0c9f7fb231d6a4/egs/librispeech/ASR/zipformer/jit_pretrained_streaming.py#L214

We have tail paddings here. You can use a larger tail padding if you find that the last chunk is not decoded.
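The tail-padding idea amounts to appending silence to the waveform before the final decode, roughly as below (a pure-Python sketch; the real script builds the padding as a torch tensor, and 0.3 s / 16 kHz are the values it uses):

```python
# Sketch of tail padding: append ~0.3 s of silence so the last real
# frames can still satisfy the "enough frames for one chunk" condition.

sample_rate = 16000
wave = [0.1] * (2 * sample_rate)   # stand-in for a 2 s waveform
tail_seconds = 0.3                 # enlarge this if the tail is cut off
wave += [0.0] * int(tail_seconds * sample_rate)
```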


By the way, please provide a concrete example with runnable code/script to reproduce your issue.

If you have only read the code without running it, I suggest running it first and then checking whether what you expect matches the actual result.

kfmn commented 2 weeks ago

I meant two things:

  1. In the line `chunk = int(0.25 * args.sample_rate)` the chunk length is set to 0.25 seconds, but the comment says 0.2 seconds; that is all I meant.
  2. I understand that increasing tail_padding solves the problem of lost frames in decoding. But the hardcoded length of 0.3 seconds does not fit longer decoding chunks. For example, with chunk_size = 64 the chunk corresponds to 128 feature frames, and encoder.pad_length is added on top to obtain T = 141 frames. That is much more than the hardcoded 30 frames of tail_padding, so the last real (non-padded) frames are lost. My suggestion is therefore to make tail_padding depend on, and stay consistent with, chunk_size and encoder.pad_length.
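One way the suggestion could look, as a sketch: derive the tail padding from the decode window instead of hardcoding 0.3 s. All names and the 10 ms frame shift / subsampling factor are assumptions for illustration, not the script's actual API; padding a full window of T frames is a conservative choice that guarantees the last real frames can always form one more chunk.

```python
# Sketch of the proposed fix: size the tail padding from the model's
# decode window rather than a fixed 0.3 s. Illustrative names only.

def tail_padding_seconds(chunk_size: int, pad_length: int,
                         frame_shift_ms: float = 10.0,
                         subsampling: int = 2) -> float:
    """Seconds of silence so that one full window of
    T = chunk_size * subsampling + pad_length frames can always
    be formed after the last real frame."""
    T = chunk_size * subsampling + pad_length
    return T * frame_shift_ms / 1000.0

# chunk_size=64 -> 128 frames; with an assumed pad_length of 13,
# T = 141 frames, i.e. about 1.41 s of padding instead of 0.3 s.
sec = tail_padding_seconds(chunk_size=64, pad_length=13)
```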