k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

questions about cutting off frames in conformer chunk forward #867

Open Cescfangs opened 1 year ago

Cescfangs commented 1 year ago

https://github.com/k2-fsa/icefall/blob/e36ea89112bb3d81602cb4df51bd68e6d06dc150/egs/wenetspeech/ASR/pruned_transducer_stateless5/conformer.py#L340-L346

As I see it, the encoder_embed module works like this (setting input len = 15, for example): [figure: cut_off]

If we cut off two frames from the output, 6 frames of input will not be used; could there be any information loss?

yaozengwei commented 1 year ago

We cut off the first and the last output frames during streaming forward because we have padding=1 in the first conv layer of the Conv2dSubsampling module: https://github.com/k2-fsa/icefall/blob/e36ea89112bb3d81602cb4df51bd68e6d06dc150/egs/wenetspeech/ASR/pruned_transducer_stateless5/conformer.py#L1464-L1469

When taking a chunk of frames, we pad an extra 2 * subsampling_factor frames to account for this: https://github.com/k2-fsa/icefall/blob/e36ea89112bb3d81602cb4df51bd68e6d06dc150/egs/wenetspeech/ASR/pruned_transducer_stateless5/decode_stream.py#L78

So it does not cause any information loss. You can refer to decode_stream.py to see how we get the chunks: https://github.com/k2-fsa/icefall/blob/e36ea89112bb3d81602cb4df51bd68e6d06dc150/egs/wenetspeech/ASR/pruned_transducer_stateless5/decode_stream.py#L121
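
As a rough illustration, here is a toy 1-D stand-in for the subsampling stack (an assumption based on this thread: a kernel-3/stride-1/padding-1 conv followed by two kernel-3/stride-2 convs without padding, which matches the counting used below); it shows that only the first and the last output frames have a receptive field reaching the zero padding of the first conv, which is why exactly those two are cut:

```python
import torch
import torch.nn as nn

# Toy 1-D stand-in for the Conv2dSubsampling stack discussed above
# (an assumption: conv k=3/s=1/p=1, then two convs k=3/s=2/p=0).
torch.manual_seed(0)
stack = nn.Sequential(
    nn.Conv1d(1, 1, kernel_size=3, stride=1, padding=1),
    nn.Conv1d(1, 1, kernel_size=3, stride=2),
    nn.Conv1d(1, 1, kernel_size=3, stride=2),
)

T = 19                                   # (C + 2) * 4 + 3 with C = 2
x = torch.randn(1, 1, T, requires_grad=True)
y = stack(x)                             # shape (1, 1, C + 2) = (1, 1, 4)

for i in range(y.shape[-1]):
    x.grad = None
    y[0, 0, i].backward(retain_graph=True)
    used = x.grad[0, 0].nonzero().flatten()
    print(i, used.min().item(), used.max().item())
# Output frame 0 reaches input index 0 (its window also covers the left
# zero pad of the first conv) and the last output frame reaches T - 1
# (its window also covers the right zero pad); the inner output frames
# see only real input frames.
```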

Cescfangs commented 1 year ago

Thanks for the reply. I understand the first conv module needs extra input frames for padding. Taking chunk_size (after subsampling) = 1 for simplicity, I think ((1 * 2) + 1) * 2 + 1 + 2 (padding for the first conv) = 9 frames is enough, so it could work like this:

  1. first chunk: input 8 frames (left zero padding is OK, I think), output the first frame
  2. add another 4 frames to get the 2nd output frame
  3. ...

and there's no need to cut off (a small sanity check of this arithmetic is sketched below); am I making a mistake here?
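
For reference, the counting can be checked with the same kind of toy 1-D stand-in for the subsampling stack, this time with the zero padding of the first conv disabled (the "+2" then comes from feeding real neighbouring frames instead of zeros); this is only a sanity check of the arithmetic, not icefall code:

```python
import torch
import torch.nn as nn

# Toy stand-in with padding disabled in the first conv: the two extra
# frames are supplied explicitly instead of as zero padding.
stack = nn.Sequential(
    nn.Conv1d(1, 1, kernel_size=3, stride=1, padding=0),
    nn.Conv1d(1, 1, kernel_size=3, stride=2),
    nn.Conv1d(1, 1, kernel_size=3, stride=2),
)

def n_out(T: int) -> int:
    return stack(torch.zeros(1, 1, T)).shape[-1]

print(n_out(9))    # 1 -> ((1 * 2) + 1) * 2 + 1 + 2 = 9 frames for the 1st output
print(n_out(13))   # 2 -> 4 more input frames per additional output frame
print(n_out(17))   # 3
```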

pkufool commented 1 year ago

@Cescfangs To make the behavior the same as at training time, I think we have to do the cutting off: for each chunk, the last output frame (I mean the top layer in your graph) sees the padding value, which is different from training time; during training the internal chunks never see padding.

yaozengwei commented 1 year ago

> Thanks for the reply. I understand the first conv module needs extra input frames for padding. Taking chunk_size (after subsampling) = 1 for simplicity, I think ((1 * 2) + 1) * 2 + 1 + 2 (padding for the first conv) = 9 frames is enough, so it could work like this:

Suppose we want to get C frames of output after subsampling: we feed (C + 2) * 4 + 3 input frames and get (C + 2) output frames. Then we drop the first one and the last one, since they see the zero padding (in the first conv).
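
Using the same toy stand-in as above (padding=1 in the first conv; an assumption consistent with the thread), this counting checks out: (C + 2) * 4 + 3 input frames yield C + 2 subsampled frames, of which the outer two are dropped:

```python
import torch
import torch.nn as nn

# Toy stand-in for the padded subsampling stack (same assumption as above).
stack = nn.Sequential(
    nn.Conv1d(1, 1, kernel_size=3, stride=1, padding=1),
    nn.Conv1d(1, 1, kernel_size=3, stride=2),
    nn.Conv1d(1, 1, kernel_size=3, stride=2),
)

def n_out(T: int) -> int:
    return stack(torch.zeros(1, 1, T)).shape[-1]

for C in (1, 8, 16):
    T = (C + 2) * 4 + 3
    print(C, T, n_out(T))   # n_out(T) == C + 2; dropping the first and
                            # last subsampled frames leaves C frames
```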

Cescfangs commented 1 year ago

> Thanks for the reply. I understand the first conv module needs extra input frames for padding. Taking chunk_size (after subsampling) = 1 for simplicity, I think ((1 * 2) + 1) * 2 + 1 + 2 (padding for the first conv) = 9 frames is enough, so it could work like this:
>
>   1. first chunk: input 8 frames (left zero padding is OK, I think), output the first frame
>   2. add another 4 frames to get the 2nd output frame
>   3. ...
>
> and there's no need to cut off; am I making a mistake here?

@pkufool @yaozengwei Thanks for the explanation, I get your point now. However, this padding & cutting off introduces another (params.right_context + 2) * params.subsampling_factor + 3 frames of delay (110 ms for zero right context) and some redundant computation. My idea is to disable the padding of the first conv at inference time and manually keep 1 history and 1 future frame in its place; in that case we only introduce 1 extra frame of delay and the cutting off is no longer needed, I guess? Or just don't use padding in the training stage; I don't think it will cause any performance degradation.
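
Rough bookkeeping of the delay mentioned above (assuming a 10 ms frame shift and subsampling_factor = 4; the names just mirror the comment and are not taken from the icefall code):

```python
# Extra look-ahead implied by the padding & cut-off scheme, in input frames,
# converted to milliseconds under a 10 ms frame shift (an assumption).
subsampling_factor = 4
frame_shift_ms = 10

def extra_delay_ms(right_context: int) -> int:
    extra_frames = (right_context + 2) * subsampling_factor + 3
    return extra_frames * frame_shift_ms

print(extra_delay_ms(0))   # 110 ms for zero right context
print(extra_delay_ms(2))   # grows with the right context
```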

pkufool commented 1 year ago

@Cescfangs Yes, you are right, removing the padding in the training stage is a good idea; we do this padding and cutting off just to keep our code backward compatible. We have not released a version without padding yet, though.

Cescfangs commented 1 year ago

> @Cescfangs Yes, you are right, removing the padding in the training stage is a good idea; we do this padding and cutting off just to keep our code backward compatible. We have not released a version without padding yet, though.

One more question: according to my graph, the first 3 frames of the first chunk are never used?