Cescfangs opened 1 year ago
We cut off the first and the last frames during streaming forward, since we have padding=1 in the first conv layer of the Conv2dSubsampling module:
https://github.com/k2-fsa/icefall/blob/e36ea89112bb3d81602cb4df51bd68e6d06dc150/egs/wenetspeech/ASR/pruned_transducer_stateless5/conformer.py#L1464-L1469
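For reference, the frame counts in this thread all follow from the standard conv output-length formula. A minimal sketch, assuming two kernel-3 / stride-2 convs along time with padding=1 only in the first (check the linked code for the exact configuration):

```python
def conv_out_len(n: int, kernel: int = 3, stride: int = 2, padding: int = 0) -> int:
    """Standard output length of a conv along the time axis."""
    return (n + 2 * padding - kernel) // stride + 1

def subsampled_len(n: int) -> int:
    # first conv: padding=1 (the padding discussed above);
    # second conv: no padding along time (an assumption here)
    return conv_out_len(conv_out_len(n, padding=1), padding=0)
```

For example, `subsampled_len(15)` gives 3 output frames, matching the `input len = 15` example later in this thread.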
When getting a chunk of frames, we pad an extra 2 * subsampling_factor frames for this:
https://github.com/k2-fsa/icefall/blob/e36ea89112bb3d81602cb4df51bd68e6d06dc150/egs/wenetspeech/ASR/pruned_transducer_stateless5/decode_stream.py#L78
So it does not cause information loss.
You could refer to decode_stream.py to see how we get chunks:
https://github.com/k2-fsa/icefall/blob/e36ea89112bb3d81602cb4df51bd68e6d06dc150/egs/wenetspeech/ASR/pruned_transducer_stateless5/decode_stream.py#L121
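The chunking described above can be sketched roughly like this. This is a simplified illustration with hypothetical names, not the actual decode_stream.py logic (which also handles tail frames and the extra frames needed by the conv kernel; see the link above):

```python
def next_chunk(features, cursor, chunk_size, subsampling_factor=4):
    # each chunk takes chunk_size * subsampling_factor frames plus
    # 2 * subsampling_factor extra frames -- the frames whose outputs
    # are later cut off after subsampling
    need = (chunk_size + 2) * subsampling_factor
    chunk = features[cursor : cursor + need]
    # the cursor only advances by the un-padded part, so consecutive
    # chunks overlap by the extra padding region
    return chunk, cursor + chunk_size * subsampling_factor
```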
Thanks for the reply. I understand the first conv module needs extra input frames for padding. Taking chunk_size (after subsampling) = 1 for simplicity, I think ((1 * 2) + 1) * 2 + 1 + 2 (padding for the first conv) = 9 frames is enough, so it could work like this:

(graph not shown)

and there's no need to cut off. Do I make some mistake here?
@Cescfangs To make the behavior the same as at training time, I think we have to do the cutting off: for each chunk, the last frame of output (I mean the top layer in your graph) sees the padding value, which is different from training time; during training the internal chunks never see padding.
Suppose we want to get C frames of output after subsampling: we feed (C + 2) * 4 + 3 input frames and get (C + 2) output frames. Then we drop the first one and the last one, since they see zero padding (in the first conv).
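The arithmetic above can be sanity-checked with the conv output-length formula, again assuming kernel-3 / stride-2 convs with padding=1 in the first conv only:

```python
def out_len(n, padding):
    # kernel 3, stride 2
    return (n + 2 * padding - 3) // 2 + 1

for C in range(1, 6):
    fed = (C + 2) * 4 + 3
    got = out_len(out_len(fed, padding=1), padding=0)
    # C + 2 output frames; dropping the first and last leaves C usable ones
    assert got == C + 2
```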
> taking chunk_size (after subsampling) = 1 for simplicity, I think ((1 * 2) + 1) * 2 + 1 + 2 (padding for the first conv) = 9 frames is enough

To elaborate, it could work like this:
- first chunk: input 8 frames (zero left padding is OK I think), output the first frame
- add another 4 frames to get the 2nd output frame
- ...
and there's no need to cut off. Do I make some mistake here?
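The 9-frame count above checks out if both convs run without padding (kernel 3, stride 2 assumed): 7 frames is exactly the receptive field of one subsampled output, and the other 2 frames are the history/future context standing in for the removed zero padding:

```python
def out_len_np(n: int) -> int:
    # kernel 3, stride 2, no padding in either conv (the proposal above)
    return (n - 3) // 2 + 1

# receptive field of one subsampled frame: ((1 * 2) + 1) * 2 + 1 = 7 inputs
assert out_len_np(out_len_np(7)) == 1
# plus 1 history + 1 future frame replacing the first conv's zero padding
assert 7 + 2 == 9
```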
@pkufool @yaozengwei
Thanks for the explanation, I got your point.
However, this padding & cutting off introduces another (params.right_context + 2) * params.subsampling_factor + 3 frames of delay (110ms for zero right context) and some redundant computation. My idea is to disable the padding of the first conv at inference time and manually keep 1 history and 1 future frame for padding; in this case we only introduce 1 extra frame of delay and the cutting off is no longer needed, I guess? Or just don't use padding in the training stage; I don't think it will cause performance degradation.
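The 110 ms figure follows directly from the formula, assuming the usual 10 ms frame shift:

```python
def extra_delay_ms(right_context, subsampling_factor=4, frame_shift_ms=10):
    # look-ahead frames required by the pad-and-cut scheme
    frames = (right_context + 2) * subsampling_factor + 3
    return frames * frame_shift_ms

print(extra_delay_ms(0))  # 110 ms for zero right context
```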
@Cescfangs Yes, you are right; removing the padding in the training stage is a good idea. We do these padding and cutting-off things just to keep our code backward compatible. We have not released a version without padding, though.
One more question: according to my graph, the first 3 frames of the first chunk are never used?
https://github.com/k2-fsa/icefall/blob/e36ea89112bb3d81602cb4df51bd68e6d06dc150/egs/wenetspeech/ASR/pruned_transducer_stateless5/conformer.py#L340-L346
As I see it, the encoder_embed module works like that (setting input len = 15 for example): if we cut off two frames from the output, 6 frames of input will not be used. Could there be any information loss?