kaldi-asr / kaldi

kaldi-asr/kaldi is the official location of the Kaldi project.
http://kaldi-asr.org

History state info lost in the nnet computer in cuda-decoder, which causes accuracy to decrease #4189

Open housebaby opened 4 years ago

housebaby commented 4 years ago

In the cuda-decoder, it seems that if an RNN model is used, the NnetComputer is re-initialized for each chunk. That's fine for a non-recurrent network, but for a recurrent network it means the history information (like the cell state of an LSTM) is lost. Although we can recover the accuracy by appending left context (e.g. 40 frames), the RealTimeX decreases and the latency increases, which can make real-time decoding impossible, since the latency may become larger than the chunk size. Is it possible to use the previous chunk's LSTM state to initialize the current NnetComputer, like DecodableAmNnetLoopedOnline does?
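To make the state loss concrete, here is a toy C++ sketch (not Kaldi code; `RecurrentState` and `StepFrame` are invented stand-ins) showing that re-creating the recurrent state for every chunk yields different outputs than carrying it across chunks:

```cpp
#include <iostream>
#include <vector>

// Stand-in for one recurrent layer's state (e.g. an LSTM's c_t and h_t).
struct RecurrentState {
  float cell = 0.0f;
  float hidden = 0.0f;
};

// Stand-in for one frame of forward computation: the output depends on
// both the current input and the accumulated state.
float StepFrame(float input, RecurrentState *state) {
  state->cell = 0.5f * state->cell + input;  // toy recurrence
  state->hidden = state->cell;
  return state->hidden;
}

int main() {
  const std::vector<std::vector<float>> chunks = {{1, 1, 1}, {1, 1, 1}};

  // What the cuda-decoder pipeline does: fresh state per chunk, so the
  // second chunk repeats the first chunk's outputs instead of continuing.
  for (const auto &chunk : chunks) {
    RecurrentState state;  // history is lost here
    for (float f : chunk) std::cout << StepFrame(f, &state) << " ";
  }
  std::cout << "\n";

  // What a looped decodable does: one state carried across all chunks.
  RecurrentState state;
  for (const auto &chunk : chunks)
    for (float f : chunk) std::cout << StepFrame(f, &state) << " ";
  std::cout << "\n";
  return 0;
}
```

The per-chunk loop prints the first chunk's outputs twice, while the carried-state loop keeps converging; that lost history is exactly what appending extra left context tries to approximate.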

|  | #1 | #2 | #3 | #4 | #5 | #6 | #7 | #8 | baseline (online2-wav-nnet3-latgen-faster) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| frames-per-chunk | 12 | 21 | 21 | 6 | 21 | 21 | 12 | 30 | — |
| extra-left-context | 0 | 0 | 40 | 40 | 140 | 40 | 40 | 40 | 0 |
| cuda-use-tensor-cores | T | T | T | T | T | F | F | F | F |
| RealTimeX | 391.8295 | 389.701 | 347.0624 | 146.4589 | 176.2909 | 275.510625 | 189.78275 | 330.1645 | — |
| Latencies | 0.09675 | 0.124875 | 0.445125 | 4.870625 | 3.51 | 1.26 | 3.104 | 0.6005 | — |
| Latencies-90% | 0.157 | 0.188875 | 0.65275 | 7.1775 | 5.06775 | 1.840625 | 4.5725 | 0.8495 | — |
| Latencies-95% | 0.207625 | 0.203875 | 0.724375 | 8.114875 | 5.773375 | 2.059 | 5.162125 | 0.932375 | — |
| Latencies-99% | 0.269 | 0.223875 | 0.875125 | 9.986 | 7.023 | 2.51075 | 6.35875 | 1.1115 | — |
| Character accuracy (字准) | 90.04% | 92.30% | 94.77% | 94.01% | 94.71% | 94.81% | 94.48% | 94.69% | 94.99% |
| Sentence accuracy (句准) | 71.34% | 77.31% | 82.95% | 81.12% | 82.85% | 83.00% | 82.22% | 82.83% | 83.40% |

Is it possible to make the cuda decoder work in a looped way? @hugovbraun @danpovey

hugovbraun commented 4 years ago

You're right, the current neural net context switch mechanism of the online pipeline has been designed for CNN-based networks.

Regarding relying on the inner state of a looped computer, we cannot really do that because two batches are always different. Batch 1 may contain chunks from utt2, utt7, and utt4, while batch 2 may contain chunks from utt3, utt4 and utt7. The inner state is per batch slot and everything would get mixed up.

We could have a version for RNN-based models, by storing/restoring the inner output of the LSTM cells. We would just need a way to get those tensors, something like computer->GetInnerOutput() or GetLSTMCell(). Or maybe we can just do the context switch using GetOutput() and SetInput() if the RNN is not compiled as a loop? @danpovey what do you think?
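For illustration, a minimal sketch of that GetOutput()/SetInput() round trip, assuming a hypothetical wrapper that exposes each recurrent node's state as named inputs/outputs (none of these names exist in Kaldi today):

```cpp
#include <map>
#include <string>
#include <vector>

using Tensor = std::vector<float>;  // stand-in for a CuMatrix/CuVector

// Hypothetical wrapper: the real NnetComputer has no such named-state API.
struct StateAwareComputer {
  std::map<std::string, Tensor> io;  // toy storage for named tensors
  void SetInput(const std::string &name, const Tensor &t) { io[name] = t; }
  Tensor GetOutput(const std::string &name) { return io[name]; }
  void Run() { /* the unrolled forward pass would run here */ }
};

// Saved recurrent tensors for one utterance, keyed by node name,
// e.g. "lstm1.c" and "lstm1.h"; pre-seed it with zeroed entries.
using UttState = std::map<std::string, Tensor>;

void ProcessChunk(StateAwareComputer *computer, const Tensor &features,
                  UttState *state) {
  computer->SetInput("input", features);
  // Restore the previous chunk's recurrent state (zeros for chunk 0).
  for (const auto &kv : *state)
    computer->SetInput(kv.first + "_in", kv.second);
  computer->Run();
  // Save this chunk's final recurrent state for the next call.
  for (auto &kv : *state)
    kv.second = computer->GetOutput(kv.first + "_out");
}
```

The design point is that the network itself stays non-looped: the recurrence is cut at chunk boundaries and stitched back together by the pipeline.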

housebaby commented 4 years ago


@hugovbraun Now I see, thank you very much. Do you have any plans to support a looped mode for RNNs in the cuda decoder?

danpovey commented 4 years ago

Regarding the RNN stuff: sorry, I don't have the bandwidth to work on that. It's quite complicated, and I'm spending what energy I have on next-gen stuff, e.g. the k2 project.


hugovbraun commented 4 years ago

Are you moving away from RNN-type models completely? Or are you just saying that nnet3 will soon be deprecated and we should do that work on the next-gen stuff instead (e.g. with PyTorch running the neural net, or something else)?

danpovey commented 4 years ago

I'm not against RNNs, it's just that I don't have the bandwidth right now to handle the stuff required to do RNNLMs efficiently on GPU, and also work on the next-gen stuff (yes, that will involve pytorch for the neural net).


hugovbraun commented 4 years ago

I understand. We could actually look at working on it ourselves; it's just a matter of knowing what to do. The context-switch mechanism for RNNs would be fairly straightforward, but the big question is whether we should make it compatible with nnet3, or whether nnet3 is going to be outdated soon. @housebaby it wouldn't be a loop mode, because that would require a static batch slot per audio channel; long story short, it would run with batch size 1. However, we can have the exact same RNN network run in non-loop mode and add a context-switch mechanism on top of it, similar to what we do with CNNs but in an RNN-friendly way. It would be transparent to the user.
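A rough sketch of what such a per-channel context switch could look like (all names hypothetical, not actual Kaldi API; a real implementation would keep the tensors in device memory), gathering each channel's saved recurrent state before a batch and scattering it back afterwards:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

using Tensor = std::vector<float>;             // stand-in for device memory
struct ChannelState { Tensor cell, hidden; };  // one layer shown for brevity

class RnnContextSwitcher {
 public:
  // Before launching a batch: load each channel's saved state into its
  // (temporary) batch slot; a channel seen for the first time gets an
  // empty/zero state via operator[].
  std::vector<ChannelState> Gather(const std::vector<int32_t> &channels) {
    std::vector<ChannelState> slots;
    slots.reserve(channels.size());
    for (int32_t ch : channels) slots.push_back(saved_[ch]);
    return slots;
  }

  // After the batch: write each slot's final state back to its channel,
  // so the next batch may place that channel in any slot.
  void Scatter(const std::vector<int32_t> &channels,
               const std::vector<ChannelState> &slots) {
    for (size_t i = 0; i < channels.size(); ++i)
      saved_[channels[i]] = slots[i];
  }

 private:
  std::unordered_map<int32_t, ChannelState> saved_;
};
```

Because the state is keyed by channel rather than by batch slot, utt4 can land in slot 2 of one batch and slot 5 of the next without anything getting mixed up.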