k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Fast Text-Only Domain Adaptation of RNN-Transducer Prediction Network #1401

Open AlexandderGorodetski opened 7 months ago

AlexandderGorodetski commented 7 months ago

Hello guys,

In this article https://arxiv.org/pdf/2104.11127.pdf there is a very simple idea for adapting an RNN-T to unseen text.

The authors report a very nice WER improvement (reductions of up to 4% absolute WER).
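As far as I understand, the idea is to fine-tune only the prediction network on in-domain text with an LM-style loss, keeping the encoder and joiner frozen. A minimal sketch of that idea in PyTorch (not icefall code; `predictor` and `lm_head` are hypothetical stand-ins for the prediction network and a projection to the vocabulary):

```python
# Rough sketch of text-only adaptation of the prediction network:
# fine-tune it with a cross-entropy, language-model-style loss on domain
# text while everything else stays frozen.  `predictor` and `lm_head`
# are hypothetical modules, not icefall classes.
import torch
import torch.nn.functional as F


def adapt_predictor_on_text(predictor, lm_head, text_batches,
                            num_steps=1000, lr=1e-4):
    opt = torch.optim.Adam(
        list(predictor.parameters()) + list(lm_head.parameters()), lr=lr
    )
    step = 0
    for tokens in text_batches:  # tokens: (batch, seq_len) int64 token IDs
        # Predict token t from tokens < t, as a language model would.
        inputs, targets = tokens[:, :-1], tokens[:, 1:]
        hidden = predictor(inputs)         # (batch, seq_len - 1, dim)
        logits = lm_head(hidden)           # (batch, seq_len - 1, vocab)
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
        step += 1
        if step >= num_steps:
            break
```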

Has anyone on this forum already implemented this mechanism?

Thanks, AlexG.

danpovey commented 7 months ago

Their method relies on fine-tuning the predictor (I think we call this the decoder), and since in Icefall recipes we use a "stateless" predictor/decoder which sees only (typically) 2 symbols, I doubt it would be very useful to fine-tune it. It might work, but it's hard to say.
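For reference, a simplified sketch of what such a stateless predictor looks like: an embedding followed by a 1-D convolution over only the last context_size symbols, with no recurrent state. This mirrors the general structure of the Decoder used in the icefall recipes, but details (channel grouping, padding index, activations) differ between recipes:

```python
# Simplified sketch of a "stateless" predictor: it can only ever see the
# last `context_size` emitted symbols, so fine-tuning it as an LM on text
# gives it very little domain knowledge to absorb.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StatelessPredictor(nn.Module):
    def __init__(self, vocab_size: int, dim: int, context_size: int = 2):
        super().__init__()
        self.context_size = context_size
        self.embedding = nn.Embedding(vocab_size, dim)
        # Mixes only the last `context_size` label embeddings.
        self.conv = nn.Conv1d(dim, dim, kernel_size=context_size, bias=False)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # y: (batch, num_labels) previously emitted symbols
        emb = self.embedding(y).permute(0, 2, 1)      # (batch, dim, num_labels)
        emb = F.pad(emb, (self.context_size - 1, 0))  # left-pad to keep length
        out = self.conv(emb).permute(0, 2, 1)         # (batch, num_labels, dim)
        return torch.relu(out)
```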

AlexandderGorodetski commented 7 months ago

Dan, thank you for your answer. Currently context_size is indeed 2. Is it possible to increase context_size to 3 or even 4? Maybe you've already tried these experiments? Will k2 support a context_size greater than 2?

yaozengwei commented 7 months ago

We did try using a context-size of 3 or 4 (at least with Zipformer), but could not get improvements.

yunigma commented 6 months ago

I have implemented this type of adaptation for the pruned stateless transducer on GigaSpeech. As Dan wrote, it does not work with this type of predictor. I have also tried training the model with a higher context-size, but it always performs worse than with context-size=2 (a colleague of mine confirmed this observation as well).

Additionally, I have tested this adaptation method with the same model but with the predictor replaced by an LSTM one. In this case the adaptation does seem to work, but the general baseline WER goes up, so even the adapted version is worse than the pruned stateless transducer. My model with the LSTM predictor probably needs more parameter tuning, which I did not do. The conclusion is that this adaptation method works with an LSTM predictor, but the model itself was then weaker.
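For clarity, a rough sketch of the kind of LSTM predictor I mean (not the exact code I used): it carries recurrent state over the whole label history, which is what makes the text-only adaptation effective in the first place.

```python
# Rough sketch of an LSTM-based predictor; unlike the stateless one, it
# conditions on the full label history through its recurrent state.
import torch
import torch.nn as nn


class LSTMPredictor(nn.Module):
    def __init__(self, vocab_size: int, dim: int, num_layers: int = 1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, num_layers=num_layers, batch_first=True)

    def forward(self, y: torch.Tensor, state=None):
        # y: (batch, num_labels) previously emitted symbols
        emb = self.embedding(y)
        out, state = self.lstm(emb, state)  # out: (batch, num_labels, dim)
        return out, state
```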

@AlexandderGorodetski Let me know if you also tried it!

danpovey commented 6 months ago

I suspect the reason the LSTM predictor might not work well is that it doesn't give very good gradients for training the encoder, e.g. they sometimes blow up. If you can implement it, it might be worthwhile to train a model that has both a "stateless" predictor and an LSTM predictor, where the branch using the LSTM predictor sees a detached version of the encoder output, so its gradients are not used to train the encoder.
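A minimal sketch of that two-branch idea, under the assumption that the two predictors feed separate joiners and the two losses are simply summed (`encoder`, `joiner_stateless`, `joiner_lstm` and `rnnt_loss` are hypothetical stand-ins for the corresponding pieces of a transducer recipe):

```python
# Sketch: main RNN-T loss uses the stateless predictor and trains the
# encoder; an auxiliary loss uses the LSTM predictor on a detached
# encoder output, so its gradients never reach the encoder.
import torch


def combined_loss(encoder, stateless_pred, lstm_pred,
                  joiner_stateless, joiner_lstm, rnnt_loss,
                  feats, feat_lens, labels, label_lens):
    enc_out = encoder(feats, feat_lens)

    # Main branch: gradients flow into the encoder.
    pred_s = stateless_pred(labels)
    loss_main = rnnt_loss(joiner_stateless(enc_out, pred_s),
                          labels, feat_lens, label_lens)

    # Auxiliary branch: LSTM predictor on detached encoder output.
    pred_l, _ = lstm_pred(labels)
    loss_aux = rnnt_loss(joiner_lstm(enc_out.detach(), pred_l),
                         labels, feat_lens, label_lens)

    return loss_main + loss_aux
```

The LSTM predictor could then be fine-tuned on text afterwards, while decoding uses whichever joiner/predictor pair works better for the target domain.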