k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

LM shallow fusion and LODR in aishell #920

Closed li563042811 closed 1 year ago

li563042811 commented 1 year ago

Hi, when will you support modified beam search with LM shallow fusion and LODR for Chinese corpora such as aishell?

csukuangfj commented 1 year ago

Hi, when will you support modified beam search with LM shallow fusion and LODR for Chinese corpora such as aishell?

We already have the code. I think it can be reused for aishell.

li563042811 commented 1 year ago

For librispeech, the RNNLM can be trained with a BPE dictionary, but for aishell it is best trained at the char level. The RNNLM training code in https://github.com/k2-fsa/icefall/tree/master/icefall/rnn_lm is for BPE, and it is problematic to use it to train an RNNLM that shares the token table of a well-trained ASR model.

marcoyang1998 commented 1 year ago

When will you support modified beam search with LM shallow fusion and LODR for Chinese corpora such as aishell?

This is on our schedule.

The first step will be preparing the training data at the char level. Once we figure out the best data format, we can reuse the RNNLM training code.

I will update here as soon as we make some progress.

li563042811 commented 1 year ago

Thank you for getting back to me. I also tried to break the binding between the ASR and LM token tables so that LM training can be more flexible: I map the token list to a char list with the ASR token table, then map that char list back to a token list with the LM token table, so the result can be used for LM inference. However, this remapping adds a lot of decoding time, so I think training the LM with the same token table is still the best option for LM fusion.
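Concretely, the remapping I tried looks roughly like the sketch below (helper names and file paths are my own, not the actual code; it assumes tokens.txt files with one "symbol id" pair per line):

```python
# Sketch of the two-step remapping between ASR and LM token tables.

def load_token_table(path: str) -> dict:
    """Map symbol -> id from a tokens.txt file ("symbol id" per line)."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            sym, idx = line.split()
            table[sym] = int(idx)
    return table

asr_table = load_token_table("asr_tokens.txt")  # hypothetical paths
lm_table = load_token_table("lm_tokens.txt")

# Invert the ASR table so we can go from id -> char.
asr_id2sym = {idx: sym for sym, idx in asr_table.items()}

def asr_ids_to_lm_ids(asr_ids: list) -> list:
    """ASR token ids -> chars -> LM token ids. This runs for every
    hypothesis at every decoding step, which is where the extra
    decoding time comes from."""
    chars = [asr_id2sym[i] for i in asr_ids]
    return [lm_table.get(c, lm_table["<unk>"]) for c in chars]
```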

li563042811 commented 1 year ago

I modified the generation process of the Chinese language model training corpus. Instead of using the sentencepiece model for the mapping, I changed it to use the specified tokens.txt for char-level mapping. However, aishell has too little training text, so the RNNLM tends to overfit: the perplexity of the trained RNN language model on the test-set text is around 70, and shallow fusion does not reduce the CER. If more training text (aishell2) is added, the perplexity decreases significantly, and the CER reduction after shallow fusion becomes pronounced (a 9% relative decrease).
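In case it is useful, the char-level mapping amounts to something like the following sketch (the paths and the `<unk>` handling are assumptions on my side; it replaces the sentencepiece encode step):

```python
# Sketch of char-level corpus preparation with a fixed tokens.txt
# (illustrative only, not the exact script).

def load_token_table(path: str) -> dict:
    """Map symbol -> id from a tokens.txt file ("symbol id" per line)."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            sym, idx = line.split()
            table[sym] = int(idx)
    return table

token_table = load_token_table("data/lang_char/tokens.txt")  # hypothetical path
unk_id = token_table["<unk>"]

def encode_line(line: str) -> list:
    """Split a Chinese sentence into chars and map each char to its id,
    so the LM shares the ASR model's token table."""
    return [token_table.get(ch, unk_id) for ch in line.strip() if not ch.isspace()]

with open("train_text.txt", encoding="utf-8") as f:  # hypothetical corpus file
    token_ids = [encode_line(line) for line in f]
```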

marcoyang1998 commented 1 year ago

That's great!

I changed it to use the specified tokens.txt for char-level mapping

Could you please make a PR for the change you made? You can add the corresponding Python scripts under egs/aishell/ASR/local.

Also, would it be possible if you can upload the RNNLM you trained to huggingface so that others can use it? Here is an example: https://huggingface.co/ezerhouni/icefall-librispeech-rnn-lm/tree/main. Thanks!

li563042811 commented 1 year ago

Alright, I will sort out the relevant code and submit a PR as requested. However, due to certain constraints, it is inconvenient for me to upload the model for now.

marcoyang1998 commented 1 year ago

Thank you!

li563042811 commented 1 year ago

I don't know why I can't comment on https://github.com/k2-fsa/icefall/pull/945. Here are some CER results of LM shallow fusion and LODR based on pruned_transducer_stateless2. Although the language model trained only on the transcripts of the aishell training set did not significantly reduce the CER, both the PPL and the CER dropped significantly after adding the aishell2 corpus.

| decoding method | ppl-test | ppl-dev | cer-test | cer-dev |
| --- | --- | --- | --- | --- |
| modified beam search | / | / | 5.09 | 4.67 |
| modified beam search LM shallow fusion with rnnlm | 75.048 | 77.346 | 5.17 | 4.77 |
| modified beam search LODR with rnnlm and 2gram | 75.048 | 77.346 | 5.04 | 4.64 |
| modified beam search LM shallow fusion with transformerlm | 56.122 | 58.554 | 5.16 | 4.74 |
| modified beam search LODR with transformerlm and 2gram | 56.122 | 58.554 | 5.02 | 4.64 |
| modified beam search LM shallow fusion with transformerlm (add aishell2) | 15.823 (test+aishell2) | 6.761 (dev+aishell2) | 4.05 | 3.68 |
| modified beam search LODR with transformerlm (add aishell2) and 2gram | 15.823 (test+aishell2) | 6.761 (dev+aishell2) | 3.93 | 3.57 |
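
For readers comparing the shallow fusion and LODR rows: shallow fusion adds a scaled neural LM log-probability to the transducer score, while LODR additionally subtracts a scaled low-order (here 2-gram) LM estimated on the training transcripts. Roughly (the scale symbols are my own notation, not icefall flag names):

$$\text{score}(y) = \log p_{\text{ASR}}(y \mid x) + \lambda \, \log p_{\text{NNLM}}(y) \quad \text{(shallow fusion)}$$

$$\text{score}(y) = \log p_{\text{ASR}}(y \mid x) + \lambda_1 \log p_{\text{NNLM}}(y) - \lambda_2 \log p_{\text{2-gram}}(y) \quad \text{(LODR)}$$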
marcoyang1998 commented 1 year ago

These are good results! Thanks!

marcoyang1998 commented 1 year ago

@li563042811 Have you tried to train an RNNLM on the aishell1 + aishell2 text? It's not working for me; the perplexity is over 50 in my experiments.

Also, what is the config (e.g. embedding_dim, hidden_dim) for the transformer LM and RNN LM? Would you mind sharing them?

li563042811 commented 1 year ago

@li563042811 Have you tried to train an RNNLM on the aishell1 + aishell2 text? It's not working for me; the perplexity is over 50 in my experiments.

Yes, I trained an RNNLM on the aishell1 + aishell2 text with num_layers=2 and hidden_dim=512. The PPLs on the aishell dev and test sets are 27.9 and 36.2. The shallow fusion CERs on dev and test are 4.21 and 4.63, and the LODR CERs on dev and test are 4.1 and 4.46.

TransformerLM is trained with num_layers=12, nhead=8, embedding_dim=1024, dim_feedforward=2048.

The performance of the model may have something to do with your configuration.
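If it helps to reproduce the setup, the two configurations correspond roughly to the PyTorch sketch below (generic modules with the dimensions above, not icefall's actual model classes; the vocabulary size is a placeholder):

```python
import torch.nn as nn

VOCAB_SIZE = 4336  # placeholder; set to the size of your tokens.txt

class RnnLm(nn.Module):
    """Sketch of the RNNLM config: num_layers=2, hidden_dim=512
    (the embedding dim is assumed equal to hidden_dim)."""
    def __init__(self, vocab_size=VOCAB_SIZE, hidden_dim=512, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.rnn = nn.LSTM(hidden_dim, hidden_dim, num_layers, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        x, _ = self.rnn(self.embed(tokens))
        return self.out(x)  # logits over the vocabulary

class TransformerLm(nn.Module):
    """Sketch of the TransformerLM config: num_layers=12, nhead=8,
    embedding_dim=1024, dim_feedforward=2048 (causal masking omitted)."""
    def __init__(self, vocab_size=VOCAB_SIZE):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 1024)
        layer = nn.TransformerEncoderLayer(
            d_model=1024, nhead=8, dim_feedforward=2048, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=12)
        self.out = nn.Linear(1024, vocab_size)

    def forward(self, tokens):
        return self.out(self.encoder(self.embed(tokens)))
```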

Have you considered adding LM fusion to the sherpa engine? I have some questions about its feasibility: adding LM fusion to beam_search would likely add a lot of latency.

marcoyang1998 commented 1 year ago

Thank you for the reply. I will test the code again.

Have you considered adding LM fusion to the sherpa engine?

csukuangfj is currently on leave. Maybe he can share some ideas on this when he comes back.

li563042811 commented 1 year ago

Thank you for the reply. I will test the code again.

Have you considered adding LM fusion to the sherpa engine?

csukuangfj is currently on leave. Maybe he can share some ideas on this when he comes back.

For the LM with aishell2 added, the test and dev sets used to compute the PPL also combine aishell1 and aishell2. Sorry I didn't mention that above. The CERs are all computed on the aishell1 test and dev sets only.

marcoyang1998 commented 1 year ago

For the LM with aishell2 added, the test and dev sets used to compute the PPL also combine aishell1 and aishell2

You mean the test set for PPL is a combination of the aishell1 and aishell2 test sets?

li563042811 commented 1 year ago

For the LM with aishell2 added, the test and dev sets used to compute the PPL also combine aishell1 and aishell2

You mean the test set for PPL is a combination of the aishell1 and aishell2 test sets?

Yes, for the LM with aishell2 added, the train, test, and dev sets are all combinations of aishell1 + aishell2.

marcoyang1998 commented 1 year ago

I double-checked the Aishell1 and Aishell2 datasets and I found that part of the validation set of Aishell1 can be found in the Aishell2 training set. Did you prune them out?

li563042811 commented 1 year ago

I double-checked the Aishell1 and Aishell2 datasets and I found that part of the validation set of Aishell1 can be found in the Aishell2 training set. Did you prune them out?

No, I did not. I just checked and also found part of the Aishell1 test and valid sets in the Aishell2 training set, so the results above are problematic: the LM needs to be retrained and then used for decoding again. Sorry I didn't notice this earlier.

marcoyang1998 commented 1 year ago

I pruned them out and trained an RNNLM. Without much tuning I get the following CERs on pruned_transducer_stateless3:

| decoding_method | PPL (aishell1-dev) | CER-dev | CER-test |
| --- | --- | --- | --- |
| modified_beam_search | - | 4.79 | 5.05 |
| modified_beam_search_lm_fusion | 50.014 | 4.64 | 4.86 |
| modified_beam_search_LODR | 50.014 | 4.46 | 4.70 |
li563042811 commented 1 year ago

I pruned them out and trained an RNNLM. Without much tuning I get the following CERs on pruned_transducer_stateless3:

| decoding_method | PPL (aishell1-dev) | CER-dev | CER-test |
| --- | --- | --- | --- |
| modified_beam_search | - | 4.79 | 5.05 |
| modified_beam_search_lm_fusion | 50.014 | 4.64 | 4.86 |
| modified_beam_search_LODR | 50.014 | 4.46 | 4.70 |

Excellent, I will prune the test and valid text from the train set, then train a new RNNLM and TransformerLM.
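
The pruning step can be as simple as the following sketch (the file names are assumptions):

```python
# Sketch: remove aishell1 dev/test transcripts from the combined
# training text before retraining the LM.

def read_lines(path: str) -> list:
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

held_out = set(read_lines("aishell1_dev.txt")) | set(read_lines("aishell1_test.txt"))

train_lines = read_lines("lm_train_text.txt")  # aishell1 + aishell2 text
pruned = [line for line in train_lines if line not in held_out]

print(f"removed {len(train_lines) - len(pruned)} overlapping lines")
with open("lm_train_text_pruned.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(pruned) + "\n")
```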

Ryuk17 commented 2 months ago

Can you share the trained model? I have trained an RNN LM and a Transformer LM on aishell1, but the WER with modified_beam_search_lm_rescoring is even higher than that with modified_beam_search.