k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

[WIP]: Implement token level shallow fusion #609

Closed csukuangfj closed 1 year ago

csukuangfj commented 1 year ago

We have been trying to use a word-level G and an LG for RNN-T decoding, but so far only with fast_beam_search. However, a word-level G or LG cannot handle OOV words.

This PR tries to use a token-level G for shallow fusion with modified_beam_search. I am using OpenFst to manipulate the n-gram G on the CPU as it is easier to implement.
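To make the idea concrete, here is a minimal, self-contained sketch of token-level shallow fusion, using a toy dict-based bigram as a stand-in for the n-gram G (the PR itself walks an OpenFst graph on the CPU; the names and numbers below are illustrative, not the PR's API):

import math
import torch

# toy token-level bigram log-probs: log P(next_token | prev_token)
bigram = {
    (0, 1): math.log(0.6), (0, 2): math.log(0.4),
    (1, 1): math.log(0.3), (1, 2): math.log(0.7),
}
FLOOR = math.log(1e-10)  # crude back-off for unseen bigrams

def lm_log_probs(prev_token: int, vocab_size: int) -> torch.Tensor:
    return torch.tensor([bigram.get((prev_token, t), FLOOR) for t in range(vocab_size)])

vocab_size = 3
am_log_probs = torch.log_softmax(torch.randn(vocab_size), dim=-1)  # RNN-T output for one step
ngram_lm_scale = 0.1
prev_token = 0  # last token of the current hypothesis

# shallow fusion: fused = am + scale * lm, per candidate token; the beam search
# then takes its usual top-k over the fused scores
fused = am_log_probs + ngram_lm_scale * lm_log_probs(prev_token, vocab_size)
print(fused.topk(k=2))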

ezerhouni commented 1 year ago

@csukuangfj Looks very promising. Ping me if you need an extra hand.

csukuangfj commented 1 year ago

@csukuangfj Looks very promising. Ping me if you need an extra hand.

@ezerhouni

Thanks! I will draft a version without batch-size support. If it gives promising results, we will need your help to implement a version that supports batches.

ezerhouni commented 1 year ago

@csukuangfj Do you have any update on this issue ? I am very eager to try it out !

csukuangfj commented 1 year ago

@csukuangfj Do you have any update on this issue ? I am very eager to try it out !

Yes. But the results are not good so far. I will post them tonight.

csukuangfj commented 1 year ago

Steps for reproducing the following results:

cd egs/librispeech/ASR
git lfs install
git clone https://huggingface.co/csukuangfj/icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03
mkdir tmp3-3
cd tmp3-3
ln -s $PWD/../icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03/exp/pretrained-iter-468000-avg-16.pt epoch-99.pt
cd ..

./generate-lm.sh

for lm_scale in  0.01 0.2 0.4 ; do
./lstm_transducer_stateless2/decode.py \
  --epoch 99 \
  --avg 1 \
  --use-averaged-model 0 \
  --exp-dir ./tmp3-3 \
  --max-duration 600 \
  --num-encoder-layers 12 \
  --rnn-hidden-size 1024 \
  --decoding-method modified_beam_search2 \
  --beam 8 \
  --max-contexts 4 \
  --ngram-lm-scale $lm_scale
done

You will find the results inside ./tmp3-3/modified_beam_search2


ngram_lm_scale test-clean test-other
0 (baseline) 2.73 7.15
-0.01 2.73 7.17
0.01 2.74 7.15
-0.05 2.75 7.19
0.2 2.76 7.28
-0.1 2.77 7.23
-0.2 2.83 7.46
-0.3 3.01 7.75

I am using a tri-gram LM. Note that the cost on the final state of the FST is not considered.

I will recheck the code in case it contains some bugs.
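For reference, the missing final-state term mentioned above would look roughly like the sketch below; final_cost is a hypothetical helper (in OpenFst terms, the weight of the final state reached by the hypothesis), not code from this PR:

import math

class ToyNgram:
    # stand-in for a token-level n-gram FST wrapper (hypothetical interface)
    def final_cost(self, state: int) -> float:
        # -log P(</s> | LM state); OpenFst stores this as the final weight
        return -math.log(0.5)

ngram_lm = ToyNgram()
ngram_lm_scale = 0.1
hyp_score = -3.2   # fused score accumulated during the search
lm_state = 7       # LM state reached by the finished hypothesis

# add the end-of-sentence cost once, when the hypothesis is finalized
total = hyp_score - ngram_lm_scale * ngram_lm.final_cost(lm_state)
print(total)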

ezerhouni commented 1 year ago

@csukuangfj Thanks !

danpovey commented 1 year ago

I expect that unless there is some kind of domain mismatch, we will not see much or any improvement. (Unless we try super-large LMs. I seem to remember Liyong had some experiment with a 5-gram or something like that?)

csukuangfj commented 1 year ago

I expect that unless there is some kind of domain mismatch, we will not see much or any improvement. (Unless we try super-large LMs. I seem to remember Liyong had some experiment with a 5-gram or something like that?)

I think Liyong was using fast_beam_search + (L, or LG) in https://github.com/k2-fsa/icefall/pull/472

We have never tried to use a token-level G with modified beam search, I think.

ezerhouni commented 1 year ago

I expect that unless there is some kind of domain mismatch, we will not see much or any improvement. (Unless we try super-large LMs. I seem to remember Liyong had some experiment with a 5-gram or something like that?)

I think Liyong was using fast_beam_search + (L, or LG) in #472

We have never tried to use a token-level G with modified beam search, I think.

My 2 cents is that we need a very large LM (like a 5-gram). I will try it tomorrow and let you know.

pkufool commented 1 year ago

I expect that unless there is some kind of domain mismatch, we will not see much or any improvement. (Unless we try super-large LMs. I seem to remember Liyong had some experiment with a 5-gram or something like that?)

I think Liyong was using fast_beam_search + (L, or LG) in #472

We have never tried to use a token-level G with modified beam search, I think.

@glynpu Liyong did try using a token-level G with beam search; he did not make a PR, though. The results are in our weekly meeting notes (the 20th week), as follows:

[image: results table from the weekly meeting notes]

The results show that we cannot get an improvement from a pruned LM.

glynpu commented 1 year ago

@glynpu Liyong did try using a token-level G with beam search; he did not make a PR, though. The results are in our weekly meeting notes (the 20th week), as follows:

The results came from a word-level LM. I was using kenlm at that time; here is the related code: https://github.com/glynpu/icefall/commit/3a9ff316f3601900fdff751bcc31636740c5b1a6
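For anyone who wants to try the word-level route without reading the commit, a hedged sketch with the kenlm Python bindings looks like this ("lm.arpa" is a placeholder path and the scales are illustrative, not values from that experiment):

import kenlm

model = kenlm.Model("lm.arpa")  # word-level ARPA LM (placeholder path)
hyp = "HELLO WORLD"
# kenlm returns log10 probabilities; scale it and add it to the hypothesis score
lm_log10 = model.score(hyp, bos=True, eos=True)
fused = -3.2 + 0.3 * lm_log10   # am_score + lm_scale * lm_score
print(lm_log10, fused)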

ezerhouni commented 1 year ago

@csukuangfj Quick update: I am testing with a 5-gram at the moment. I am getting test-clean: 2.68, test-other: 7.11.

I am still doing some tests and will do a more thorough review of the code.

ezerhouni commented 1 year ago

5-gram, beam size 4:

ngram_lm_scale test-clean test-other
0 (baseline) 2.73 7.15
0.01 2.74 7.15
0.1 2.68 7.11
0.2 2.68 7.14

5-gram, beam size 8:

ngram_lm_scale test-clean test-other
0 (baseline) 2.72 7.15
0.01 2.71 7.14
0.1 2.71 7.11
0.2 2.68 7.06
0.3 2.74 7.28

csukuangfj commented 1 year ago

@ezerhouni

Thanks! Are you using ./generate-lm.sh to generate the 5-gram LM or are you using an LM trained on an external dataset?

ezerhouni commented 1 year ago

@ezerhouni

Thanks! Are you using ./generate-lm.sh to generate the 5-gram LM or are you using an LM trained on an external dataset?

I am using ./generate-lm.sh. I am trying a 7-gram to see whether it helps.

ezerhouni commented 1 year ago

@csukuangfj I tried a 7-gram and it seems to improve a bit (2.67/7.03), but I am not sure it is worth it.

danpovey commented 1 year ago

I think the main use-case of this is when there is a domain mismatch from the training corpus to the target domain. We can also try dividing the scores on the LM arcs by the corresponding scores given a low-order LM estimated on the training data.
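In the log domain, "dividing" by the low-order training-data LM just means subtracting its log-prob, a density-ratio style correction. Toy numbers and scales below, purely to illustrate the sign convention, not values anyone has tuned:

am_logp = -1.2        # RNN-T log-prob for a candidate token
ext_lm_logp = -2.0    # external (target-domain) n-gram log-prob
train_lm_logp = -1.5  # low-order LM estimated on the training data
ext_scale, train_scale = 0.3, 0.3

# add the external LM, subtract ("divide by") the training-data LM
fused = am_logp + ext_scale * ext_lm_logp - train_scale * train_lm_logp
print(fused)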

csukuangfj commented 1 year ago

@csukuangfj I tried a 7-gram and it seems to improve a bit (2.67/7.03), but I am not sure it is worth it.

Sorry for the late reply. I thought I had replied last night.

I think a 7-gram is more than enough. Thanks for your experiments. The result shows that the code works with an n-gram LM, though we don't gain much from it. The next step is to use it to decode with a graph constructed from lists of specific words/phrases that we want to recognize.
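One way to sketch that graph, without claiming this is what the final code will do: encode each phrase into BPE tokens and give every token along a matching phrase a small bonus. A toy trie version (made-up pieces, hypothetical bonus value):

PHRASES = [["▁NEW", "▁YORK"], ["▁ICE", "FALL"]]  # phrases already BPE-tokenized
BONUS = 1.5                                      # per-token boost (illustrative)

trie = {}
for phrase in PHRASES:
    node = trie
    for tok in phrase:
        node = node.setdefault(tok, {})

def boost(prefix):
    # bonus collected by matching `prefix` against the phrase trie
    node, score = trie, 0.0
    for tok in prefix:
        if tok not in node:
            return 0.0   # fell off every phrase; no boost (and no penalty here)
        node, score = node[tok], score + BONUS
    return score

print(boost(["▁NEW", "▁YORK"]))    # 3.0
print(boost(["▁NEW", "▁JERSEY"]))  # 0.0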

ezerhouni commented 1 year ago

@csukuangfj I tried a 7-gram and it seems to improve a bit (2.67/7.03), but I am not sure it is worth it.

Sorry for the late reply. I thought I had replied last night.

I think a 7-gram is more than enough. Thanks for your experiments. The result shows that the code works with an n-gram LM, though we don't gain much from it. The next step is to use it to decode with a graph constructed from lists of specific words/phrases that we want to recognize.

I agree, I think a 5-gram is enough. I was thinking of using it for detecting OOV words. I will let you know once I have more results (unless you have something in mind).

csukuangfj commented 1 year ago

@csukuangfj I tried a 7-gram and it seems to improve a bit (2.67/7.03), but I am not sure it is worth it.

Sorry for the late reply. I thought I had replied last night. I think a 7-gram is more than enough. Thanks for your experiments. The result shows that the code works with an n-gram LM, though we don't gain much from it. The next step is to use it to decode with a graph constructed from lists of specific words/phrases that we want to recognize.

I agree, I think a 5-gram is enough. I was thinking of using it for detecting OOV words. I will let you know once I have more results (unless you have something in mind).

By the way, @marcoyang1998 is using the RNN-LM model that you provided for conformer CTC for shallow fusion, and he can get a WER of 2.46 on test-clean without specific tuning.
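A minimal sketch of what token-level RNN-LM shallow fusion looks like (not marcoyang1998's code; an untrained toy model and an illustrative scale): the RNN-LM predicts log-probs over the same BPE vocabulary, which are scaled and added to the RNN-T log-probs exactly like the n-gram scores above.

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden = 500, 32, 64
embed = nn.Embedding(vocab_size, embed_dim)
rnn_lm = nn.LSTM(embed_dim, hidden, batch_first=True)
proj = nn.Linear(hidden, vocab_size)

prev_tokens = torch.tensor([[3, 17, 42]])  # hypothesis so far (token IDs)
out, _ = rnn_lm(embed(prev_tokens))
lm_log_probs = torch.log_softmax(proj(out[:, -1]), dim=-1)  # next-token LM scores

am_log_probs = torch.log_softmax(torch.randn(1, vocab_size), dim=-1)
fused = am_log_probs + 0.3 * lm_log_probs  # lm_scale = 0.3, illustrative
print(fused.topk(4).indices)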

ezerhouni commented 1 year ago

@csukuangfj I tried a 7-gram and it seems to improve a bit (2.67/7.03), but I am not sure it is worth it.

Sorry for the late reply. I thought I had replied last night. I think a 7-gram is more than enough. Thanks for your experiments. The result shows that the code works with an n-gram LM, though we don't gain much from it. The next step is to use it to decode with a graph constructed from lists of specific words/phrases that we want to recognize.

I agree, I think a 5-gram is enough. I was thinking of using it for detecting OOV words. I will let you know once I have more results (unless you have something in mind).

By the way, @marcoyang1998 is using the RNN-LM model that you provided for conformer CTC for shallow fusion, and he can get a WER of 2.46 on test-clean without specific tuning.

Sounds interesting! If I am not mistaken, we can't add new words on the fly to an already trained RNN-LM, can we?

csukuangfj commented 1 year ago

Sounds interesting! If I am not mistaken, we can't add new words on the fly to an already trained RNN-LM, can we?

The RNN-LM is at the token level, so as long as the new word can be represented by the BPE tokens, it can be rescored by the RNN-LM, I think.
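A quick way to see this, assuming a SentencePiece BPE model is at hand ("bpe.model" below is a placeholder path):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="bpe.model")  # placeholder path
# an unseen word still decomposes into known BPE pieces, so the token-level
# RNN-LM (or n-gram G) can assign it a score
print(sp.encode("PNEUMONOULTRAMICROSCOPIC", out_type=str))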

ezerhouni commented 1 year ago

The RNN-LM is at the token level, so as long as the new word can be represented by the BPE tokens, it can be rescored by the RNN-LM, I think.

Indeed, but we can't "boost" specific words (or combinations of specific tokens).

csukuangfj commented 1 year ago

The RNN-LM is at the token level, so as long as the new word can be represented by the BPE tokens, it can be rescored by the RNN-LM, I think.

Indeed, but we can't "boost" specific words (or combinations of specific tokens).

Yes, you are right. That is why we are trying to integrate FST into decoding.

ezerhouni commented 1 year ago

@csukuangfj I have a batch version (à la modified_beam_search). I took your commits and added mine on top (with a rebase). I will create a new PR if that's OK.

csukuangfj commented 1 year ago

@csukuangfj I have a batch version (à la modified_beam_search). I took your commits and added mine on top (with a rebase). I will create a new PR if that's OK.

Yes, thanks! I will close this PR once you create a new PR.

csukuangfj commented 1 year ago

See #630