kensho-technologies / pyctcdecode

A fast and lightweight python-based CTC beam search decoder for speech recognition.
Apache License 2.0

Using nemo language models #50

Closed ntaiblum closed 2 years ago

ntaiblum commented 2 years ago

Hello,

We are using your package with NeMo's Conformer-CTC as the acoustic model and a language model trained with NeMo's train_kenlm.py script. When running your beam-search decoder with the Conformer alone, it works great, but when we try to run it together with the language model we get poor results (very high WER), worse than running without the LM at all.

To our understanding, NeMo's train_kenlm.py script builds the LM with the Conformer's tokenizer and performs a 'trick' that encodes the sub-word tokens of the training data as unicode characters, using an offset into the unicode table. Accordingly, NeMo's own beam-search decoding script applies the same 'trick' to the vocabulary before the beam search itself and then converts the output text from unicode characters back to the original tokens. We suspect this mismatch might be the reason for our results. We would really appreciate it if you could explain how to use NeMo's Conformer with a language model that was trained using NeMo's train_kenlm.py script.

In addition, while exploring the language model issue, we noticed that your beam-search decoder can run NeMo's Conformer together with any KenLM language model, even one created with a different tokenizer than the Conformer's. Isn't the LM scoring performed on the tokens? If so, how is that possible when the tokens of the Conformer and the language model are different?

Thanks
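(For context, the encoding trick described above roughly amounts to the following sketch; the offset value and helper names are illustrative assumptions, not NeMo's actual code.)

```python
TOKEN_OFFSET = 100  # hypothetical offset into the unicode table

def encode_token_ids(token_ids):
    """Map each sub-word token id to a single unicode character."""
    return "".join(chr(token_id + TOKEN_OFFSET) for token_id in token_ids)

def decode_unicode_text(text):
    """Map decoded unicode characters back to sub-word token ids."""
    return [ord(ch) - TOKEN_OFFSET for ch in text]

# e.g. token ids [5, 42, 7] become a 3-character "word" that the subword KenLM
# scores, and the beam-search output is mapped back to tokens afterwards.
assert decode_unicode_text(encode_token_ids([5, 42, 7])) == [5, 42, 7]
```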

gkucsko commented 2 years ago

Hey, thanks. Yeah, that's correct. Our default approach is to have the LM be a standard word-based one and leave the BPE merging up to the decoder. This way we avoid the re-mapping and also preserve the correct n-gram context of whole words rather than just fragments. As for NeMo, I believe train_kenlm.py automatically chooses the encoding_level based on the model type. I would suggest hard-coding that to 'char' for your BPE model and then trying again with that LM. Maybe @VahidooX could chime in here on the recommended setup. Also make sure to use consistent upper/lower casing in the vocabulary and language model, and to pass a list of all unigrams, to get good decoding results. If you have a minimum viable code example, I'm also happy to have a look.
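To make the word-level setup concrete, here is a minimal sketch. The model name, file names, and the way logits are obtained from NeMo are assumptions that may differ across NeMo versions; the build_ctcdecoder / decode calls are pyctcdecode's standard API, and a word-level KenLM package must be installed.

```python
import numpy as np
import nemo.collections.asr as nemo_asr
from pyctcdecode import build_ctcdecoder

# Assumed model name and file names; adjust to your own checkpoint and LM files.
asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_conformer_ctc_small")
labels = asr_model.decoder.vocabulary  # BPE sub-word labels of the Conformer

# Word-level KenLM (not the subword model from train_kenlm.py) plus the list of
# words it was trained on, with casing consistent with the LM training corpus.
with open("unigrams.txt", encoding="utf-8") as f:
    unigrams = [line.strip() for line in f if line.strip()]

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="word_lm.binary",
    unigrams=unigrams,
    alpha=0.5,  # LM weight, tune on a dev set
    beta=1.5,   # word insertion bonus, tune on a dev set
)

# Frame-level log-probabilities for one utterance, shape (time, vocab);
# the logprobs=True call is version-dependent in NeMo.
logits = asr_model.transcribe(["audio.wav"], logprobs=True)[0]
print(decoder.decode(np.asarray(logits)))
```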

VahidooX commented 2 years ago

If you want to use pyctcdecode, then I do not think you can (or need to) use NeMo's train_kenlm.py. One difference between pyctcdecode and NeMo's LM decoding is the unicode encoding trick; the other is that we train the KenLM on subwords, while, if I understand correctly, pyctcdecode uses a KenLM model trained at the word level. So if you want to use pyctcdecode, you do not need the ASR model or its tokenizer during LM training. In that case, you can simply train a regular word-level KenLM with a single command.
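(For reference, training such a word-level KenLM is typically a one-liner with the standard KenLM tools, e.g. `lmplz -o 4 < corpus.txt > lm.arpa` followed by `build_binary lm.arpa lm.binary`; the corpus and output file names are placeholders and the n-gram order is just a common choice.)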

ntaiblum commented 2 years ago

Thanks for clarifying this!

gkucsko commented 2 years ago

Hope it works; feel free to reopen if you are still seeing issues.