kensho-technologies / pyctcdecode

A fast and lightweight python-based CTC beam search decoder for speech recognition.
Apache License 2.0

pyctcdecode not working with Nemo finetuned model #93

Open manjuke opened 1 year ago

manjuke commented 1 year ago

Hi all, I am working on integrating pyctcdecode with NeMo ASR models. It works well (without errors) for pretrained NeMo models like "stt_en_conformer_ctc_small", as in the code snippet below:

```python
import nemo.collections.asr as nemo_asr
from pyctcdecode import build_ctcdecoder

myFile = ['sample-in-Speaker_1-11.wav']
asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(
    model_name='stt_en_conformer_ctc_small')
logits = asr_model.transcribe(myFile, logprobs=True)[0]
print((logits.shape, len(asr_model.decoder.vocabulary)))
decoder = build_ctcdecoder(asr_model.decoder.vocabulary)
decoder.decode(logits)
```

The same code snippet fails if I use a fine-tuned NeMo model in place of the pretrained model. The error says: "ValueError: Input logits shape is (36, 513), but vocabulary is size 512. Need logits of shape: (time, vocabulary)". The fine-tuned model is loaded as below:

```python
asr_model = nemo_asr.models.EncDecCTCModelBPE.restore_from(restore_path="<path to fine-tuned model>")
```

Please suggest @gkucsko @lopez86. Thanks

manjuke commented 1 year ago

I found that this happens if there is a \<pad> token in the vocabulary. I took the conformer-en-ctc-medium model, fine-tuned it on the same Tamil dataset without \<pad> tokens, and tried to decode with pyctcdecode. It worked without any issues and produced the expected output. So please check, it might be a bug. Or suggest a workaround to exclude \<pad> symbols (for decoding with pyctcdecode) when using models that were trained with \<pad> symbols.

I have also posted this to the NVIDIA NeMo team at https://github.com/NVIDIA/NeMo/issues/5738 and they suggested checking with you further. Please refer to https://github.com/NVIDIA/NeMo/issues/5738 for all details. @gkucsko @lopez86 Thanks

lopez86 commented 1 year ago

As a quick workaround, would it be possible to just add the missing token to the vocabulary when you build the decoder? You might still have to do some postprocessing to deal with it, but it should prevent the error from happening. For a more complete solution, I think we would want to add an argument somewhere with a set of special non-word tokens that get ignored during decoding.
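Something like this, as a rough sketch (it assumes the extra logit column corresponds to a single \<pad> token appended at the end of the vocabulary, and the postprocessing step is only illustrative):

```python
from pyctcdecode import build_ctcdecoder

# Pad the label list so its length matches the logits' last dimension (513 in your case).
labels = list(asr_model.decoder.vocabulary) + ["<pad>"]
decoder = build_ctcdecoder(labels)

# Postprocess: strip any <pad> symbols that leak into the hypothesis.
text = decoder.decode(logits).replace("<pad>", "").strip()
```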

manjuke commented 1 year ago

Sure, thanks for the response. Currently the \<pad> token is already included in the vocabulary list that is passed to build_ctcdecoder.

I managed to train a model without \<pad> tokens. build_ctcdecoder works without any errors if I do not include kenlm_model_path (and pass only the vocabulary), as below:

```python
decoder = build_ctcdecoder(asr_model.decoder.vocabulary)
```

But it gives an error if I pass the kenlm_model_path argument, as below:

```python
decoder = build_ctcdecoder(asr_model.decoder.vocabulary, kenlm_model_path=kenlm_model_file)
```

The error produced is: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 62: invalid start byte

Any suggestions on this? Please help @lopez86

lopez86 commented 1 year ago

Can you provide more information on the error? What function is generating it? What kind of system do you have (Windows sometimes has errors that other systems don't)?

manjuke commented 1 year ago

Thanks for the response. I was able to resolve this. It happens if I use the LM generated by the NeMo script NeMo-main/scripts/asr_language_modeling/ngram_lm/train_kenlm.py. The contents of the generated arpa file cannot be read by opening it in the vim editor.

This error does not occur if I use the LM generated directly with the kenlm command below:

```
kenlm/build/bin/lmplz -o 5 < train_text_lm.txt > bh_discntfalbak_5gram.arpa
```

The contents of this arpa file can be read by opening it in vim.

Not sure what is going on. Thanks

lopez86 commented 1 year ago

I haven't been able to reproduce this yet, but I have a few ideas.

First, since you're finding that you can't see the contents in vim, are you sure you're looking at an arpa file and not the binary output from train_kenlm.py? pyctcdecode is able to use binary files, but you would need to provide your own list of unigrams (a rough sketch is at the end of this comment).

Second, are there any non-UTF-8 characters in your character set? Non-UTF-8 characters in the output could generate a decoding error if you're decoding into UTF-8. This could perhaps be addressed by adding better support for arbitrary encodings.

Third, pyctcdecode assumes that you have a word-based n-gram model. For models/checkpoints derived from EncDecCTCModelBPE, it looks like train_kenlm.py might be training a BPE character-based n-gram model using the model tokenizer's internal encoding scheme. In that case, I'm not sure the output is compatible with pyctcdecode. You should still be able to open the files with vim, though.
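For the first point, here is a minimal sketch of passing a binary KenLM model together with an explicit unigram list (the file names are placeholders; `unigrams.txt` is assumed to contain one word per line):

```python
from pyctcdecode import build_ctcdecoder

# Placeholder paths: the binary LM produced by train_kenlm.py / build_binary,
# and a text file with one word-level unigram per line.
binary_lm_path = "lm.bin"
unigrams_path = "unigrams.txt"

with open(unigrams_path, encoding="utf-8") as f:
    unigrams = [line.strip() for line in f if line.strip()]

# With a binary LM, pyctcdecode cannot recover the unigrams from the file itself,
# so they have to be supplied explicitly.
decoder = build_ctcdecoder(
    asr_model.decoder.vocabulary,
    kenlm_model_path=binary_lm_path,
    unigrams=unigrams,
)
```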

manjuke commented 1 year ago

OK, sure. Thank you for the response.

I had tried the first approach earlier by supplying a unigrams list. It worked without errors, but the WER of the hypotheses was very bad. I suspect that the list of unigrams I provided was incorrect.

I had extracted the unique words from the text used for building the LM and used that as the unigram list (sketched below). Is that correct, or is there another way to get the unigrams? Please suggest; I do not have a good understanding of how to generate the list of unigrams.
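Roughly, this is what I did (a sketch; train_text_lm.txt is the text the LM was built from):

```python
# Collect the unique whitespace-separated words from the LM training text.
unigrams = set()
with open("train_text_lm.txt", encoding="utf-8") as f:
    for line in f:
        unigrams.update(line.split())
unigram_list = sorted(unigrams)
```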

Also, I had tried providing the list of sub-word tokens of the training text as the unigram list, but that too gave very bad WERs. So it is still not clear to me how to generate the unigrams.