kensho-technologies / pyctcdecode

A fast and lightweight python-based CTC beam search decoder for speech recognition.
Apache License 2.0

Providing unigrams does not change the output. #105

Closed HendrikLaux closed 1 year ago

HendrikLaux commented 1 year ago

First of all, thank you for the great work! I am using a model based on an Nvidia NeMo model and hadn't found any other decoder that let me use a KenLM language model and actually get an improvement on the results with the BPE output. The way NeMo proposes it didn't work for me (probably my fault), but the word-based KenLMs in this decoder do work for me, and I found it super simple to set up.

However, although the LM works for me, I have a question regarding the unigrams provided as a parameter.

Before using pyctcdecode, I used torchaudio's ctc_decoder from torchaudio.models.decoder. It optionally allows providing a "lexicon", which is basically a list of potential words that might occur (and which, as I understand it, is the same as providing the "unigrams" list in pyctcdecode).

By providing this lexicon and using "lexicon-based" decoding with that decoder, I was able to reduce the WER by around 5% with my fine-tuned model. As mentioned before, I never got the LM integration to work there, which is why I switched to this decoder.

So I tried decoding the log-softmax output of the NeMo model with pyctcdecode:

1. with `unigrams=None` and `kenlm_model_path=None`
2. with unigrams provided and `kenlm_model_path=None`
3. with both unigrams and a KenLM model provided

For my model I get the following results (averaged over the full test set):

- (1): 85.36%
- (2): 85.36%
- (3) with alpha=0.1: 71.49%
- (3) with alpha=0.3: 72.63%
- (3) with alpha=0.5: 76.05%

Furthermore, the text outputs of (1) and (2) are exactly the same. The improvement from the KenLM model is amazing, but I am wondering why there is no difference between providing just the unigrams and providing nothing at all. Is providing unigrams only meaningful when also providing an LM? Is there a difference between providing unigrams here and a "lexicon" in torchaudio's ctc_decoder that I didn't understand?

Edit: I also tried providing the full list of unique words as "hotwords" in the decoding process itself, because I thought this might be more similar to the way torchaudio implements the "lexicon-based" decoding. But this worsened results significantly because the list of words is rather big (and not very "hotword-like").

lopez86 commented 1 year ago

Hi, you are correct: currently, providing unigrams is only meaningful when it is accompanied by a language model; otherwise it has no effect. I can see how using unigrams without an LM to constrain which words are output might be useful in some contexts as a very simple language model, but we don't have that right now.
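As a hypothetical illustration (not pyctcdecode's actual code), the behavior can be sketched as a beam scorer in which the unigram set is consulted only inside the LM branch, so with `lm=None` the unigrams never touch the score. All names and constants here are made up:

```python
def beam_word_score(acoustic_logp, word, lm=None, unigrams=None,
                    alpha=0.5, unk_offset=-10.0):
    """Toy beam score for a completed word; a dict stands in for KenLM."""
    score = acoustic_logp
    if lm is not None:                       # LM branch
        lm_logp = lm.get(word, -20.0)        # backoff score for unseen words
        if unigrams is not None and word not in unigrams:
            lm_logp += unk_offset            # out-of-vocabulary penalty
        score += alpha * lm_logp
    # with lm=None, the unigram check above is never reached
    return score

no_lm = beam_word_score(-1.0, "colour", unigrams={"color"})
with_lm = beam_word_score(-1.0, "colour", lm={"color": -2.0}, unigrams={"color"})
```

In this sketch, `no_lm` equals the bare acoustic score no matter what unigrams are passed, while `with_lm` is pulled down by both the LM backoff and the unigram penalty.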

I'm not sure exactly how torchaudio's ctc_decoder is implemented, but it looks like a "lexicon" there provides a dictionary of equivalent spellings of words, which I imagine allows for better scoring when a word has multiple valid spellings (like color vs. colour, center vs. centre, etc.). The hotwords in pyctcdecode are words given an additional boost because they are known to be likely to come up, so a large list will likely lead to poorer performance; it works best with a small number of unusual key terms that might appear in your audio.
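As a toy sketch of why a huge hotword list stops helping: pyctcdecode actually boosts partial prefix matches during beam search, which can additionally distort pruning, but even this simplified version shows that a flat bonus applied to every candidate word is no longer informative. All words, scores, and the weight below are made up:

```python
def rank_candidates(scores, hotwords, weight=5.0):
    """Rank candidate words by score plus a flat hotword bonus."""
    boosted = {w: s + (weight if w in hotwords else 0.0)
               for w, s in scores.items()}
    return sorted(boosted, key=boosted.get, reverse=True)

scores = {"kenlm": -4.0, "can": -1.0, "the": -0.5}

# a short list of rare terms reorders exactly the intended candidate
few = rank_candidates(scores, hotwords={"kenlm"})

# boosting the entire vocabulary shifts every score equally: nothing changes
everything = rank_candidates(scores, hotwords=set(scores))
```

With one rare hotword, `"kenlm"` jumps to the top; with every word on the list, the ranking is identical to using no hotwords at all, so the extra weight only interferes with the search.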

HendrikLaux commented 1 year ago

Thanks for your quick and detailed response. That answers all my questions!