kensho-technologies / pyctcdecode

A fast and lightweight python-based CTC beam search decoder for speech recognition.
Apache License 2.0

Unigrams not provided warning #85

Open to-schi opened 1 year ago

to-schi commented 1 year ago

I am using a KenLM 4-gram language model (binary) with DeepSpeech2 that works quite well, but I constantly get warnings that seem unnecessary:

WARNING:pyctcdecode.decoder:Unigrams not provided and cannot be automatically determined from LM file (only arpa format). Decoding accuracy might be reduced.
WARNING:pyctcdecode.language_model:No known unigrams provided, decoding results might be a lot worse.

When I provide a list of unigrams like this, the warnings are gone and the accuracy seems to be the same, but the computation time is significantly higher:

import numpy as np
import pyctcdecode

unigrams_file = "./kenlm-model/vocab-500000.txt"
with open(unigrams_file) as f:
    list_of_unigrams = [line.rstrip() for line in f]

def ctc_decoding_lm(logits, model_path=LM_MODEL_PATH, unigrams=list_of_unigrams):
    # Build a decoder backed by the KenLM model and the unigram list
    decoder = pyctcdecode.build_ctcdecoder(
        labels=char_to_int.get_vocabulary(),
        kenlm_model_path=model_path,
        unigrams=unigrams,
        alpha=0.9,
        beta=1.2,
    )

    logits = np.squeeze(logits)
    text = decoder.decode(logits)
    return text

In which case is providing extra unigrams relevant?

leecming82 commented 1 year ago

I looked into this a while back, as I also noticed the extended load time with unigrams. The unigram data is used for scoring beams, specifically when dealing with word parts (partial words): the beam search code assigns a score by matching against a character-level trie built out of the unigrams. The extra load time comes from building that trie.

I encountered an accuracy hit when leaving out the unigram data in my own use case, so it's very much a YMMV situation.
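To illustrate the idea (a simplified sketch, not pyctcdecode's actual implementation), a character-level trie over the unigrams lets the decoder check cheaply whether a partial word on a beam could still grow into a known word:

```python
class CharTrie:
    """Minimal character-level trie for prefix lookups over a unigram list."""

    def __init__(self, words):
        self.root = {}
        for word in words:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node["$"] = True  # end-of-word marker

    def is_prefix(self, partial):
        """True if `partial` is a prefix of at least one known unigram."""
        node = self.root
        for ch in partial:
            if ch not in node:
                return False
            node = node[ch]
        return True

trie = CharTrie(["hello", "help", "world"])
print(trie.is_prefix("hel"))  # True: could become "hello" or "help"
print(trie.is_prefix("xyz"))  # False: no unigram starts with this
```

Building this structure once per decoder over a 500k-word vocabulary is what dominates the load time described above.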

to-schi commented 1 year ago

Thank you! I will have to test again, and maybe find a way to get rid of the warnings in Colab. This is not working so far:

import warnings
warnings.filterwarnings("ignore")
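Note that these messages are emitted through Python's `logging` module rather than the `warnings` module (the `WARNING:pyctcdecode.decoder:` prefix is the standard logging format), which is why `warnings.filterwarnings` has no effect. A sketch of the likely fix is to raise the threshold on the `pyctcdecode` logger:

```python
import logging

# The messages come from loggers named "pyctcdecode.decoder" and
# "pyctcdecode.language_model"; child loggers inherit the level set
# on the parent "pyctcdecode" logger, so one call silences both.
logging.getLogger("pyctcdecode").setLevel(logging.ERROR)
```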
manjuke commented 1 year ago

I am working on integrating the NeMo logits with pyctcdecode for decoding. I derived the unique words from the training text, stored them in unigrams_list, and called pyctcdecode as below:

decoder_lm = build_ctcdecoder(asr_model.decoder.vocabulary, kenlm_model_path="<kenlm_model_file>", unigrams="<unigrams_list>")

This gives me the warning below. Is this intentional?

Only 0.0% of unigrams in vocabulary found in kenlm model -- this might mean that your vocabulary and language model are incompatible.

The results that I get are poorer than the results obtained through decoding without the LM. Please suggest. Thanks

lopez86 commented 1 year ago

> I am working on integrating the NeMo logits with pyctcdecode for decoding. I derived the unique words from the training text, stored them in unigrams_list, and called pyctcdecode as below: decoder_lm = build_ctcdecoder(asr_model.decoder.vocabulary, kenlm_model_path="<kenlm_model_file>", unigrams="<unigrams_list>") This gives me the warning: Only 0.0% of unigrams in vocabulary found in kenlm model -- this might mean that your vocabulary and language model are incompatible. Is this intentional?
>
> The results that I get are poorer than the results obtained through decoding without the LM. Please suggest. Thanks

In this case I would check what words are actually in the KenLM model. Could it be a character-based or word-part-based model? That would probably explain both the apparent incompatibility between the unigrams and the LM, and also the poor results.
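One way to inspect the model's vocabulary, assuming you still have the text-format ARPA file the binary model was built from, is to read the words listed in its `\1-grams:` section. A rough sketch:

```python
def read_arpa_unigrams(arpa_path):
    """Collect the words from the \\1-grams: section of an ARPA-format LM."""
    unigrams = []
    in_unigrams = False
    with open(arpa_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line == "\\1-grams:":
                in_unigrams = True
                continue
            # A blank line or new section header ends the 1-gram block
            if in_unigrams and (not line or line.startswith("\\")):
                break
            if in_unigrams:
                # Each entry is: log10_prob <tab> word [<tab> backoff]
                parts = line.split("\t")
                if len(parts) >= 2:
                    unigrams.append(parts[1])
    return unigrams
```

Comparing this list against your unigrams_list (and against the decoder's labels) should show quickly whether the LM operates on words, characters, or word parts.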