to-schi opened this issue 1 year ago
I looked into this a while back, as I also noticed the extended load time with unigrams. The unigram data is used for scoring beams, specifically when dealing with word parts (partial words): the beam search code assigns a score by matching against a character-level trie built out of the unigrams, and the load time comes from building that trie.
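For intuition, here is a minimal sketch of that kind of character-level trie; the class and method names are illustrative, not pyctcdecode's actual internals:

class CharTrie:
    def __init__(self):
        self.children = {}  # char -> CharTrie
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, CharTrie())
        node.is_word = True

    def has_prefix(self, prefix):
        # True if any stored unigram starts with `prefix` --
        # the kind of lookup used to score a partial word in a beam.
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return False
        return True

trie = CharTrie()
for word in ["hello", "help", "world"]:  # in practice: every unigram
    trie.insert(word)

Inserting every unigram character by character like this is what dominates the load time.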
I encountered an accuracy hit when leaving out the unigram data in my own use case, so it's very much a YMMV situation.
Thank you! I have to test again, and maybe find a way to get rid of the warnings in Colab. Not working so far:
import warnings
warnings.filterwarnings("ignore")
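These messages come from Python's logging module rather than the warnings module (note the WARNING:pyctcdecode.decoder: prefix quoted further down), which would explain why warnings.filterwarnings has no effect on them. A sketch that should silence them instead, with the logger name inferred from that prefix:

import logging

# Silence the library's loggers instead of the warnings module;
# the logger name "pyctcdecode" is assumed from the message prefix.
logging.getLogger("pyctcdecode").setLevel(logging.ERROR)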
I am working on integrating the NeMo logits with pyctcdecode for decoding. I derived "unique" words from the training text and stored them in unigrams_list. I called build_ctcdecoder as below:

decoder_lm = build_ctcdecoder(asr_model.decoder.vocabulary, kenlm_model_path="<kenlm_model_file>", unigrams="<unigrams_list>")

This gives me the following warning: "Only 0.0% of unigrams in vocabulary found in kenlm model-- this might mean that your vocabulary and language model are incompatible. Is this intentional?"

The results I get are poorer than the results obtained by decoding without the LM. Please suggest. Thanks
In this case I would check what words are actually in the kenlm model. Could it be a character-based or word-part-based model? That would probably explain both the reported incompatibility between the unigrams and the LM, and the poor results.
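One way to do that check when the original arpa file is available (a binary KenLM model would have to be inspected via its source arpa; the file name here is a placeholder):

# Read the \1-grams: section of an arpa file and print a sample,
# to see whether the LM's vocabulary is full words or word parts.
unigrams = []
with open("lm.arpa", encoding="utf-8") as f:
    in_unigrams = False
    for line in f:
        line = line.strip()
        if line == "\\1-grams:":
            in_unigrams = True
            continue
        if in_unigrams:
            if not line or line.startswith("\\"):
                break  # end of the unigram section
            # arpa unigram lines are: log10_prob <tab> token [<tab> backoff]
            unigrams.append(line.split("\t")[1])

print(unigrams[:20])  # eyeball: words, or characters/word-parts?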
I am using a KenLM 4-gram language model (binary) with DeepSpeech2 that works quite well, but I constantly get warnings that seem unnecessary:
"WARNING:pyctcdecode.decoder:Unigrams not provided and cannot be automatically determined from LM file (only arpa format). Decoding accuracy might be reduced. WARNING:pyctcdecode.language_model:No known unigrams provided, decoding results might be a lot worse."
When I provide a list of unigrams (a call along the lines of the sketch below), the warnings are gone and the performance seems to be the same, but the computation time is significantly higher.
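Something like this; the paths and the label alphabet are placeholders, not my actual setup:

from pyctcdecode import build_ctcdecoder

labels = list("abcdefghijklmnopqrstuvwxyz' ")  # placeholder output alphabet

# one unigram per line; "unigrams.txt" is a placeholder path
with open("unigrams.txt", encoding="utf-8") as f:
    unigram_list = [line.strip() for line in f if line.strip()]

decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="lm_4gram.binary",  # placeholder path
    unigrams=unigram_list,
)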
In which cases is providing extra unigrams relevant?