kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

How can I keep the number of 1-grams consistent with the number of words contained in the vocab? #243

Open sf9218 opened 4 years ago

sf9218 commented 4 years ago

I know that with SRILM we can pass `-limit-vocab -vocab vocab_file` when training the LM, so the number of 1-grams ends up the same as the number of words contained in the vocab. When I use KenLM, I also used `--limit_vocab_file`, just like with SRILM. But I found that the LM trained by KenLM has a different number of words than the vocab. Specifically, the vocab contains 136922 words (I used `wc -l vocab` to count them), but the number of 1-grams in the LM trained by KenLM is:

```
\data\
ngram 1=25846
```

We can see there are only 25846 1-grams in the LM.

How can I keep the number of 1-grams consistent with the number of words contained in the vocab?
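
For reference, a minimal sketch of the kind of invocation described above, assuming KenLM's `lmplz` is on the PATH; the file names `vocab` and `train.txt` and the order `-o 3` are placeholders, not taken from the issue:

```sh
# Count the vocabulary size: one word per line.
wc -l vocab   # reported as 136922 in this issue

# Train a 3-gram LM. --limit_vocab_file maps any training word
# outside vocab to <unk>; it does NOT add vocab words that are
# missing from the training text, which is why the unigram count
# can be smaller than the vocab size.
lmplz -o 3 --limit_vocab_file vocab < train.txt > lm.arpa
```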

dophist commented 4 years ago

@sf9218 I guess you want to fix KenLM's vocabulary to be exactly your vocabulary even though some of the words are not present in your training text. This is a common prerequisite in closed-vocabulary applications such as speech recognition.

KenLM doesn't prune unigrams. Knowing this, you can manipulate your training text so that it contains every word in your vocabulary at least once; the trained .arpa will then have exactly the same unigrams as your vocab, plus `<s>`, `</s>`, and `<unk>`.

One simple practice: just append your vocab_file to your training text, as in the sketch below.
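
A minimal sketch of that suggestion, under the same assumptions as above (placeholder file names, `lmplz` on the PATH):

```sh
# Append the vocab (one word per line) to the training text so that
# every vocabulary word occurs at least once, each as its own line.
cat vocab >> train.txt

# Retrain; since KenLM never prunes unigrams, every vocab word survives.
lmplz -o 3 --limit_vocab_file vocab < train.txt > lm.arpa

# The ARPA header's unigram count should now equal the vocab size
# plus the three special tokens <s>, </s>, and <unk>.
head -n 5 lm.arpa
```

Note that each appended line is treated as a one-word sentence, which slightly perturbs the unigram counts; that is usually acceptable when the goal is a closed vocabulary.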

manjuke commented 1 year ago

I am facing a similar issue. Any update on this? Thanks