kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Estimating probabilities of new words #128

Closed ghost closed 6 years ago

ghost commented 6 years ago

I am working on an ASR problem that requires adapting to new words on the fly. I can add new words and recompile the .lm file, but the model still does not correct the misrecognized words. My guess is that the probability of the OOV words (the newly added words) is too low and needs to be re-estimated. The same is true for proper nouns. Any help?

ghost commented 6 years ago

@kpu, any help on this issue?

kpu commented 6 years ago

The OOV probability is lower than that of all seen words, which makes logical sense. If your task requires something other than that, you should really add an OOV counter feature and tune the weight.
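
For illustration, here is a minimal sketch of what an OOV counter feature could look like when rescoring hypotheses. It assumes per-token scores arrive as (log10 probability, n-gram length, is-OOV) tuples, the shape yielded by full_scores() in the kenlm Python module. The names score_with_oov_feature and pick_best, and the oov_weight parameter, are hypothetical; the weight would have to be tuned on held-out data rather than set by hand.

```python
from typing import Iterable, Sequence, Tuple

# One entry per token: (log10 probability, matched n-gram length, is_oov).
# This mirrors the tuples yielded by kenlm.Model.full_scores(); here they are
# passed in directly so the sketch does not need a model file.
TokenScore = Tuple[float, int, bool]


def score_with_oov_feature(token_scores: Iterable[TokenScore],
                           oov_weight: float) -> float:
    """Combine the LM log probability with a weighted count of OOV tokens."""
    total_logprob = 0.0
    oov_count = 0
    for logprob, _ngram_length, is_oov in token_scores:
        total_logprob += logprob
        oov_count += int(is_oov)
    # A positive weight rewards OOV tokens, a negative weight penalizes them.
    return total_logprob + oov_weight * oov_count


def pick_best(hypotheses: Sequence[Sequence[TokenScore]],
              oov_weight: float) -> int:
    """Return the index of the highest-scoring hypothesis."""
    scores = [score_with_oov_feature(h, oov_weight) for h in hypotheses]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical scores: the second hypothesis contains one OOV token.
in_vocab_hyp = [(-1.0, 1, False), (-2.0, 2, False)]
with_oov_hyp = [(-1.0, 1, False), (-4.5, 1, True)]
```

With oov_weight set to 0 this reduces to plain LM scoring, so the in-vocabulary hypothesis wins; a large enough positive weight lets the hypothesis containing the new word win instead.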

As a kludge, you can pass --interpolate_unigrams 0 to lmplz to steal some mass from seen words and give it to OOV. Or edit the ARPA file and increase the log probability of <unk>.
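
The ARPA edit could be scripted roughly as follows. This is a sketch, not part of KenLM: it assumes a tab-separated ARPA file in which each unigram line has the form "logprob<TAB>word[<TAB>backoff]" under a \1-grams: header, and the function name set_unk_logprob is hypothetical.

```python
from typing import Iterable, List


def set_unk_logprob(arpa_lines: Iterable[str], new_logprob: float) -> List[str]:
    """Return the ARPA lines with the <unk> unigram log10 probability replaced.

    Only the entry inside the \\1-grams: section is changed; higher-order
    n-grams that mention <unk> are left alone.
    """
    out: List[str] = []
    in_unigrams = False
    for line in arpa_lines:
        stripped = line.rstrip("\n")
        if stripped.startswith("\\"):
            # Section markers: \data\, \1-grams:, \2-grams:, ..., \end\
            in_unigrams = stripped.strip() == "\\1-grams:"
            out.append(line)
            continue
        if in_unigrams:
            fields = stripped.split("\t")
            if len(fields) >= 2 and fields[1] == "<unk>":
                fields[0] = f"{new_logprob:.6f}"
                newline = "\n" if line.endswith("\n") else ""
                line = "\t".join(fields) + newline
        out.append(line)
    return out


# Tiny hand-written ARPA fragment used only to exercise the function.
example_arpa = [
    "\\data\\\n",
    "ngram 1=2\n",
    "ngram 2=1\n",
    "\n",
    "\\1-grams:\n",
    "-2.5\t<unk>\t0\n",
    "-0.5\thello\t-0.3\n",
    "\n",
    "\\2-grams:\n",
    "-0.2\thello <unk>\n",
    "\n",
    "\\end\\\n",
]
patched_arpa = set_unk_logprob(example_arpa, -1.0)
```

Raising one entry without lowering the others means the unigram probabilities no longer sum to one, which is part of why this is a kludge. The patched ARPA file still has to be rebuilt into whatever binary format the decoder loads.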

Or try byte-pair encoding like the cool kids.