Closed ghost closed 6 years ago
@kpu any help on this issue?
The OOV probability is lower than that of all seen words, which makes logical sense. If your task requires something other than that, you should really add an OOV counter feature and tune the weight.
As a kludge, you can add `--interpolate_unigrams 0` to steal some mass from words and give it to OOV. Or edit the ARPA file and increase the log probability of `<unk>`.
Or try byte-pair encoding like the cool kids.
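The ARPA edit suggested above can be scripted instead of done by hand. Here is a minimal sketch: it copies an ARPA file, replacing the log10 probability on the `<unk>` unigram line. The function name, file paths, and the boost value `-1.5` are illustrative, not anything KenLM ships.

```python
def boost_unk(arpa_in, arpa_out, new_logprob=-1.5):
    """Rewrite an ARPA file, setting the <unk> unigram's log10 probability."""
    with open(arpa_in) as fin, open(arpa_out, "w") as fout:
        for line in fin:
            fields = line.rstrip("\n").split("\t")
            # ARPA unigram entries are: logprob \t word [\t backoff]
            if len(fields) >= 2 and fields[1] == "<unk>":
                fields[0] = str(new_logprob)
                line = "\t".join(fields) + "\n"
            fout.write(line)
```

Pick `new_logprob` by comparing against the unigram scores of your rarest in-vocabulary words; remember these are log10 values, so raising `<unk>` from, say, -6 to -1.5 is a large boost and will make the model prefer OOV hypotheses much more often.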
I am working on an ASR problem that requires new words to be adapted on the fly. I am able to add new words and re-compile the .lm file, but the model is still not able to fix the wrong words. I guess the probability of the OOV word [or of the newly added words] is too low and needs to be re-estimated. The same is the case for proper nouns. Any help?