kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Don't output <unk>? #234

Closed by PCerles 5 years ago

PCerles commented 5 years ago

Is there any way to force KenLM not to output <unk> as a unigram?

kpu commented 5 years ago

Not in the current code. If you want all the mass on words, edit lm/builder/interpolate.cc to change the vocabulary size on line 167, then edit the printer to skip the <unk> line in the output.
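If patching the printer is not convenient, a rough alternative is to post-process the ARPA file after it is written. This is not the edit described above, only a sketch of the "skip the unknown line" half: it drops the <unk> entry from the \1-grams: section and decrements the `ngram 1=` count in the header. It does not move any probability mass onto words (that still needs the vocabulary-size change in interpolate.cc), and it leaves higher-order n-grams untouched. The function name `strip_unk_unigram` is hypothetical, and it assumes a plain-text ARPA file with whitespace-separated fields.

```python
def strip_unk_unigram(arpa_text, unk="<unk>"):
    """Return ARPA text with the <unk> unigram line removed.

    Assumption: standard ARPA layout (\\data\\ header, then \\N-grams:
    sections, each entry being "logprob word(s) [backoff]").
    Probabilities are NOT renormalized here.
    """
    out = []
    section = None      # most recent section marker, e.g. "\\1-grams:"
    count_idx = None    # index in `out` of the "ngram 1=..." header line
    removed = 0
    for line in arpa_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("\\"):
            # Section markers: \data\, \1-grams:, \2-grams:, ..., \end\
            section = stripped
        elif section == "\\data\\" and stripped.startswith("ngram 1="):
            count_idx = len(out)
        elif section == "\\1-grams:" and stripped:
            fields = stripped.split()
            # fields[0] is the log10 probability, fields[1] is the word.
            if len(fields) >= 2 and fields[1] == unk:
                removed += 1
                continue  # skip the <unk> line
        out.append(line)
    if removed and count_idx is not None:
        # Keep the header count consistent with the entries that remain.
        n = int(out[count_idx].split("=", 1)[1])
        out[count_idx] = f"ngram 1={n - removed}"
    return "\n".join(out) + "\n"
```

Usage would be along the lines of reading the file produced by the builder, passing its contents through `strip_unk_unigram`, and writing the result back out. Whatever loads the file afterwards may warn about, or substitute a default for, the missing <unk> entry, so check that behavior before relying on the output.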

PCerles commented 5 years ago

Great, thank you!