kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

How does KenLM work on subword/wordpiece text? Would you suggest a command line? Thanks. #219

Closed bjtommychen closed 5 years ago

kpu commented 5 years ago

--discount_fallback should be all you need to deal with finite vocabulary. It will still generate <unk> though.
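For reference, a minimal sketch of what such an invocation might look like, written as a small Python helper that only assembles the argument list and prints it. The order of 5, the file names, and the helper name `build_lmplz_command` are illustrative assumptions, not something stated in this thread; `lmplz` is KenLM's estimation binary and is assumed to be on the PATH if the command is actually run.

```python
import shlex

def build_lmplz_command(order, text_path, arpa_path):
    """Assemble an lmplz argument list (hypothetical helper; runs nothing)."""
    return [
        "lmplz",
        "-o", str(order),        # n-gram order (5 is an assumed example value)
        "--discount_fallback",   # the flag suggested above for a finite vocabulary
        "--text", text_path,     # whitespace-tokenized training text (assumed file name)
        "--arpa", arpa_path,     # output ARPA file (assumed file name)
    ]

cmd = build_lmplz_command(5, "corpus.subword.txt", "subword.arpa")
print(shlex.join(cmd))
# lmplz -o 5 --discount_fallback --text corpus.subword.txt --arpa subword.arpa
```

To actually train, the list could be passed to `subprocess.run(cmd, check=True)`, assuming the training text has already been segmented into subwords.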

bjtommychen commented 5 years ago

Thanks. One more question: the subword system will convert 'hello world' into '_he llo _w or ld'. Are the KenLM tools still suitable for this? And for subwords, will KenLM treat the text as character-based or not? Thanks.

kpu commented 5 years ago

The tokens are anything separated by whitespace. Where you choose to put whitespace is your choice / problem.
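To illustrate the point, here is a minimal sketch in plain Python (no KenLM calls): the same text yields different tokens depending on where the whitespace is placed. The helper name `whitespace_tokens` is hypothetical, and `str.split()` is used only as an approximation of whitespace separation.

```python
def whitespace_tokens(line):
    """Split a line on runs of whitespace (illustrative stand-in only)."""
    return line.split()

# Subword-segmented text: each piece between spaces is one token.
subword_line = "_he llo _w or ld"
print(whitespace_tokens(subword_line))
# ['_he', 'llo', '_w', 'or', 'ld']

# The same text spaced per character: now every character is a token.
char_line = " ".join("hello world".replace(" ", "_"))
print(whitespace_tokens(char_line))
# ['h', 'e', 'l', 'l', 'o', '_', 'w', 'o', 'r', 'l', 'd']
```

With the segmentation from the question, each subword piece counts as one token, so an order-n model spans n pieces. The model would be character-based only if the text were spaced per character.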