kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

format of the input text #114

Closed acc8518 closed 6 years ago

acc8518 commented 6 years ago

Hello! Thank you for providing the tool.

I successfully ran the following command:

bin/lmplz -o 5 <text >text.arpa

I am curious whether the input text has to follow a particular format. For example, should there be one sentence per line, and should each sentence end with a certain symbol such as '.'? I am afraid that I did not follow the expected format and thus obtained a bad language model.

kpu commented 6 years ago

You should probably tokenize the text beforehand, for example with the Moses tokenizer: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl. A sentence only has to end with a period to the extent that sentences naturally end with a period.
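
To make the advice above concrete, here is a minimal sketch of the preprocessing step. It assumes that lmplz reads one sentence per line with tokens separated by whitespace. The regex tokenizer is only a crude stand-in for a real tokenizer such as the Moses script linked above, and the names `tokenize_line` and `prepare_corpus` are hypothetical, not part of KenLM.

```python
import re
import sys

# Crude stand-in for a real tokenizer (e.g. Moses tokenizer.perl):
# it only separates punctuation marks from runs of word characters.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize_line(line: str) -> str:
    """Return the line with its tokens joined by single spaces."""
    return " ".join(_TOKEN_RE.findall(line))


def prepare_corpus(lines):
    """Yield one tokenized sentence per input line, skipping blank lines."""
    for line in lines:
        tokens = tokenize_line(line.strip())
        if tokens:
            yield tokens


if __name__ == "__main__":
    # Assumption: the input already has one sentence per line.
    for sentence in prepare_corpus(sys.stdin):
        print(sentence)
```

Saved under a hypothetical name such as prepare.py, this could be run as `python prepare.py <text >text.tok`, and the result passed to `bin/lmplz -o 5 <text.tok >text.arpa`. No end-of-sentence symbol is added: a period appears only where the original sentence already had one.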