kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Arbitrary token boundaries #285

Open gkucsko opened 4 years ago

gkucsko commented 4 years ago

Is there a way to do n-gram estimation with custom token separation? The idea would be to get the following behavior:

Hi, this is a sentence. -> [Hi] [,] [this] [is] [a] [sentence] [.]
My email is frodo@shire.com. -> [My] [email] [is] [frodo] [@] [shire] [.] [com] [.]

Another option could be to treat certain characters such as '.', ',' or '@' as additional whitespace characters (maybe through the --skip_symbols flag?) to get an n-gram estimate as if those characters were whitespace. Is there more documentation on that flag, or am I misunderstanding its use?

kpu commented 4 years ago

That would require editing the code here:

https://github.com/kpu/kenlm/blob/7af246801e05b5f3b9d2f6a34a820f8d9379f41a/lm/builder/corpus_count.cc#L242

It could probably be made into a command-line option.
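
A possible workaround that avoids touching corpus_count.cc is to pre-tokenize the corpus so that the desired boundaries are already plain spaces by the time lmplz reads it. The sketch below is only an illustration of that idea, not something from the KenLM codebase: the regular expression, the function names and the script name pretokenize.py are all assumptions, and the pattern (word runs versus single punctuation marks) would need adjusting for other tokenization rules.

```python
import re
import sys

# Assumed tokenization rule (not part of KenLM): a token is either a run of
# word characters or one single non-space, non-word character such as , . @
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(line):
    """Split a line into word runs and individual punctuation marks."""
    return TOKEN_RE.findall(line)


def pretokenize(src, dst):
    """Write each line of src to dst with tokens separated by one space."""
    for line in src:
        dst.write(" ".join(tokenize(line)) + "\n")


# Hypothetical command-line use: python pretokenize.py corpus.txt
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as corpus:
        pretokenize(corpus, sys.stdout)
```

Since lmplz reads the corpus from standard input, the output could be piped straight into it, for example `python pretokenize.py corpus.txt | bin/lmplz -o 3 > model.arpa`. Text scored against the resulting model would have to go through the same tokenization for the n-grams to line up.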