kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

How are sentences tokenized? #407

Closed MagedSaeed closed 1 year ago

MagedSaeed commented 1 year ago

Thanks for the great software.

Just a question, so that I can tokenize my text accordingly: how are the sentence markers added internally, as mentioned in the docs? Are they added by splitting on \n?

kpu commented 1 year ago

lmplz and query treat '\n' in the data as a sentence split. A sentence split implicitly conditions on <s> and appends </s>.
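
As a rough illustration of the behaviour described above (a sketch, not KenLM's actual code), the snippet below splits raw text on '\n', treats each line as one sentence of whitespace-separated tokens, and lists the bigram events that result when each sentence is conditioned on <s> and ends with </s>. The helper names `sentences` and `bigram_events` are hypothetical, bigram order is chosen only to keep the output short, and dropping blank lines is a choice of this sketch rather than a statement about how lmplz handles them.

```python
def sentences(text):
    """Split raw text into token lists, one sentence per line.

    Blank lines are dropped here for simplicity; that is an assumption
    of this sketch, not documented KenLM behaviour.
    """
    return [line.split() for line in text.split("\n") if line.strip()]


def bigram_events(tokens):
    """Return (context, word) pairs for one sentence.

    <s> only ever appears as context (it is conditioned on, not predicted),
    while </s> is appended and predicted as the final word.
    """
    padded = ["<s>"] + tokens + ["</s>"]
    return [(padded[i - 1], padded[i]) for i in range(1, len(padded))]


corpus = "the cat sat\nit purred\n"
for tokens in sentences(corpus):
    print(bigram_events(tokens))
```

In practice this means the training and query text should contain one sentence per line, already tokenized with spaces, and without the sentence markers written out by hand.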

MagedSaeed commented 1 year ago

Thanks for your reply and clarification @kpu