kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

tokenizing a different language #204

Closed isaacleeai closed 5 years ago

isaacleeai commented 5 years ago

I have seen other issues where you have suggested using the Moses tokenizer. Unfortunately, I am tokenizing Korean, so Moses won't work. Could you post an example of the input text file for bin/lmplz -o 5 <text >text.arpa?

Btw, I have seen your corpus formatting notes, and this is a set of rules I have now.

  1. Words are delimited by one of '\0', '\t', '\r', ' '.
  2. Lines are delimited by '\n'.
  3. Use UTF-8 encoding.
  4. Remove the reserved symbols <s>, </s>, and <unk> (these are added internally), or turn on --skip_symbols.

An example of such an input file in English would help tremendously! Thanks
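The four rules above can be sketched as a per-line filter, for example in Python. This is a hypothetical sketch (the function name and structure are mine, not part of KenLM), assuming the rule list above is the full specification:

```python
# Reserved symbols that lmplz adds internally; they must not appear in the
# training text (alternatively, run lmplz with --skip_symbols).
SYMBOLS = ("<s>", "</s>", "<unk>")

def normalize_line(line: str) -> str:
    # Rule 1: '\0', '\t', and '\r' delimit words just like a space does.
    for ch in ("\0", "\t", "\r"):
        line = line.replace(ch, " ")
    # Rule 4: drop the reserved symbols.
    tokens = [t for t in line.split() if t not in SYMBOLS]
    # Rule 2 is satisfied by keeping one sentence per output line;
    # rule 3 by writing the result out as UTF-8.
    return " ".join(tokens)
```

Feeding each line of the corpus through a filter like this (and writing the output in UTF-8, one sentence per line) should produce input in the shape lmplz expects.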

kpu commented 5 years ago

https://www.elastic.co/blog/using-korean-analyzers

Example input:

this is a sentence .
here is another sentence .
tokenizers split , off for you !

isaacleeai commented 5 years ago

Would this be correct? Along with the four rules mentioned above, these are additional formatting rules for the input to bin/lmplz:

  1. Special characters are allowed, and there must be a space around a special character when there is text on both sides.
  2. Each sentence MUST end with one of ., !, or ?

I am still not sure:

  1. Does bin/lmplz tokenize my input for me? On your website, you say tokenization and preprocessing should be done beforehand, but the example above is not tokenized, right? Since punctuation marks are treated as word boundaries, and there are punctuation marks in the example you gave.

  2. Is it a MUST (or meaningful in any way) to have punctuation at the end of a sentence, or does bin/lmplz not care because it removes it anyway?

  3. The link you gave me seems to describe a specific way to tokenize Korean. So if bin/lmplz already has a tokenizer, how do I replace it with a different one? (If it has no tokenizer, this would be much easier, but judging from the example you gave, which is not tokenized, it seems the binary does the tokenization internally.)

Thanks!

kpu commented 5 years ago

lmplz is not a tokenizer. You tokenize it.
A sentence is a line. A space delimits tokens. That's all I care about.
The example I gave is tokenized.
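In other words, the example is already tokenized: punctuation has been split off into its own space-separated token. A minimal sketch of what such a tokenizer does, using a plain regex as a stand-in (this is not Moses or any particular Korean analyzer, just an illustration):

```python
import re

def simple_tokenize(sentence: str) -> str:
    # Put spaces around sentence-final punctuation and commas,
    # then collapse runs of whitespace to single spaces.
    spaced = re.sub(r"([.,!?])", r" \1 ", sentence)
    return " ".join(spaced.split())
```

Applied to "this is a sentence." this yields "this is a sentence ." (one sentence per line, tokens separated by single spaces), matching the example input above. For Korean, a morphological analyzer such as those in the linked article would play this role instead.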

Khanifsaleh commented 3 years ago

lmplz is not a tokenizer. You tokenize it. A sentence is a line. A space delimits tokens. That's all I care about. The example I gave is tokenized.

For some reason, I need a different delimiter. Can I replace the space (as the token delimiter) with another character, e.g. "|"?
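Per the reply above, lmplz only treats the space as the token delimiter, so one workaround is to rewrite a "|"-delimited corpus into space-delimited form before training. A minimal sketch, assuming "|" never occurs inside a token:

```python
def pipe_to_space(line: str) -> str:
    """Turn a '|'-delimited line into the space-delimited form lmplz expects."""
    return " ".join(tok for tok in line.split("|") if tok)
```

The same mapping would then have to be applied consistently at query time, so that the tokens seen by the trained model match those in the training data.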