kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

tokenizing a different language #204

Closed isaacleeai closed 5 years ago

isaacleeai commented 5 years ago

I have seen other issues where you have suggested using the Moses tokenizer. Unfortunately, I am tokenizing Korean, so Moses won't work. Could you post an example of the input text file for bin/lmplz -o 5 <text >text.arpa?

Btw, I have seen your corpus formatting notes, and this is a set of rules I have now.

  1. Words are delimited by one of '\0', '\t', '\r', ' '.
  2. Lines are delimited by '\n'.
  3. Use UTF-8 encoding.
  4. Remove the reserved symbols <s>, </s>, and <unk> (these are added internally), or turn on --skip_symbols.

An example of such an input file in English would help tremendously! Thanks
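The four rules above can be sketched as a per-line filter, for example in Python. This is a hypothetical sketch (the function name and structure are mine, not part of KenLM), assuming the rule list above is the full specification:

```python
# Reserved symbols that lmplz adds internally; they must not appear in the
# training text (alternatively, run lmplz with --skip_symbols).
SYMBOLS = ("<s>", "</s>", "<unk>")

def normalize_line(line: str) -> str:
    # Rule 1: '\0', '\t', and '\r' delimit words just like a space does.
    for ch in ("\0", "\t", "\r"):
        line = line.replace(ch, " ")
    # Rule 4: drop the reserved symbols.
    tokens = [t for t in line.split() if t not in SYMBOLS]
    # Rule 2 is satisfied by keeping one sentence per output line;
    # rule 3 by writing the result out as UTF-8.
    return " ".join(tokens)
```

Feeding each line of the corpus through a filter like this (and writing the output in UTF-8, one sentence per line) should produce input in the shape lmplz expects.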

kpu commented 5 years ago

https://www.elastic.co/blog/using-korean-analyzers

Example input:

this is a sentence .
here is another sentence .
tokenizers split , off for you !

isaacleeai commented 5 years ago

Would this be correct? Along with the four rules mentioned above, these are additional formatting rules for the input to bin/lmplz:

  1. Special characters are allowed, and there must be a space around a special character when there is text on both sides.
  2. Each sentence MUST end with one of ., !, or ?

I am still not sure:

  1. Does bin/lmplz tokenize my input for me? On your website, you say tokenization and preprocessing should be done beforehand, but the example above is not tokenized, right? Since punctuation marks are treated as word boundaries, and there are punctuation marks in the example you gave.

  2. Is it a MUST (or meaningful in any way) to have punctuation at the end of a sentence, or does bin/lmplz not care because it removes it anyway?

  3. The link you gave me seems to describe a specific way to tokenize Korean. So if bin/lmplz already has a tokenizer, how do I replace it with a different one? (If it has no tokenizer, this would be much easier, but judging from the example you gave, which is not tokenized, it seems the binary does the tokenization internally.)

Thanks!

kpu commented 5 years ago

lmplz is not a tokenizer. You tokenize it.
A sentence is a line. A space delimits tokens. That's all I care about.
The example I gave is tokenized.
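In other words, the example is already tokenized: punctuation has been split off into its own space-separated token. A minimal sketch of what such a tokenizer does, using a plain regex as a stand-in (this is not Moses or any particular Korean analyzer, just an illustration):

```python
import re

def simple_tokenize(sentence: str) -> str:
    # Put spaces around sentence-final punctuation and commas,
    # then collapse runs of whitespace to single spaces.
    spaced = re.sub(r"([.,!?])", r" \1 ", sentence)
    return " ".join(spaced.split())
```

Applied to "this is a sentence." this yields "this is a sentence ." (one sentence per line, tokens separated by single spaces), matching the example input above. For Korean, a morphological analyzer such as those in the linked article would play this role instead.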

Khanifsaleh commented 3 years ago

lmplz is not a tokenizer. You tokenize it. A sentence is a line. A space delimits tokens. That's all I care about. The example I gave is tokenized.

For some reason, I need a different delimiter. Can I replace the space (as the token delimiter) with another character, e.g. "|"?
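Per the reply above, lmplz only treats the space as the token delimiter, so one workaround is to rewrite a "|"-delimited corpus into space-delimited form before training. A minimal sketch, assuming "|" never occurs inside a token:

```python
def pipe_to_space(line: str) -> str:
    """Turn a '|'-delimited line into the space-delimited form lmplz expects."""
    return " ".join(tok for tok in line.split("|") if tok)
```

The same mapping would then have to be applied consistently at query time, so that the tokens seen by the trained model match those in the training data.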