Closed isaacleeai closed 5 years ago
https://www.elastic.co/blog/using-korean-analyzers
Example input:
this is a sentence .
here is another sentence .
tokenizers split , off for you !
Would this be correct?
Along with the 4 rules mentioned above, these are additional formatting rules for the input to bin/lmplz: the punctuation marks `.`, `,`, `!`, and `?` are split off from adjacent words as separate space-delimited tokens.
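To make the rule concrete, here is a minimal sketch of splitting those punctuation marks off as separate tokens. This is a hypothetical helper for illustration, not part of KenLM or any tokenizer the maintainer recommends:

```python
import re

def tokenize_line(line):
    # Split the punctuation marks . , ! ? off as their own
    # space-delimited tokens, then split on whitespace.
    # Hypothetical helper for illustration only.
    return re.sub(r"([.,!?])", r" \1 ", line).split()

print(" ".join(tokenize_line("Here is another sentence.")))
# -> "Here is another sentence ."
```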
I am still not sure:
1. Does bin/lmplz tokenize my input for me? On your website you say "tokenization and preprocessing" should be done beforehand, but the example above is not tokenized (right? since punctuation marks are treated as word boundaries, and there are punctuation marks in the example you gave).
2. Is it a MUST (or meaningful in any way) to have punctuation at the end of each sentence, or does bin/lmplz not care because it removes them anyway?
3. The link you gave me seems to describe a specific way to tokenize Korean. So if bin/lmplz already has a tokenizer, how should I replace it with another tokenizer? (If it has no tokenizer, this would be much easier, but inferring from your example, which is not tokenized, it seems the binary does the tokenization internally.)
Thanks!
lmplz is not a tokenizer. You tokenize it.
A sentence is a line. A space delimits tokens. That's all I care about.
The example I gave is tokenized.
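Given that answer (one sentence per line, tokens separated by spaces), a minimal sketch of writing already-tokenized sentences into a file lmplz can read might look like this; the file name `text` matches the command discussed in this thread, and the sentence data is just the example input from above:

```python
# Write pre-tokenized sentences in the format lmplz expects:
# one sentence per line, tokens separated by single spaces, UTF-8.
sentences = [
    ["this", "is", "a", "sentence", "."],
    ["here", "is", "another", "sentence", "."],
]

with open("text", "w", encoding="utf-8") as f:
    for tokens in sentences:
        f.write(" ".join(tokens) + "\n")
```

The resulting file can then be fed to `bin/lmplz -o 5 <text >text.arpa`.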
For some reason, can I change the space (the token delimiter) to another character, e.g. `|`?
I have seen other issues where you suggested using the Moses tokenizer. Unfortunately, I am tokenizing Korean, so Moses won't work. Could you post an example of the input `text` file in `bin/lmplz -o 5 <text >text.arpa`? Btw, I have seen your corpus formatting notes, and this is the set of rules I have now:
- Tokens are delimited by any of `'\0'`, `'\t'`, `'\r'`, `' '`
- Sentences are delimited by `'\n'`
- The encoding is UTF-8
- Do not include `<s>`, `</s>`, or `<unk>`, because these are added internally; alternatively, turn on `--skip_symbols`
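The delimiter and encoding rules above can be sketched as a small normalization pass over raw lines. This is a hypothetical helper under the assumptions stated in the list (it is not part of KenLM): it replaces the in-sentence delimiter bytes with spaces and collapses whitespace runs, leaving `'\n'` as the sentence delimiter:

```python
def normalize_for_lmplz(raw_line):
    # Replace the token-delimiter bytes '\0', '\t', '\r' with spaces,
    # then collapse runs of whitespace to single spaces.
    # Hypothetical helper, not part of KenLM.
    for ch in ("\0", "\t", "\r"):
        raw_line = raw_line.replace(ch, " ")
    return " ".join(raw_line.split())

print(normalize_for_lmplz("a\tb\rc"))
# -> "a b c"
```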
An example of such an input file in English would help tremendously! Thanks