sbmaruf closed this issue 5 years ago
If a word is defined as a space-delimited token, the default behavior is "apply on word": sentencepiece will not produce tokens that cross a word boundary. This mode can be disabled with --split_by_whitespace=false.
By the way, I personally do not want to use the term "word", as its definition is not clear in CJK. In CJK, we have to run a word segmenter to produce space-delimited tokens like in English.
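To illustrate the "apply on word" default described above, here is a toy sketch of the pair counting that a BPE merge step relies on. It is not sentencepiece's actual implementation: the function `count_pairs` and its `split_by_whitespace` argument are hypothetical names chosen to mirror the flag, and the only borrowed convention is marking whitespace with the meta symbol "▁" (U+2581).

```python
from collections import Counter

SPACE = "\u2581"  # meta symbol used here to mark a word boundary


def count_pairs(text, split_by_whitespace=True):
    """Count adjacent symbol pairs (toy illustration, not sentencepiece code)."""
    # Mark the start of every space-delimited token with the meta symbol.
    marked = SPACE + text.replace(" ", SPACE)
    if split_by_whitespace:
        # One piece per word: pairs are counted inside each word only.
        pieces = [SPACE + w for w in marked.split(SPACE) if w]
    else:
        # One piece for the whole line: pairs may span two words.
        pieces = [marked]
    pairs = Counter()
    for piece in pieces:
        pairs.update(zip(piece, piece[1:]))
    return pairs


def crossing_pairs(pairs):
    """Pairs whose second symbol is the boundary marker would merge across words."""
    return {p: n for p, n in pairs.items() if p[1] == SPACE}


if __name__ == "__main__":
    line = "new york new york"
    print(crossing_pairs(count_pairs(line)))
    print(crossing_pairs(count_pairs(line, split_by_whitespace=False)))
```

With word-bounded counting, no candidate pair ends in the boundary marker, so no merge can join the end of one word to the start of the next. Counting over the whole line exposes such pairs, which is the situation --split_by_whitespace=false allows.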
Hey guys, can somebody tell me the definition of "sentence" used in sentencepiece?
There are different ways byte pair encoding can be applied: "Apply on stream" or "Apply on sentence"? @taku910
Actually, I want to achieve word-based BPE tokenization. If sentencepiece works like "Apply on sentence", then I guess I can achieve "Apply on word" by putting one word per line in the input text file and computing the BPE on that. Is that so? @taku910