google / sentencepiece

Unsupervised text tokenizer for Neural Network-based text generation.
Apache License 2.0

Query about sentence vs word tokenization #291

Closed sbmaruf closed 5 years ago

sbmaruf commented 5 years ago

There are different ways byte pair encoding can be applied.

  1. Apply on stream: Apply BPE to the whole text stream. In this case space ' ' and newline '\n' are treated as regular characters.
  2. Apply on sentence: Compute BPE sentence by sentence. In this case, split the stream on '\n' and apply the BPE operation to each sentence.
  3. Apply on word: First tokenize the dataset on space ' ' and newline '\n', then compute BPE over the words.

What does sentencepiece do, "apply on stream" or "apply on sentence"? @taku910

Actually, I want to achieve word-based BPE tokenization. If sentencepiece works like "apply on sentence", I guess I can achieve "apply on word" by putting one word per line in the input text file and computing BPE on that. Is that so? @taku910

taku910 commented 5 years ago

If the definition of word is the space-delimited token, the default behavior is "apply on word": sentencepiece will not produce tokens that cross word boundaries. This mode can be disabled with --split_by_whitespace=false.

By the way, I personally prefer not to use the term "word", as its definition is not clear in CJK. In CJK, we have to run a word segmenter to produce space-delimited tokens like in English.
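
To make the three modes from the question concrete, here is a toy BPE merge learner in pure Python. It is only a sketch for illustration and is not how sentencepiece is implemented (among other things, it ignores the "▁" whitespace marker that sentencepiece uses). The names `split_units`, `merge_pair`, and `learn_merges` are made up for this example. The point it shows is that the choice of unit decides which symbols a merge is allowed to span.

```python
from collections import Counter


def split_units(text, mode):
    """Split raw text into the units inside which symbol pairs may be merged."""
    if mode == "stream":
        return [text]            # one unit; ' ' and '\n' are ordinary symbols
    if mode == "sentence":
        return text.split("\n")  # one unit per line; ' ' is an ordinary symbol
    if mode == "word":
        return text.split()      # one unit per whitespace-delimited token
    raise ValueError(f"unknown mode: {mode}")


def merge_pair(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_merges(text, mode, num_merges):
    """Greedily learn up to `num_merges` merges; pairs never span two units."""
    units = [list(u) for u in split_units(text, mode) if u]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in units:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        units = [merge_pair(symbols, best) for symbols in units]
    return merges


if __name__ == "__main__":
    sample = "ab ab\nab ab\n"
    for mode in ("stream", "sentence", "word"):
        print(mode, learn_merges(sample, mode, 10))
```

In "word" mode no learned merge can contain whitespace, which corresponds to the default behavior described above. In "sentence" mode merges may contain a space but never a newline, and in "stream" mode they may contain both.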

mani-rai commented 2 years ago

Hey guys, can somebody tell me the definition of "sentence" used in sentencepiece?

  1. Is it a literal sentence as in English, delimited by a period, question mark, etc.?
  2. Or is it a piece of text delimited by a line break?