glample / fastBPE

Fast BPE
MIT License

Query about sentence vs word tokenization #8

Closed sbmaruf closed 5 years ago

sbmaruf commented 5 years ago

There are different ways byte pair encoding could be applied.

  1. Apply on stream: apply BPE to the whole text stream, treating space ' ' and newline '\n' as regular characters.
  2. Apply on sentence: compute BPE sentence by sentence, i.e. split the stream on '\n' and apply the BPE operations per sentence.
  3. Apply on word: first tokenize the dataset on space ' ' and newline '\n', then compute BPE over the individual words.

If I want to achieve "apply on word", is it possible with this code? @glample
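The three pre-segmentation strategies above can be sketched in a few lines; this is an illustration of how the input is split before any BPE merges happen, not part of fastBPE itself:

```python
# Illustrative text; names below are assumptions, not fastBPE API.
text = "the cat sat\nthe dog ran"

# 1. Apply on stream: one sequence; ' ' and '\n' are ordinary symbols.
stream_units = [text]

# 2. Apply on sentence: split on '\n'; each sentence is one BPE unit.
sentence_units = text.split("\n")

# 3. Apply on word: split on whitespace; each word is one BPE unit.
word_units = text.split()

print(stream_units)    # ['the cat sat\nthe dog ran']
print(sentence_units)  # ['the cat sat', 'the dog ran']
print(word_units)      # ['the', 'cat', 'sat', 'the', 'dog', 'ran']
```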
glample commented 5 years ago

I have only seen "apply on sentence" and "apply on word" in practice. "Apply on word" is the standard method as far as I know, and it is what this repository implements. fastBPE expects the input to be already tokenized (using the Moses tools or something equivalent).
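To make the "apply on word" idea concrete, here is a minimal toy sketch: the input is pre-tokenized, and learned merges are applied inside each word independently. The merge list and function name are illustrative assumptions; fastBPE's actual C++ implementation is more efficient and applies merges strictly by learned priority.

```python
def apply_bpe_word(word, merges):
    # Toy sketch, NOT fastBPE's algorithm: start from characters and
    # repeatedly apply merges (listed in priority order) until stable.
    symbols = list(word)
    changed = True
    while changed:
        changed = False
        for a, b in merges:
            i = 0
            while i < len(symbols) - 1:
                if symbols[i] == a and symbols[i + 1] == b:
                    symbols[i:i + 2] = [a + b]
                    changed = True
                else:
                    i += 1
    return symbols

# Hypothetical merges, as if learned from a corpus of "low", "lower", ...
merges = [("l", "o"), ("lo", "w")]
sentence = "low lower".split()  # pre-tokenized input, as fastBPE expects
print([apply_bpe_word(w, merges) for w in sentence])
# → [['low'], ['low', 'e', 'r']]
```

Because each word is processed on its own, no merge can ever cross a space, which is exactly what distinguishes "apply on word" from the other two strategies.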

Note that this code could also be used for "apply on sentence": as a hack, replace the spaces ' ' with some rare symbol that does not appear in your dataset, so that fastBPE believes each sentence has no spaces and is composed of a single word.
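The space-replacement hack could look like the sketch below. The placeholder symbol, function names, and the assumption that subword boundaries are marked with the common "@@ " continuation convention are all illustrative; you would need to check them against your own data and fastBPE output format:

```python
# Hypothetical pre/post-processing around fastBPE for "apply on sentence".
PLACEHOLDER = "\u2581"  # must not occur anywhere in the dataset

def encode_sentence(sentence):
    # Hide real spaces so fastBPE sees the sentence as one "word".
    return sentence.replace(" ", PLACEHOLDER)

def decode_tokens(bpe_output):
    # Undo the subword segmentation (assuming "@@ " continuation
    # markers), then restore the original spaces from the placeholder.
    return bpe_output.replace(" ", "").replace("@@", "").replace(PLACEHOLDER, " ")

enc = encode_sentence("the cat sat")
print(enc)                                    # the▁cat▁sat
print(decode_tokens("th@@ e▁c@@ at▁sat"))     # the cat sat
```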

sbmaruf commented 5 years ago

Thank you for the information @glample