sbmaruf closed this issue 5 years ago
I have only seen "apply on sentence" and "apply on word" in practice. "Apply on word" is the standard method as far as I know, and it is what this repository does. fastBPE expects the input to be tokenized (using the Moses tools or something equivalent).
Note that this code could also be used for "apply on sentence". To do that, you could hack something like replacing each space ' ' with some rare symbol that does not appear in your dataset, so that fastBPE believes each sentence contains no spaces and is composed of a single word.
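A rough sketch of that hack, in case it helps. This is only the pre- and post-processing around fastBPE, not anything from the repository itself. The helper names and the placeholder character `▁` are made up for illustration; pick any symbol that really is absent from your data. The restore step assumes the BPE output marks subword continuations with `@@ `.

```python
# Hypothetical placeholder; any symbol that never occurs in the corpus would do.
PLACEHOLDER = "\u2581"


def sentence_to_pseudo_word(sentence, placeholder=PLACEHOLDER):
    """Join the tokens of a sentence with a rare symbol so that a
    word-level BPE tool sees the whole sentence as one word."""
    if placeholder in sentence:
        raise ValueError("placeholder already occurs in the sentence")
    # split() drops leading/trailing whitespace and collapses repeated spaces
    return placeholder.join(sentence.split())


def pseudo_word_to_sentence(bpe_output, placeholder=PLACEHOLDER):
    """Undo the hack on BPE output: drop the continuation markers
    (assumed to be "@@ ") and turn the placeholder back into spaces."""
    merged = bpe_output.replace("@@ ", "")
    return merged.replace(placeholder, " ")
```

You would run `sentence_to_pseudo_word` on every line before learning and applying the codes, and `pseudo_word_to_sentence` only when you want the plain text back. Note that merges can then cross the original word boundaries, which is the whole point of "apply on sentence".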
Thank you for the information @glample
There are different ways byte pair encoding could be applied.