Closed: SeverinoDaDalt closed this issue 5 months ago.

I am trying to compute how efficient the spm tokenizer I trained is. To do that, I would like to compare the length of its tokenization against the spm pre-tokenization (as a gold standard) on a specific dataset.

Is there an option to use the spm library to do only the pre-tokenization? If there is not, what are the pre-tokenization rules it uses? I found nothing about this in the documentation.
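For reference, here is a minimal sketch of the comparison I have in mind, assuming a trained model my.model and a plain-text dataset data.txt (both file names are placeholders) and using whitespace-separated words as the pre-tokenization baseline:

```python
import sentencepiece as spm

# Placeholder file names: a trained model and a plain-text dataset.
sp = spm.SentencePieceProcessor(model_file="my.model")

n_pieces, n_words = 0, 0
with open("data.txt", encoding="utf-8") as f:
    for line in f:
        # Pieces produced by the trained model vs. whitespace-split words.
        n_pieces += len(sp.encode(line, out_type=str))
        n_words += len(line.split())

print(f"pieces per whitespace-separated word: {n_pieces / n_words:.3f}")
```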
What exactly do you mean by pre-tokenization here? SentencePiece does not have a pre-tokenization step, because it processes the raw text directly.
Sorry, I assumed that SentencePiece does some preprocessing on the sentences and that this was called pre-tokenization.

What I am asking about is the set of rules by which SentencePiece splits words during training (for example, you cannot get a token such as you▁are or ▁other…). I know that it does whitespace splitting and, from a quick analysis of the resulting vocabulary, that it separates sequences of punctuation symbols from sequences of alphanumeric symbols. Are there more of these regexes?
SentencePiece doesn't have a concept of "words". Several languages do not put whitespace between words, and for those languages it is not trivial to define and run a pre-tokenization step. The main motivation of SentencePiece is to get rid of this complicated pre-tokenization step altogether, making tokenization language-independent. SentencePiece doesn't have any language-dependent word tokenization rules, regexes, or patterns.
"you are" is not split because the vocab doesn't contain the token "you▁are" --split_by_whitespace
option of traininer allows to extract token you▁are.
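For example, a minimal training sketch with this option disabled (the corpus file, model prefix, and vocab size below are placeholders):

```python
import sentencepiece as spm

# Placeholder corpus, model prefix, and vocab size.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="multiword",
    vocab_size=8000,
    split_by_whitespace=False,  # allows pieces that span whitespace, e.g. you▁are
)

sp = spm.SentencePieceProcessor(model_file="multiword.model")
print(sp.encode("you are", out_type=str))  # may now contain a single piece "you▁are"
```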
String normalization (e.g. NFKC) is the only preprocessing. Please see the following document to configure normalization: https://github.com/google/sentencepiece/blob/master/doc/normalization.md
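For example, a minimal sketch of selecting the normalization rule at training time (the corpus file, model prefix, and vocab size are placeholders; "identity" disables normalization entirely):

```python
import sentencepiece as spm

# Placeholder corpus, model prefix, and vocab size; the
# normalization_rule_name flag is documented in the link above.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="raw",
    vocab_size=8000,
    normalization_rule_name="identity",  # or "nfkc", "nmt_nfkc" (default), ...
)
```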