Closed: SeverinoDaDalt closed this issue 5 months ago.

I am trying to compute how efficient the spm tokenizer I trained is. To do that, I would like to compare the length of its tokenization against the spm pre-tokenization (as a gold standard) on a specific dataset.

Is there an option to use the spm library to do only the pre-tokenization? If there is not, what are the pre-tokenization rules it uses? I found nothing about this in the documentation.
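For reference, here is a minimal sketch of the comparison I have in mind, assuming a trained model my.model and a plain-text dataset data.txt (both file names are placeholders) and using whitespace-separated words as the pre-tokenization baseline:

```python
import sentencepiece as spm

# Placeholder file names: a trained model and a plain-text dataset.
sp = spm.SentencePieceProcessor(model_file="my.model")

n_pieces, n_words = 0, 0
with open("data.txt", encoding="utf-8") as f:
    for line in f:
        # Pieces produced by the trained model vs. whitespace-split words.
        n_pieces += len(sp.encode(line, out_type=str))
        n_words += len(line.split())

print(f"pieces per whitespace-separated word: {n_pieces / n_words:.3f}")
```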
What exactly do you mean by pre-tokenization here? SentencePiece does not have a pre-tokenization step, because it processes the raw text directly.
Sorry, I assumed that SentencePiece does some preprocessing on the sentences and that this was called pre-tokenization.

What I am asking about is the set of rules by which SentencePiece splits words during training (for example, you cannot get a token such as you▁are or ▁other…). I know that it does whitespace splitting and, from a quick analysis of the resulting vocabulary, that it separates sequences of punctuation symbols from sequences of alphanumeric symbols. Are there more of these regexes?
SentencePiece doesn't have a concept of "words". Several languages do not put whitespace between words, and for those languages it is not trivial to define and run a pre-tokenization step. The main motivation of SentencePiece is to get rid of this complicated pre-tokenization step altogether, making tokenization language-independent. SentencePiece doesn't have any language-dependent word tokenization rules, regexes, or patterns.
"you are" is not split because the vocab doesn't contain the token "you▁are" --split_by_whitespace
option of traininer allows to extract token you▁are.
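For example, a minimal training sketch with this option disabled (the corpus file, model prefix, and vocab size below are placeholders):

```python
import sentencepiece as spm

# Placeholder corpus, model prefix, and vocab size.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="multiword",
    vocab_size=8000,
    split_by_whitespace=False,  # allows pieces that span whitespace, e.g. you▁are
)

sp = spm.SentencePieceProcessor(model_file="multiword.model")
print(sp.encode("you are", out_type=str))  # may now contain a single piece "you▁are"
```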
String normalization (e.g. NFKC) is the only preprocessing. Please see the following document to configure normalization: https://github.com/google/sentencepiece/blob/master/doc/normalization.md
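For example, a minimal sketch of selecting the normalization rule at training time (the corpus file, model prefix, and vocab size are placeholders; "identity" disables normalization entirely):

```python
import sentencepiece as spm

# Placeholder corpus, model prefix, and vocab size; the
# normalization_rule_name flag is documented in the link above.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="raw",
    vocab_size=8000,
    normalization_rule_name="identity",  # or "nfkc", "nmt_nfkc" (default), ...
)
```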