Closed meliksahturker closed 1 month ago
It seems this is related to the issue here.
Wrapping the pattern by tokenizers.Regex fixed the issue.
```python
from tokenizers import Regex

digit_split_pretokenization_pattern = Regex(r"\d")
```
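For context, a minimal end-to-end sketch of the fix (the corpus, vocab size, and special tokens here are illustrative, not from the issue): wrapping the pattern in `tokenizers.Regex` makes `Split` treat it as a regular expression, whereas a plain `str` is matched as a literal substring.

```python
from tokenizers import Tokenizer, Regex
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Split
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Regex(r"\d") matches any digit; a bare "\d" string would only match
# the literal characters "\d", so digits would never be split.
tokenizer.pre_tokenizer = Split(Regex(r"\d"), behavior="isolated")

trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=200)
corpus = ["order 12345 shipped in 2023", "call 555 0199 today"]
tokenizer.train_from_iterator(corpus, trainer)

# Each digit becomes its own pre-token, so BPE training can never
# merge digits together and the learned vocabulary stays digit-clean.
print(tokenizer.encode("2023").tokens)
```

Because the split is applied before training, no merged token in the resulting vocabulary can span more than one digit.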
Nice that you found the fix!
Training a BPE tokenizer from scratch, I am using Split pretokenization. In the example below, I split on each digit so that numbers are represented by the sequences of digits they are made of.
The encode function works as intended: each digit is split. However, this pattern is not applied when building the vocabulary. It still ends up with several tokens spanning different numbers of digits, whereas my intent was for each digit to be a separate token, with no other tokens containing digits.