huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

BPE Split pretokenization rule is not reflected in the vocabulary #1572

Closed: meliksahturker closed this issue 1 month ago

meliksahturker commented 1 month ago

I am training a BPE tokenizer from scratch with the Split pre-tokenizer. In the example below, I split on each digit so that numbers are represented as sequences of their individual digits.

from datasets import load_dataset
from tokenizers import models, pre_tokenizers, trainers, Tokenizer

# Dataset
ds = load_dataset('HuggingFaceFW/fineweb', streaming = True)['train']
texts = [sample['text'] for sample in ds.take(10_000)]
print(len(texts))

# Init Tokenizer
tokenizer = Tokenizer(models.BPE(unk_token="<UNK>", byte_fallback = True))

digit_split_pretokenization_pattern = r'\d'
split_pretokenizer = pre_tokenizers.Split(pattern = digit_split_pretokenization_pattern, behavior = "isolated", invert = False)
tokenizer.pre_tokenizer = split_pretokenizer

# Sentinel tokens
sentinel_tokens = ["<UNK>", "<BOS>", "<EOS>"]

# Digits
digits = [str(num) for num in range(10)]

# Combine
special_tokens = sentinel_tokens + digits
print('Number of Special Tokens:', len(special_tokens))

trainer = trainers.BpeTrainer(
    vocab_size=32_000,
    special_tokens=special_tokens,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    min_frequency=2,
    limit_alphabet=1024,
    max_token_length = 32,
    show_progress = True
)
tokenizer.train_from_iterator(texts, trainer = trainer)

# Encode a test sample
text = "This is a text that involves numbers 123,456 and 789.321"
print(tokenizer.encode(text).tokens)
['This is ', 'a t', 'ext', ' that ', 'involves ', 'numbers ', '1', '2', '3', ',', '4', '5', '6', ' and ', '7', '8', '9', '.', '3', '2', '1']

Encoding works as intended: each digit becomes its own token.

# Check vocabulary
numeric_tokens_in_vocab = [token for token in tokenizer.get_vocab() if any(char.isnumeric() for char in token)]
print(numeric_tokens_in_vocab[:100])
['1996', '$1', '54 ', 'Version 7.', '$5 ', '3D ', ':5', '3 million ', '70 ', '.\n3', '6) ', '37 ', '228', '10-', '2010-', '500', '250 ', '360', '24/', '$200', '11-', '\n4', '12th ', '69', '-1', '2007', '6/', '1930', '11, ', '1:', '1 cup ', 'May 17', '2.0 ', '77', '7) ', '50-', '2013, ', '0, ', '4th ', '02 ', '⅘', '28 ', '0.3 ', '8 million ', '24', '160', '1.5', '18, ', '$2', '$1,', ', 2009 ', '61 ', '7\n', '27', 'z47', '187', '0', 'in 2012', '2), ', '10\n', '1% ', '9th ', '39 ', '3: ', '07/', '.\n4', ',000 ', '4 ', 'at 3', '15th ', '185', '96', '1993', '8, ', 'in 197', '(1)', '01/', '2011', '$17', '/2012', '7 million ', '482 U.S. 304, ', ':43 ', 'at 8', '5-', ':00 p.m', '14/', '130', '200', '15-', '1 or ', '8-', 'May 14', '10', '12) ', '62 ', '1940', ':53 ', '6:', '100']

However, this pattern is not reflected in the vocabulary.

The vocabulary still contains many tokens with varying numbers of digits, whereas my intent was to have each digit as a separate token and no other tokens that contain digits.
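A stricter form of the vocabulary check above can be sketched as follows. It flags only tokens that contain a decimal digit and are not themselves one of the allowed single-digit tokens. It uses `str.isdecimal()` rather than `str.isnumeric()`, since `isnumeric()` is also true for characters such as '⅘', which a `\d` pattern would not be expected to split on. The `find_digit_violations` helper and `toy_vocab` are hypothetical names invented for this illustration; in practice the vocabulary would come from `tokenizer.get_vocab()`.

```python
def find_digit_violations(vocab, allowed):
    """Return tokens that contain a decimal digit but are not in `allowed`."""
    return sorted(
        token
        for token in vocab
        if token not in allowed and any(ch.isdecimal() for ch in token)
    )


# Toy vocabulary for illustration only (not real training output).
toy_vocab = {"<UNK>": 0, "1": 1, "2": 2, "1996": 3, "$1": 4, "⅘": 5, " that ": 6}

# The only digit-bearing tokens we want to allow: "0" through "9".
single_digits = {str(n) for n in range(10)}

violations = find_digit_violations(toy_vocab, single_digits)
print(violations)
```

An empty result would indicate that digits appear in the vocabulary only as standalone tokens.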

meliksahturker commented 1 month ago

It seems this is related to the issue here.

Wrapping the pattern in tokenizers.Regex fixed the issue. A plain string pattern is matched literally rather than interpreted as a regular expression.

from tokenizers import Regex
digit_split_pretokenization_pattern = Regex(r'\d')
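The difference between the two pattern types can be illustrated without the library. The Split documentation states that a pattern must be wrapped in `tokenizers.Regex` to be treated as a regex; otherwise it is considered a string pattern, so `r'\d'` would only match a backslash followed by the letter `d`. The sketch below mimics the "isolated" behavior using Python's `re` module. `split_isolated` is a hypothetical helper written for this illustration, not the tokenizers implementation.

```python
import re


def split_isolated(text, pattern, literal):
    """Split `text` so that every match of `pattern` becomes its own piece."""
    # Escaping the pattern mimics passing a plain str to Split (literal match);
    # leaving it unescaped mimics wrapping it in Regex.
    regex = re.compile(re.escape(pattern) if literal else pattern)
    pieces, last = [], 0
    for match in regex.finditer(text):
        if match.start() > last:
            pieces.append(text[last:match.start()])  # text before the match
        pieces.append(match.group())  # the match itself, isolated
        last = match.end()
    if last < len(text):
        pieces.append(text[last:])  # trailing text after the final match
    return pieces


sample = "numbers 123,456"
as_literal = split_isolated(sample, r"\d", literal=True)
as_regex = split_isolated(sample, r"\d", literal=False)
print(as_literal)
print(as_regex)
```

With the literal pattern the sample stays in one piece, so BPE training is free to merge digits with their neighbors. The encoded output in the original post likely still showed single digits because the digits were registered as special tokens, which are matched before the pre-tokenizer runs.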

ArthurZucker commented 1 month ago

Nice that you found the fix!