huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

BpeTrainer seems to ignore max_token_length=1 #1461

Closed · geajack closed this issue 2 months ago

geajack commented 4 months ago

In the following script, the resulting vocabulary contains tokens of length > 1, even though max_token_length=1 is set.

from tokenizers.trainers import BpeTrainer
from tokenizers import Tokenizer
from tokenizers.models import BPE

trainer = BpeTrainer(max_token_length=1)

tokenizer_spec = Tokenizer(BPE())
tokenizer_spec.train_from_iterator(["hello world"], trainer=trainer)
vocab = tokenizer_spec.get_vocab()
print(vocab)

What I'd expect instead is a vocabulary consisting only of the individual characters in the corpus.
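To make "contains tokens of length > 1" concrete, here is a quick check on the vocab returned by the script above (just a sketch; multi_char_tokens is my own name):

multi_char_tokens = [token for token in vocab if len(token) > 1]
print(multi_char_tokens)  # non-empty: merged tokens appear despite max_token_length=1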

ArthurZucker commented 3 months ago

I can indeed reproduce this. It works for any other value:


In [2]: from tokenizers.trainers import BpeTrainer
   ...: from tokenizers import Tokenizer
   ...: from tokenizers.models import BPE
   ...: 
   ...: trainer = BpeTrainer(max_token_length=64)
   ...: 
   ...: tokenizer_spec = Tokenizer(BPE())
   ...: tokenizer_spec.train_from_iterator(["hello world, orl lorld, corld forld"], trainer=trainer)
   ...: vocab = tokenizer_spec.get_vocab()
   ...: print(vocab)

but I don't think you can go lower than 2. This is most probably an issue with https://github.com/huggingface/tokenizers/blob/d3e80085a8022c4592c7df0f1d542f13a93333f2/tokenizers/src/models/bpe/word.rs#L106
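If you only need single-character tokens in the meantime, one possible workaround (an untested sketch, not a fix for the underlying bug) is to cap vocab_size at the size of the corpus alphabet, so the trainer has no room left to learn merges:

from tokenizers.trainers import BpeTrainer
from tokenizers import Tokenizer
from tokenizers.models import BPE

corpus = ["hello world"]

# The initial alphabet already covers every character in the corpus,
# so with vocab_size == len(alphabet) no merged tokens should be added.
alphabet = set("".join(corpus))
trainer = BpeTrainer(vocab_size=len(alphabet))

tokenizer = Tokenizer(BPE())
tokenizer.train_from_iterator(corpus, trainer=trainer)
print(tokenizer.get_vocab())  # expected: single-character tokens only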

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.