Closed: geajack closed this issue 2 months ago.
I can indeed reproduce. It actually works for any other value:
from tokenizers.trainers import BpeTrainer
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Train a fresh BPE model with a length cap on learned tokens.
trainer = BpeTrainer(max_token_length=64)

tokenizer_spec = Tokenizer(BPE())
tokenizer_spec.train_from_iterator(["hello world, orl lorld, corld forld"], trainer=trainer)
vocab = tokenizer_spec.get_vocab()
print(vocab)
But I don't think you can go lower than 2.
Most probably an issue with https://github.com/huggingface/tokenizers/blob/d3e80085a8022c4592c7df0f1d542f13a93333f2/tokenizers/src/models/bpe/word.rs#L106
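For context, the constraint in question has to be enforced at the point where candidate pairs are merged during training. The following is a rough pure-Python sketch of that logic, an illustration only and not the actual Rust implementation in word.rs:

# Pure-Python illustration of where a max_token_length cap belongs in BPE
# training; a conceptual sketch, not the tokenizers Rust implementation.
from collections import Counter

def train_bpe_vocab(corpus, num_merges, max_token_length):
    # Start from individual characters.
    words = [list(word) for text in corpus for word in text.split()]
    vocab = {ch for word in words for ch in word}
    for _ in range(num_merges):
        pairs = Counter()
        for word in words:
            for a, b in zip(word, word[1:]):
                # The cap must be enforced here: never consider a merge whose
                # result would be longer than max_token_length.
                if len(a) + len(b) <= max_token_length:
                    pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merged = a + b
        vocab.add(merged)
        # Apply the chosen merge to every word.
        merged_words = []
        for word in words:
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged_words.append(out)
        words = merged_words
    return vocab

# With max_token_length=1 every candidate merge is rejected, so the vocabulary
# should stay at single characters; the report is that the real trainer still
# produces longer tokens.
print(train_bpe_vocab(["hello world, orl lorld, corld forld"], 10, 1))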
From the original report: in the following script, the resulting vocabulary contains tokens of length > 1. The expected result would instead be a vocabulary consisting only of the individual characters in the corpus.
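The script itself is not included in this excerpt. A plausible reconstruction, assuming the report was about max_token_length=1 as the discussion above suggests, would be:

# Hypothetical reconstruction of the reported script (the original is not in
# this excerpt); it mirrors the maintainer's example but with max_token_length=1.
from tokenizers.trainers import BpeTrainer
from tokenizers import Tokenizer
from tokenizers.models import BPE

trainer = BpeTrainer(max_token_length=1)

tokenizer = Tokenizer(BPE())
tokenizer.train_from_iterator(["hello world, orl lorld, corld forld"], trainer=trainer)
vocab = tokenizer.get_vocab()

# Expected: an empty list (only single-character tokens);
# reported: tokens longer than one character still appear.
print([token for token in vocab if len(token) > 1])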