Open MilkClouds opened 3 weeks ago
$ pip list | grep tokenizer
tokenizers 0.19.1
$ pip list | grep transformers
sentence-transformers 3.0.0
transformers 4.41.2
Hey! I think most of these can be removed if you set the legacy=False flag when initializing the tokenizer. I'll talk to the M4 team about this. Basically, the legacy normalizer is prepending a space before each token, and before each split! For more details see https://github.com/huggingface/transformers/pull/28881
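To illustrate what that means, here is a toy sketch of the behavior described in that PR (not the actual transformers/tokenizers code): on the legacy path, every segment produced by splitting the input on added tokens gets a space prepended, not just the first one, so added tokens pick up a spurious leading space.

```python
def legacy_normalize(segments):
    # Sketch of the legacy behavior: a space is prepended to every
    # segment, including the ones created by splitting on added tokens.
    return [" " + seg for seg in segments]

def fixed_normalize(segments):
    # With legacy=False, only the very first segment gets the space,
    # so segments after an added token keep their original form.
    return [(" " + seg) if i == 0 else seg for i, seg in enumerate(segments)]

print(legacy_normalize(["Hello", "world"]))  # [' Hello', ' world']
print(fixed_normalize(["Hello", "world"]))   # [' Hello', 'world']
```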
When I'm trying to add some tokens to the vocab, there are 3 issues in Fast-type tokenizers; there's no problem in the Python tokenizer, though.

Source code to recall issue
execution result
Additional Note
If I use the from_slow option to load the Fast tokenizer, there is no problem.

tokenizer = LlamaTokenizerFast.from_pretrained("HuggingFaceM4/idefics2-8b", from_slow=True)