huggingface/tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

[Bug?] Modifying the normalizer of a pretrained tokenizer doesn't consistently work #1552

Open · alvations opened this issue 2 weeks ago

alvations commented 2 weeks ago

I'm not sure if this is a bug or a feature: sometimes modifying the normalizer of a pretrained tokenizer works, but sometimes it doesn't.

For example, it works for "mistralai/Mistral-7B-v0.1" but not for "mistralai/Mistral-7B-v0.3":

from transformers import AutoTokenizer
from tokenizers.normalizers import Sequence, Replace, Prepend

tokenizer_name = "mistralai/Mistral-7B-v0.1"
old_tok = AutoTokenizer.from_pretrained(tokenizer_name)

# The pretrained tokenizer should already ship with a normalizer.
assert old_tok.backend_tokenizer.normalizer is not None

new_normalizer = Sequence(
    [Prepend('▁'), Replace('▁', ' '), Replace("foo", "bar"), Replace('<br>', '\n')]
)

# Swap in the new normalizer, then save and reload to check that it persists.
old_tok.backend_tokenizer.normalizer = new_normalizer
new_tokenizer_name = f"new_tokenizer-{tokenizer_name}"
old_tok.save_pretrained(new_tokenizer_name)

old_tok = AutoTokenizer.from_pretrained(tokenizer_name)
new_tok = AutoTokenizer.from_pretrained(new_tokenizer_name)

[out]:

>>> print(' '.join(old_tok.batch_decode(old_tok("I foo you<br>hello world")['input_ids'])))
<s> I foo you < br > hello world

>>> print(' '.join(new_tok.batch_decode(new_tok("I foo you<br>hello world")['input_ids'])))
<s>  I  bar  you 
 hello  world

The same process above won't work for "mistralai/Mistral-7B-v0.3".
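
A quick way to check whether the assignment survives the save/load round trip is to inspect the serialized tokenizer.json directly. A minimal sketch (the helper name is mine, and it assumes save_pretrained writes a tokenizer.json for fast tokenizers; the "/" in the model id is replaced so the output directory stays flat):

import json
import os

from transformers import AutoTokenizer
from tokenizers.normalizers import Sequence, Replace, Prepend

def normalizer_survives(tokenizer_name):
    # Load, swap in the custom normalizer, save, then inspect the JSON on disk.
    tok = AutoTokenizer.from_pretrained(tokenizer_name)
    tok.backend_tokenizer.normalizer = Sequence(
        [Prepend('▁'), Replace('▁', ' '), Replace("foo", "bar"), Replace('<br>', '\n')]
    )
    out_dir = f"new_tokenizer-{tokenizer_name.replace('/', '-')}"
    tok.save_pretrained(out_dir)
    with open(os.path.join(out_dir, "tokenizer.json")) as f:
        serialized = json.load(f)
    # If the "normalizer" entry is null, the modification was dropped on save.
    return serialized.get("normalizer") is not None

print(normalizer_survives("mistralai/Mistral-7B-v0.1"))  # works in the repro above
print(normalizer_survives("mistralai/Mistral-7B-v0.3"))  # reportedly where it breaks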

alvations commented 2 weeks ago

I think it is indeed a bug. Another user found that indirectly overriding the class works around the normalizer going missing after modification:

https://stackoverflow.com/questions/78612251/how-do-we-add-modify-the-normalizer-in-a-pretrained-huggingface-tokenizer/78624238#78624238
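
For anyone landing here, the workaround in that answer amounts to rebuilding the fast tokenizer around the modified backend instead of saving and reloading the mutated wrapper. A minimal sketch (assuming PreTrainedTokenizerFast's tokenizer_object argument; see the linked answer for the exact fix):

from transformers import AutoTokenizer, PreTrainedTokenizerFast
from tokenizers.normalizers import Sequence, Replace, Prepend

old_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")
backend = old_tok.backend_tokenizer
backend.normalizer = Sequence(
    [Prepend('▁'), Replace('▁', ' '), Replace("foo", "bar"), Replace('<br>', '\n')]
)

# Wrap the modified backend in a fresh fast tokenizer rather than mutating
# old_tok in place; the custom normalizer then travels with the new object.
new_tok = PreTrainedTokenizerFast(tokenizer_object=backend)

Note that a tokenizer built this way doesn't inherit the special-token settings (bos_token, eos_token, etc.) from the original, so those may need to be passed explicitly to the PreTrainedTokenizerFast constructor.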