huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Space after unnormalized token is added when `use_fast=True` for Llama tokenizer #1613

Open Butanium opened 2 months ago

Butanium commented 2 months ago

Related to: https://github.com/huggingface/transformers/issues/25073

In my current project, I'd like to add a special token that doesn't cause a space to be inserted before the next token. Currently, I need to specify use_fast=False for this to work. However:
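A minimal sketch of the discrepancy being described, reconstructed from the follow-up comments below (the Llama-2 checkpoint and the placeholder token name <special> are taken from those comments):

from transformers import AutoTokenizer

tok_name = "meta-llama/llama-2-7b-hf"
for use_fast in (True, False):
    tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=use_fast)
    # add the placeholder token as a special token
    tokenizer.add_tokens(["<special>"], special_tokens=True)
    print(use_fast, tokenizer.tokenize("hello:<special>->"))
# Expected, per the outputs below: the fast tokenizer yields '▁->' after
# '<special>' (an extra prefix space), while the slow one yields '->'.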

Butanium commented 2 months ago

Oh wait @ArthurZucker is that what you're fixing here in https://github.com/huggingface/tokenizers/pull/1568 ?

Butanium commented 2 months ago

Same issue with unnormalized non-special tokens:

from tokenizers import AddedToken
from transformers import AutoTokenizer
tok_name = "meta-llama/llama-2-7b-hf"
fast_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=True)
slow_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=False)
tok = "<special>"
t = AddedToken(tok, normalized=False, special=False)
fast_tokenizer.add_tokens([t])
slow_tokenizer.add_tokens([t])
s = f'hello:{tok}->'
print(f"fast: {fast_tokenizer.tokenize(s)}\nslow: {slow_tokenizer.tokenize(s)}")
>>> fast: ['▁hello', ':', '<special>', '▁->']
>>> slow: ['▁hello', ':', '<special>', '->']
Butanium commented 2 months ago

And there are even more differences when you add normalized=True for special tokens ...

from tokenizers import AddedToken
from transformers import AutoTokenizer
tok_name = "meta-llama/llama-2-7b-hf"
fast_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=True)
slow_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=False)
tok = "<special>"
t = AddedToken(tok, normalized=True, special=True)
fast_tokenizer.add_tokens([t], special_tokens=True)
slow_tokenizer.add_tokens([t], special_tokens=True)
s = f'hello:{tok}->'
print(f"fast: {fast_tokenizer.tokenize(s)}\nslow: {slow_tokenizer.tokenize(s)}")
>>> fast: ['▁hello', ':', '<', 'special', '>', '->']
>>> slow: ['▁hello', ':', '<special>', '->']
Butanium commented 2 months ago

Also, if you specify the add_prefix_space arg, the tokenizer is actually converted from the slow implementation, which leads to different behavior for the code above! https://github.com/huggingface/transformers/blob/9485289f374d4df7e8aa0ca917dc131dcf64ebaf/src/transformers/models/llama/tokenization_llama_fast.py#L154
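A short sketch of what that means in practice; this is hedged on the linked source, where add_prefix_space passed to the Llama fast tokenizer forces from_slow=True internally:

from transformers import AutoTokenizer

tok_name = "meta-llama/llama-2-7b-hf"
# This hits the linked branch in tokenization_llama_fast.py: the "fast"
# tokenizer is silently rebuilt from the slow implementation.
converted_fast = AutoTokenizer.from_pretrained(tok_name, add_prefix_space=True)
default_fast = AutoTokenizer.from_pretrained(tok_name)
# The two can therefore tokenize text around added tokens differently,
# even though both are nominally fast tokenizers.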

ArthurZucker commented 2 months ago

No, this was fixed a LONG time ago!

from tokenizers import AddedToken
from transformers import AutoTokenizer
tok_name = "meta-llama/llama-2-7b-hf"
fast_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=True, legacy=False, from_slow=True)
slow_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=False)
tok = "<special>"
t = AddedToken(tok, normalized=True, special=True)
fast_tokenizer.add_tokens([t], special_tokens=True)
slow_tokenizer.add_tokens([t], special_tokens=True)
s = f'hello:{tok}->'
print(f"fast: {fast_tokenizer.tokenize(s)}\nslow: {slow_tokenizer.tokenize(s)}")
>>> fast: ['▁hello', ':', '<', 'special', '>', '->']
>>> slow: ['▁hello', ':', '<special>', '->']
ArthurZucker commented 2 months ago

See #1357

Butanium commented 2 months ago

Hey @ArthurZucker, thanks for your answer. I'm using 0.19.1, which should have the fix, so I'm really confused right now. Why isn't the fact that use_fast alters the tokenizer's behavior considered an issue? My more practical question is: is there a way to add a token such that:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model, ...) # magic kwargs
# magically add <special>
s = f'a:<special>->'
print(tokenizer.tokenize(s))

will always print [ {whatever}, '<special>', '->'], where the key point here is that the last token is '->' and not '▁->'?

ArthurZucker commented 2 months ago

Yes, what affects this is the legacy flag, as Llama was added before we fixed the issue.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model, legacy=False) # magic kwargs
# magically add <special>
s = f'a:<special>->'
print(tokenizer.tokenize(s))

When you set legacy to False you might not always get the conversion from slow; passing from_slow=True forces that conversion, which makes the legacy attribute actually be taken into account!
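Putting that together, a sketch of a load that should give the desired behavior, assuming the Llama-2 tokenizer from the earlier snippets:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/llama-2-7b-hf",
    legacy=False,    # use the fixed (non-legacy) metaspace handling
    from_slow=True,  # force the conversion so legacy=False is honored
)
tokenizer.add_tokens(["<special>"], special_tokens=True)
print(tokenizer.tokenize("a:<special>->"))
# Expected, based on the slow outputs above:
# ['▁a', ':', '<special>', '->'] (no '▁' prepended to '->')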

Butanium commented 1 month ago

OK, so I should write some unit tests and choose different kwargs depending on the tokenizer to get the same behavior?

ArthurZucker commented 1 month ago

No, sorry. Basically you can just check the tokenizer's pre_tokenizer: if it's a Metaspace, its prepend_scheme should be set to "first" instead of "always".
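A sketch of that check, assuming the tokenizers 0.19 API (backend_tokenizer on the transformers side, and a settable prepend_scheme on the Metaspace pre-tokenizer; the reassignment at the end is defensive):

from tokenizers import pre_tokenizers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/llama-2-7b-hf", use_fast=True)
pre_tok = tokenizer.backend_tokenizer.pre_tokenizer
if isinstance(pre_tok, pre_tokenizers.Metaspace):
    # "always" prepends '▁' to every split (the legacy behavior);
    # "first" only prepends it at the very start of the input.
    if pre_tok.prepend_scheme == "always":
        pre_tok.prepend_scheme = "first"
        # reassign so the change is reflected in the backend tokenizer
        tokenizer.backend_tokenizer.pre_tokenizer = pre_tok
# Note: a legacy-converted Llama tokenizer may not use a Metaspace
# pre-tokenizer at all, in which case this check simply does nothing.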