huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Space after unnormalized token is added when `use_fast=True` for Llama tokenizer #1613

Open Butanium opened 2 months ago

Butanium commented 2 months ago

Related to: https://github.com/huggingface/transformers/issues/25073

In my current project, I'd like to add a special token that doesn't cause a space to be inserted before the next token. Currently, I need to specify use_fast=False for this to work. However:
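A minimal sketch of the discrepancy being described, reconstructed from the follow-up comments below (the Llama-2 checkpoint and the placeholder token name <special> are taken from those comments):

from transformers import AutoTokenizer

tok_name = "meta-llama/llama-2-7b-hf"
for use_fast in (True, False):
    tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=use_fast)
    # add the placeholder token as a special token
    tokenizer.add_tokens(["<special>"], special_tokens=True)
    print(use_fast, tokenizer.tokenize("hello:<special>->"))
# Expected, per the outputs below: the fast tokenizer yields '▁->' after
# '<special>' (an extra prefix space), while the slow one yields '->'.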

Butanium commented 2 months ago

Oh wait @ArthurZucker is that what you're fixing here in https://github.com/huggingface/tokenizers/pull/1568 ?

Butanium commented 2 months ago

Same issue with unnormalized non-special tokens:

from tokenizers import AddedToken
from transformers import AutoTokenizer
tok_name = "meta-llama/llama-2-7b-hf"
fast_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=True)
slow_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=False)
tok = "<special>"
t = AddedToken(tok, normalized=False, special=False)
fast_tokenizer.add_tokens([t])
slow_tokenizer.add_tokens([t])
s = f'hello:{tok}->'
print(f"fast: {fast_tokenizer.tokenize(s)}\nslow: {slow_tokenizer.tokenize(s)}")
>>> fast: ['▁hello', ':', '<special>', '▁->']
>>> slow: ['▁hello', ':', '<special>', '->']
Butanium commented 2 months ago

And there are even more differences when you add normalized=True for special tokens ...

from tokenizers import AddedToken
from transformers import AutoTokenizer
tok_name = "meta-llama/llama-2-7b-hf"
fast_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=True)
slow_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=False)
tok = "<special>"
t = AddedToken(tok, normalized=True, special=True)
fast_tokenizer.add_tokens([t], special_tokens=True)
slow_tokenizer.add_tokens([t], special_tokens=True)
s = f'hello:{tok}->'
print(f"fast: {fast_tokenizer.tokenize(s)}\nslow: {slow_tokenizer.tokenize(s)}")
>>> fast: ['▁hello', ':', '<', 'special', '>', '->']
>>> slow: ['▁hello', ':', '<special>', '->']
Butanium commented 2 months ago

Also, if you specify the add_prefix_space arg, the tokenizer is actually converted from the slow implementation, which leads to different behavior for the code above! https://github.com/huggingface/transformers/blob/9485289f374d4df7e8aa0ca917dc131dcf64ebaf/src/transformers/models/llama/tokenization_llama_fast.py#L154
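A short sketch of what that means in practice; this is hedged on the linked source, where add_prefix_space passed to the Llama fast tokenizer forces from_slow=True internally:

from transformers import AutoTokenizer

tok_name = "meta-llama/llama-2-7b-hf"
# This hits the linked branch in tokenization_llama_fast.py: the "fast"
# tokenizer is silently rebuilt from the slow implementation.
converted_fast = AutoTokenizer.from_pretrained(tok_name, add_prefix_space=True)
default_fast = AutoTokenizer.from_pretrained(tok_name)
# The two can therefore tokenize text around added tokens differently,
# even though both are nominally fast tokenizers.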

ArthurZucker commented 2 months ago

No, this was fixed a LONG time ago!

from tokenizers import AddedToken
from transformers import AutoTokenizer
tok_name = "meta-llama/llama-2-7b-hf"
fast_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=True, legacy=False, from_slow=True)
slow_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=False)
tok = "<special>"
t = AddedToken(tok, normalized=True, special=True)
fast_tokenizer.add_tokens([t], special_tokens=True)
slow_tokenizer.add_tokens([t], special_tokens=True)
s = f'hello:{tok}->'
print(f"fast: {fast_tokenizer.tokenize(s)}\nslow: {slow_tokenizer.tokenize(s)}")
>>> fast: ['▁hello', ':', '<', 'special', '>', '->']
>>> slow: ['▁hello', ':', '<special>', '->']
ArthurZucker commented 2 months ago

See #1357

Butanium commented 2 months ago

Hey @ArthurZucker, thanks for your answer. I'm using 0.19.1, which should have the fix, so I'm really confused right now. Why isn't the fact that use_fast alters the tokenizer's behavior considered an issue? My more practical question is: is there a way to add a token such that:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model, ...) # magic kwargs
# magically add <special>
s = f'a:<special>->'
print(tokenizer.tokenize(s))

will always print [ {whatever}, '<special>', '->'], where the key point here is that the last token is '->' and not '▁->'?

ArthurZucker commented 2 months ago

Yes, what affects this is the legacy flag, as Llama was added before we fixed the issue.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model, legacy=False) # magic kwargs
# magically add <special>
s = f'a:<special>->'
print(tokenizer.tokenize(s))

When you set legacy to False you might not always get the conversion from slow; passing from_slow=True forces that conversion, which makes the legacy attribute actually be taken into account!
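Putting that together, a sketch of a load that should give the desired behavior, assuming the Llama-2 tokenizer from the earlier snippets:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/llama-2-7b-hf",
    legacy=False,    # use the fixed (non-legacy) metaspace handling
    from_slow=True,  # force the conversion so legacy=False is honored
)
tokenizer.add_tokens(["<special>"], special_tokens=True)
print(tokenizer.tokenize("a:<special>->"))
# Expected, based on the slow outputs above:
# ['▁a', ':', '<special>', '->'] (no '▁' prepended to '->')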

Butanium commented 1 month ago

OK, so I should write some unit tests and choose different kwargs depending on the tokenizer to get the same behavior?

ArthurZucker commented 1 month ago

No, sorry. Basically you can just check the tokenizer's pre_tokenizer: if it's a Metaspace, its prepend_scheme should be set to "first" instead of "always".
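A sketch of that check, assuming the tokenizers 0.19 API (backend_tokenizer on the transformers side, and a settable prepend_scheme on the Metaspace pre-tokenizer; the reassignment at the end is defensive):

from tokenizers import pre_tokenizers
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/llama-2-7b-hf", use_fast=True)
pre_tok = tokenizer.backend_tokenizer.pre_tokenizer
if isinstance(pre_tok, pre_tokenizers.Metaspace):
    # "always" prepends '▁' to every split (the legacy behavior);
    # "first" only prepends it at the very start of the input.
    if pre_tok.prepend_scheme == "always":
        pre_tok.prepend_scheme = "first"
        # reassign so the change is reflected in the backend tokenizer
        tokenizer.backend_tokenizer.pre_tokenizer = pre_tok
# Note: a legacy-converted Llama tokenizer may not use a Metaspace
# pre-tokenizer at all, in which case this check simply does nothing.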