huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Setting tokenizer.pad_token_id = model.config.eos_token_id fails for Llama 3 #34869

Open dvrogozh opened 18 hours ago

dvrogozh commented 18 hours ago

Originally reported in https://github.com/huggingface/text-generation-inference/issues/2440, which says:

Setting the pad token to point to a Llama 3 model's eos token fails because Llama 3 has a list of eos tokens instead of a single value. What is the correct way to handle this?

The following script is a simplified version of what TGI does when working with a stack that doesn't support attention (for attention, TGI follows another path and does not hit this issue). TGI has similar code at https://github.com/huggingface/text-generation-inference/blob/07bed530f7eaf2419ed0e755e0f24d7afd814a46/server/text_generation_server/models/causal_lm.py#L634

Script:

from transformers import AutoTokenizer
from transformers import AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.2-3B-Instruct')
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-3.2-3B-Instruct')

print(f">>> tokenizer.pad_token_id={tokenizer.pad_token_id}")
print(f">>> model.config.pad_token_id={model.config.pad_token_id}")
print(f">>> model.config.eos_token_id={model.config.eos_token_id}")  # a list for Llama 3

# Fails: model.config.eos_token_id is a list of ids, and the tokenizer
# only accepts a single token (str/int) as the pad token.
tokenizer.pad_token_id = model.config.eos_token_id

Output:

Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00,  2.03it/s]
>>> tokenizer.pad_token_id=None
>>> model.config.pad_token_id=None
>>> model.config.eos_token_id=[128001, 128008, 128009]
Traceback (most recent call last):
  File "/home/dvrogozh/tmp/e.py", line 11, in <module>
    tokenizer.pad_token_id = model.config.eos_token_id
  File "/home/dvrogozh/git/huggingface/transformers/src/transformers/tokenization_utils_base.py", line 1076, in __setattr__
    raise ValueError(f"Cannot set a non-string value as the {key}")
ValueError: Cannot set a non-string value as the pad_token

For reference, the values that the script attempted to assign to pad_token: ['<|end_of_text|>', '<|eom_id|>', '<|eot_id|>']
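
Those strings can be recovered from the id list with the tokenizer (a minimal sketch, reusing the objects from the script above):

# convert_ids_to_tokens maps a list of token ids back to their token strings.
print(tokenizer.convert_ids_to_tokens(model.config.eos_token_id))
# ['<|end_of_text|>', '<|eom_id|>', '<|eot_id|>']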

CC: @ArthurZucker @Narsil

zucchini-nlp commented 8 hours ago

Okay, I had time to look at the issue. Previously, no error was raised when users provided non-string/non-int inputs for special tokens: we would just cast the list to a string, as "['<|end_of_text|>', '<|eom_id|>', '<|eot_id|>']", and assign that value, which is not the correct way to do it.

I don't think we should bring back the old behavior, which didn't make much sense, though I'm also not sure that accepting lists of tokens is a good idea for tokenizers. AFAIK tokenizers assume that a special token is a single id/str. So I'd say the assignment should be

tokenizer.pad_token_id = tokenizer.eos_token_id

and not

tokenizer.pad_token_id = model.config.eos_token_id
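
Applied to the reproducer above, this works because tokenizer.eos_token_id is a single int rather than a list (a minimal sketch, assuming the same checkpoint as in the script):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-3.2-3B-Instruct')

# Unlike model.config.eos_token_id (a list of three ids), the tokenizer
# exposes a single eos id, so this assignment passes the str/int check.
tokenizer.pad_token_id = tokenizer.eos_token_id
print(tokenizer.pad_token_id)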

Might be worth a fix on the TGI side?
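
Such a fix could normalize the config value before assigning it. A hypothetical sketch (not the actual PR), preferring the tokenizer's own scalar eos id and falling back to the first entry of a list-valued config:

if tokenizer.pad_token_id is None:
    eos = model.config.eos_token_id
    if tokenizer.eos_token_id is not None:
        tokenizer.pad_token_id = tokenizer.eos_token_id
    elif isinstance(eos, list):
        # Model configs may store several eos ids; pick one arbitrarily.
        tokenizer.pad_token_id = eos[0]
    else:
        tokenizer.pad_token_id = eos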

dvrogozh commented 2 minutes ago

@zucchini-nlp: thank you for the feedback. I've posted https://github.com/huggingface/text-generation-inference/pull/2774 against TGI following your proposal.