dvrogozh opened 18 hours ago
Okay, I had time to look at the issue, and I see that before, we raised no errors when users provided non-string/non-int inputs for special tokens. We would just cast the list to a str, as in `"['<|end_of_text|>', '<|eom_id|>', '<|eot_id|>']"`, and assign that value, which is not the correct way to do it.
I don't think we should bring back the old behavior, which doesn't make much sense, though I'm not sure accepting lists of tokens is a good idea for tokenizers either. AFAIK, tokenizers assume that a special token is a single id/str. So I'd say we should assign with
`tokenizer.pad_token_id = tokenizer.eos_token_id`
and not `tokenizer.pad_token_id = model.config.eos_token_id`.
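A minimal sketch of that assignment in context (the `pad_token_id is None` guard is my assumption, not part of the proposal):

```python
# tokenizer.eos_token_id is always a single id, even when
# model.config.eos_token_id is a list (as it is for Llama 3).
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = tokenizer.eos_token_id
```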
Might be worth a fix on the TGI side?
@zucchini-nlp: thank you for the feedback. I've posted https://github.com/huggingface/text-generation-inference/pull/2774 in TGI following your proposal.
Originally from https://github.com/huggingface/text-generation-inference/issues/2440, which reads:
Setting the pad token to point to a Llama 3 model's eos token fails because Llama 3 has a list of eos tokens instead of a single value. What is the correct way to handle this?
The following script is a simplified version of what TGI does when working with the stack that doesn't support attention (for attention, TGI follows another path and does not hit this issue). TGI has similar code at https://github.com/huggingface/text-generation-inference/blob/07bed530f7eaf2419ed0e755e0f24d7afd814a46/server/text_generation_server/models/causal_lm.py#L634.
Script:
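(A hedged reconstruction, mirroring the linked TGI code; the checkpoint name is an assumption, and any Llama 3.1 checkpoint with a list-valued `eos_token_id` should behave the same.)

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumed checkpoint

config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# For Llama 3.1, config.eos_token_id is a list of ids,
# e.g. [128001, 128008, 128009], not a single int.
print("Printing values which were tried to be assigned to pad_token:")
print(tokenizer.convert_ids_to_tokens(config.eos_token_id))

# Mirrors TGI's causal_lm.py fallback; on recent transformers this raises,
# since a pad token must be a single str/int, not a list.
tokenizer.pad_token_id = (
    config.pad_token_id if config.pad_token_id is not None else config.eos_token_id
)
```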
Output:

```
Printing values which were tried to be assigned to pad_token:
['<|end_of_text|>', '<|eom_id|>', '<|eot_id|>']
```
CC: @ArthurZucker @Narsil