turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Llama3 doesn't define `pad_token_id`, so it defaults to 0, which tokenizer.json has as the token ID for '!' #415

Closed VldmrB closed 2 months ago

VldmrB commented 2 months ago

That's pretty much it. Not sure what the best solution is here. Maybe don't use the pad_token_id at all if it's not defined?

https://huggingface.co/turboderp/Llama-3-70B-exl2/blob/5.0bpw/config.json
https://huggingface.co/turboderp/Llama-3-8B-exl2/blob/6.0bpw/config.json

https://huggingface.co/turboderp/Llama-3-70B-exl2/raw/5.0bpw/tokenizer.json
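For reference, a quick way to confirm what token ID 0 maps to (a minimal sketch, assuming the Llama 3 tokenizer is available through `transformers`; the gated `meta-llama/Meta-Llama-3-8B` repo is just an example):

```python
# Quick check of what token ID 0 maps to, assuming access to a
# Llama 3 tokenizer through transformers.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
print(tok.convert_ids_to_tokens(0))  # '!' in the Llama 3 vocabulary
```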

Qubitium commented 2 months ago

None of the official Meta Llama models define a pad_token. Just set it to the unk_token, like this (HF code):

tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token_id = tokenizer.unk_token_id
grimulkan commented 2 months ago

There is no unk_token either. Maybe use the EOS token, <|end_of_text|> or <|eot_id|>? Apparently both were used interchangeably as EOS during training of the instruct model (according to https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct/discussions/2).

Also, according to https://github.com/meta-llama/llama3/issues/42, the model was trained with the padding side set to "left", so the number of padding tokens does affect the positional encodings of the unmasked text.
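If one goes that route, a minimal sketch of the setup in `transformers` terms (my own illustration based on the points above, not a confirmed recommendation; the model name is just an example):

```python
# Sketch only: reuse EOS as the pad token and pad on the left,
# matching how Llama 3 was reportedly trained (see the linked issue).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
tok.pad_token = tok.eos_token   # no pad/unk token defined, so reuse EOS
tok.padding_side = "left"       # left padding, per the linked discussion

batch = tok(["short prompt", "a somewhat longer prompt"],
            padding=True, return_tensors="pt")
print(batch["input_ids"].shape)
```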

VldmrB commented 2 months ago

> None of the official Meta Llama models define a pad_token. Just set it to the unk_token, like this (HF code):
>
> tokenizer.pad_token = tokenizer.unk_token
> tokenizer.pad_token_id = tokenizer.unk_token_id

Not true, unless people have just been adding it themselves to the config (which is doubtful), e.g.
https://huggingface.co/NousResearch/Llama-2-13b-hf/blob/main/config.json

But even if they didn't define it, the problem is that ExLlama defaults it to 0 in that case, which, with Llama 3, conflicts with an actual token ID.
I'd like the fix to be in the repo, rather than only on my end.

grimulkan commented 2 months ago

Using the unk token (id 0) was indeed common as a pad in Llama-2, but that doesn't work now.

Qubitium commented 2 months ago

> > None of the official Meta Llama models define a pad_token. Just set it to the unk_token, like this (HF code):
> >
> > tokenizer.pad_token = tokenizer.unk_token
> > tokenizer.pad_token_id = tokenizer.unk_token_id
>
> Not true, unless people have just been adding it themselves to the config (which is doubtful), e.g. https://huggingface.co/NousResearch/Llama-2-13b-hf/blob/main/config.json
>
> But even if they didn't define it, the problem is that ExLlama defaults it to 0 in that case, which, with Llama 3, conflicts with an actual token ID. I'd like the fix to be in the repo, rather than only on my end.

Is the repo you just linked to an official Meta Llama repo?

I am trying to help you.

https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/blob/main/tokenizer_config.json
https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/blob/main/config.json

In both, pad_token is not set (null).

VldmrB commented 2 months ago

> Is the repo you just linked to an official Meta Llama repo?
>
> I am trying to help you.
>
> https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/blob/main/tokenizer_config.json
> https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/blob/main/config.json
>
> In both, pad_token is not set (null).

I already said I'm not looking for a workaround. Also, your solution involves the unk token, which isn't defined in the config either.

No idea what your links say; I never signed the Meta agreement (until now). If they show the same thing, cool, but how does that change anything about ExLlama defaulting it to 0?

If you want to actually help, you can send a PR with a solution, instead of talking down to and past me.

Edit: just got access
generation_config.json

Qubitium commented 2 months ago

I am sorry if I offended you.

There is no workaround here. The pad token can be any unused and/or non-conflicting token. Models such as Llama don't define a pad token; they should, but that's beside the point.

In fact, even if a model specifies the pad token as 24254, anyone can change pad_token to another non-conflicting token such as 2323222, as long as that token is unused (preferably) and within the tokenizer/embedding range.

For many people the logic is this (see the sketch after the list). It's not a workaround; it's just how it is for many models with no pad token:

  1. Is the pad token None? If so, continue.
  2. Is the unk token not None? If yes, use it. Stop.
  3. Is the eos token not None? If yes, use it. Stop, or continue (preferably the pad token should not equal eos/bos, depending on the trainer).
  4. Are there any unused/reserved tokens? If yes, set the pad token to the first unused one. Stop.

Llama3 has many unused tokens. Check the tokenizer.json.
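A minimal sketch of that fallback in Hugging Face `transformers` terms (my own illustration, not code from this repo; the reserved-token naming assumes Llama 3's `<|reserved_special_token_N|>` convention):

```python
# Illustrative sketch of the pad-token fallback described above.
# Assumes a tokenizer loaded with transformers' AutoTokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

if tokenizer.pad_token_id is None:
    if tokenizer.unk_token_id is not None:
        # Step 2: fall back to the UNK token
        tokenizer.pad_token = tokenizer.unk_token
    elif tokenizer.eos_token_id is not None:
        # Step 3: fall back to the EOS token (less ideal when the trainer
        # needs pad != eos/bos)
        tokenizer.pad_token = tokenizer.eos_token
    else:
        # Step 4: fall back to one of the unused/reserved special tokens, if any
        reserved = [t for t in tokenizer.get_vocab()
                    if t.startswith("<|reserved_special_token_")]
        if reserved:
            tokenizer.pad_token = reserved[0]
```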

turboderp commented 2 months ago

In the quants I uploaded I explicitly defined a pad_token_id to get around this, and I've since updated the tokenizer to default to the BOS token instead of a hardcoded default of zero (UNK is rarely defined, and EOS presents the problem that it's often also interpreted as marking the end of a tokenized string when decoding).

I'm not sure if there's a more general solution.
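For clarity, a sketch of the fallback just described (my own illustration in plain Python, not the actual exllamav2 code; `resolve_pad_token_id` is a hypothetical helper):

```python
# Hypothetical helper illustrating the described behavior: take pad_token_id
# from config.json when present, otherwise fall back to the BOS token ID
# rather than a hardcoded 0.
def resolve_pad_token_id(config: dict) -> int | None:
    pad_id = config.get("pad_token_id")
    if pad_id is not None:
        return pad_id
    # UNK is rarely defined, and EOS is often treated as terminating a
    # decoded string, so BOS is the least problematic default.
    return config.get("bos_token_id")
```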

VldmrB commented 2 months ago

Thanks, I think this is at least better than having it hard-coded.
Closing the issue