oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

InternLM2 math model breaks with exllamav2_HF loader (works with non-HF) #5375

Closed bartowski1182 closed 6 months ago

bartowski1182 commented 8 months ago

Describe the bug

Trying out https://huggingface.co/bartowski/internlm2-math-20b-llama-exl2, and whenever I use the HF loader to generate, I get a bunch of errors. With the non-HF loader I have no problems generating.

I also tried updating to the dev branch, where exllamav2 has been bumped, but it still doesn't work with the HF loader.

Is there an existing issue for this?

Reproduction

1. Download the 4.25 quant from https://huggingface.co/bartowski/internlm2-math-20b-llama-exl2
2. Load it with the ExLlamav2_HF loader
3. Attempt to generate any text

Screenshot

No response

Logs

Traceback (most recent call last):
  File "/text-generation-webui/modules/callbacks.py", line 61, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/text-generation-webui/modules/text_generation.py", line 379, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/opt/conda/envs/textgen/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/textgen/lib/python3.11/site-packages/transformers/generation/utils.py", line 1525, in generate
    return self.sample(
           ^^^^^^^^^^^^
  File "/opt/conda/envs/textgen/lib/python3.11/site-packages/transformers/generation/utils.py", line 2622, in sample
    outputs = self(
              ^^^^^
  File "/text-generation-webui/modules/exllamav2_hf.py", line 125, in __call__
    self.ex_model.forward(seq_tensor[:-1].view(1, -1), ex_cache, preprocess_only=True, loras=self.loras)
  File "/opt/conda/envs/textgen/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/textgen/lib/python3.11/site-packages/exllamav2/model.py", line 559, in forward
    r, ls = self._forward(input_ids = input_ids[:, chunk_begin : chunk_end],
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/textgen/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/textgen/lib/python3.11/site-packages/exllamav2/model.py", line 623, in _forward
    x = module.forward(x, cache = cache, attn_params = attn_params, past_len = past_len, loras = loras)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/textgen/lib/python3.11/site-packages/exllamav2/embedding.py", line 72, in forward
    hidden_states = self.embedding.forward(hidden_states)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/textgen/lib/python3.11/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
           ^^^^^^^^^^^^
  File "/opt/conda/envs/textgen/lib/python3.11/site-packages/torch/nn/functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: index out of range in self

System Info

Ubuntu 22.04
Nvidia 3090
Docker environment
ExllamaV2 v0.0.12

latest dev branch
oobabooga commented 8 months ago

Both of these work fine for me

python server.py --model bartowski_internlm2-math-20b-llama-exl2_4_25 --loader exllamav2_hf --listen
python server.py --model bartowski_internlm2-math-20b-llama-exl2_6_5 --loader exllamav2_hf --listen

Version

exllamav2                         0.0.12+cu121
bartowski1182 commented 8 months ago

Are you able to generate anything? Loading works, but as soon as I try to prompt it, it errors.

oobabooga commented 8 months ago

Yes, both generated text.

bartowski1182 commented 8 months ago

Okay, reinstalling everything from scratch and specifying the torch version seems to have fixed it. I'm not sure which part did it, but it's working with the HF loader now. Sorry about that; closing this.

bartowski1182 commented 8 months ago

@oobabooga sorry to resurrect, but I realized that it breaks when I use the chat template (ChatML). Without a template it generates fine. It may be a config issue, but I'm curious whether you see the same thing, especially since the non-HF loader worked with the ChatML im_start tokens.

oobabooga commented 8 months ago

I have reproduced the issue. It seems like the base ExLlamav2 tokenizer tokenizes <|im_start|> incorrectly:

1      -  '<s>'
333    -  '<'
352    -  '|'
449    -  'im'
5064   -  '_start'
352    -  '|'
330    -  '>'

whereas the HF tokenizer in ExLlamav2_HF recognizes the special token:

1      -  '<s>'
92549  -  '<|im_start|>'

which then causes the panic. It seems like a bug in ExLlamaV2.
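
A minimal sketch to reproduce the comparison (the model path is a placeholder, and trust_remote_code is assumed for InternLM2's custom tokenizer; this isn't necessarily how it was checked here):

from exllamav2 import ExLlamaV2Config, ExLlamaV2Tokenizer
from transformers import AutoTokenizer

model_dir = "/models/internlm2-math-20b-llama-exl2"  # placeholder path

# Base ExLlamaV2 (SentencePiece) tokenizer, as used by the non-HF loader
config = ExLlamaV2Config()
config.model_dir = model_dir
config.prepare()
exl2_tokenizer = ExLlamaV2Tokenizer(config)
print(exl2_tokenizer.encode("<|im_start|>"))  # splits into '<', '|', 'im', '_start', ...

# HF tokenizer, as used by the ExLlamav2_HF loader
hf_tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
print(hf_tokenizer("<|im_start|>")["input_ids"])  # BOS plus the single added-token ID (92549 here)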

bartowski1182 commented 8 months ago

That's interesting, because the tokenizer config has im_start as 92543. I'll investigate more with that in mind and get back to you.

turboderp commented 8 months ago

Here are the last pieces in the SentencePiece model:

92530 [UNUSED_TOKEN_133]
92531 [UNUSED_TOKEN_134]
92532 [UNUSED_TOKEN_135]
92533 [UNUSED_TOKEN_136]
92534 [UNUSED_TOKEN_137]
92535 [UNUSED_TOKEN_138]
92536 [UNUSED_TOKEN_139]
92537 [UNUSED_TOKEN_140]
92538 [UNUSED_TOKEN_141]
92539 [UNUSED_TOKEN_142]
92540 [UNUSED_TOKEN_143]
92541 [UNUSED_TOKEN_144]
92542 [UNUSED_TOKEN_145]
92543 [UNUSED_TOKEN_146]
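
A listing like this can be pulled straight from the model's SentencePiece file with something like the following sketch (the path is a placeholder):

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="/models/internlm2-math-20b-llama-exl2/tokenizer.model")
n = sp.get_piece_size()
for i in range(n - 14, n):
    print(i, sp.id_to_piece(i))  # last 14 pieces: the [UNUSED_TOKEN_...] slots shown above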

Normally extra tokens would be appended to the end of the vocabulary, but here I guess they're meant to override them instead. Going by tokenizer_config.json, <|im_start|> is meant to be token 92543:

0: "<unk>"
1: "<s>"
2: "</s>"
92538: "<|plugin|>"
92539: "<|interpreter|>"
92540: "<|action_end|>"
92541: "<|action_start|>"
92542: "<|im_end|>"
92543: "<|im_start|>"

I guess when the HF tokenizer loads tokenizer_config.json, it appends any new tokens to the end of the vocab, which would put <|im_start|> at position 92549. The ExLlama tokenizer doesn't read tokenizer_config.json, but it does load tokenizer.json and added_tokens.json.

So you could possibly create an added_tokens.json with the following:

{
    "<unk>": 0,
    "<s>": 1,
    "</s>": 2,
    "<|plugin|>": 92538,
    "<|interpreter|>": 92539,
    "<|action_end|>": 92540,
    "<|action_start|>": 92541,
    "<|im_end|>": 92542,
    "<|im_start|>": 92543
}

Not entirely sure if that's enough, and I don't have any more time to investigate right now. But it's maybe a place to start.
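
A quick sketch for checking whether that added_tokens.json takes effect in the base ExLlamaV2 tokenizer (model path is a placeholder; encode_special_tokens is the flag mentioned further down):

from exllamav2 import ExLlamaV2Config, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/models/internlm2-math-20b-llama-exl2"  # placeholder path
config.prepare()
tokenizer = ExLlamaV2Tokenizer(config)

ids = tokenizer.encode("<|im_start|>user\n", encode_special_tokens=True)
print(ids)  # expect 92543 for <|im_start|> instead of the '<', '|', 'im', ... split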

bartowski1182 commented 8 months ago

Putting it in added_tokens.json doesn't fix the HF one (though I may roll back to when the non-HF loader was in and try with that).

I wonder if this really means the HF loader should recognize the overlap in tokenizer_config.json and replace the tokens rather than appending them; as you said, that seems to be exactly the problem. Or maybe there's a different way it's supposed to be done, like with special_tokens_map.json?
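
The append-vs-replace mismatch is easy to see on the HF side with a sketch like this (path is a placeholder):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/models/internlm2-math-20b-llama-exl2", trust_remote_code=True)
print(tokenizer.vocab_size)                             # base SentencePiece vocab size
print(len(tokenizer))                                   # base vocab plus the appended added tokens
print(tokenizer.convert_tokens_to_ids("<|im_start|>"))  # 92549 here, past the model's embedding table,
                                                        # which is what triggers the IndexError above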

bartowski1182 commented 8 months ago

Yeah, so adding the added_tokens.json did work for the non-HF loader; im_start gets mapped properly now:

1     : "<s>"
92543 : "<|im_start|>"
1008  : "user"
364   : "\n"
2661  : "Input"
92542 : "<|im_end|>"
364   : "\n"
92543 : "<|im_start|>"
525   : "ass"
11353 : "istant"
364   : "\n"

though it still enjoys outputting [UNUSED_TOKEN_145] at the end for whatever reason

turboderp commented 8 months ago

Yeah, it's still [UNUSED_TOKEN_145] as far as SentencePiece is concerned. It should recognize the added tokens when encoding with encode_special_tokens = True. Using decode_special_tokens = True when decoding should also turn ID 92542 into the string <|im_end|> if I'm not mistaken, but you'd normally not want that in the output either.

All ChatML models I've seen so far set the eos_token_id to the ID of whatever <|im_end|> is, so you could try changing that in config.json. The streaming generator should then stop before outputting the token. In theory.
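
A sketch of that config.json tweak (path is a placeholder; 92542 is <|im_end|> per the tokenizer_config.json listing above):

import json

path = "/models/internlm2-math-20b-llama-exl2/config.json"  # placeholder path
with open(path) as f:
    cfg = json.load(f)

cfg["eos_token_id"] = 92542  # ID of <|im_end|> in this model
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)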

bartowski1182 commented 8 months ago

Not sure if you noticed this, @oobabooga, but it seems likely that the fix is to overwrite the tokens rather than append them, if possible.

github-actions[bot] commented 6 months ago

This issue has been closed due to inactivity for 2 months. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.