Closed edulov71 closed 1 month ago
Hey @edulov71!
The error message comes from a position ids shape mismatch, and on closer inspection it looks like AirLLM is limiting the max length to 512 tokens.
You can try loading the model with max_seq_length=512 in from_pretrained, though I'm not sure the model will actually use it. If it doesn't, I recommend opening an issue in the AirLLM repo.
1. My GPU is not the fastest, so the checks took time. I was able to reproduce the issue with the same parameter, max_seq_len=512.
So it could be an AirLLM fault, but before I report it there I want to know why it happens inside your function after many, many layers have executed: how/why does one of the tensors increase its length by 1? Or could it be a fault inside the Llama 3.1-8B model?
Regards,
how/why one of the tensors increases its length by 1.
When you generate text, each step adds one new token, so the total sequence length grows up to input_length + max_new_tokens tokens. You should set the max length at loading with this in mind.
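To make the arithmetic above concrete, here is a minimal sketch in plain Python. The helper names and the prompt length of 70 tokens are hypothetical (the thread never states the real token count); only the 512-token limit and the two max_new_tokens values come from this issue.

```python
def required_seq_length(input_length: int, max_new_tokens: int) -> int:
    """Longest sequence reached during generation: prompt plus every new token."""
    return input_length + max_new_tokens


def fits_limit(input_length: int, max_new_tokens: int, max_seq_length: int) -> bool:
    """True if generation can finish without exceeding max_seq_length."""
    return required_seq_length(input_length, max_new_tokens) <= max_seq_length


# Hypothetical prompt length; the real token count is not given in this thread.
prompt_tokens = 70

# Failing run: max_new_tokens=512 against a 512-token limit.
print(fits_limit(prompt_tokens, 512, 512))  # False
# Earlier run that worked: max_new_tokens=128.
print(fits_limit(prompt_tokens, 128, 512))  # True
```

Under that assumed prompt length, the earlier run stays within 512 tokens, which would be consistent with it working, though nothing in this thread confirms which limit was in effect for that run.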
Regarding the garbage output: Llama models should be able to generate long sequences, so I recommend using the model directly from transformers (I haven't seen this issue before) or opening an issue on the AirLLM repo 🤗
Thanks. One simpler part of the riddle is solved. But only you, as part of the developer team, can say whether adding more tokens beyond the limit is an internal issue of the Llama 3.1 model (someone forgot to add the corresponding checks in the code), meaning the result of some automatic behavior, or whether it was forced from outside (in which case the corresponding checks inside are still missing).
No, Llama technically has no limit on max length thanks to RoPE scaling, so this limit is set manually by AirLLM (via max_seq_length at loading).
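As a rough illustration of why rotary embeddings carry no built-in length cap, the sketch below computes the standard RoPE rotation angles for any position and contrasts that with a table precomputed for a fixed number of positions. The function names, the tiny head_dim of 8, and the base of 10000 (from the original RoPE formulation) are assumptions for illustration; this is not AirLLM's or transformers' code, and how AirLLM enforces its limit internally is not shown in this thread.

```python
def rope_angles(position: int, head_dim: int, base: float = 10000.0) -> list:
    """Rotation angles for one position: position * base**(-2i / head_dim)."""
    return [position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def build_table(max_seq_length: int, head_dim: int) -> list:
    """Precompute angles for positions 0 .. max_seq_length - 1 only."""
    return [rope_angles(p, head_dim) for p in range(max_seq_length)]


# A table sized for 512 positions (illustrative, mirrors the limit in this issue).
table = build_table(512, 8)

# The formula itself works for any position, including the 513th token (index 512).
angles_513th = rope_angles(512, 8)

# A fixed-size table has no entry for that position.
try:
    table[512]
except IndexError:
    print("position 512 is outside the precomputed table")
```

The point is only that the angles are a closed-form function of the position, so any cap has to come from how much is precomputed or allowed by the surrounding code.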
Thank you, I'm going to report this to AirLLM.
System Info
transformers version: 4.43.4
Who can help?
@Narsil @zucchini-nlp
Reproduction
Run this slightly modified AirLLM example to get the error listed below. Earlier I ran the same code with MAX_LENGTH = 1024 and max_new_tokens=128, and it worked fine.
from airllm import AutoModel
MAX_LENGTH = 4096
# could use hugging face model repo id:
model = AutoModel.from_pretrained("/home/edh/Downloads/llama3_1/Meta-Llama-3.1-8B-Instruct")
input_text = [
    'Alice has 2 kids and at least one of them is a girl. What is the probability that the other child is also a girl?\n'
    'You can assume that there are an equal number of males and females in the world.\n'
    'A) 0.5\nB) 0.25\nC) 0.333\nD) 0.75'
]
input_tokens = model.tokenizer(input_text, return_tensors="pt", return_attention_mask=False, truncation=True, max_length=MAX_LENGTH, padding=False)
generation_output = model.generate( input_tokens['input_ids'].cuda(), max_new_tokens=512, use_cache=True, return_dict_in_generate=True, temperature=0.9)
output = model.tokenizer.decode(generation_output.sequences[0])
print(output)
Traceback (most recent call last):
  File "/home/edh/Downloads/airllm-example.py", line 24, in <module>
    generation_output = model.generate(
                        ^^^^^^^^^^^^^^^
  File "/home/edh/.local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/edh/.local/lib/python3.12/site-packages/transformers/generation/utils.py", line 1989, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/home/edh/.local/lib/python3.12/site-packages/transformers/generation/utils.py", line 2932, in _sample
    outputs = self(**model_inputs, return_dict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/edh/.local/lib/python3.12/site-packages/airllm/airllm_base.py", line 364, in __call__
    return self.forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/edh/.local/lib/python3.12/site-packages/airllm/airllm_base.py", line 564, in forward
    new_seq = layer(seq, **kwargs)[0]
              ^^^^^^^^^^^^^^^^^^^^
  File "/home/edh/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/edh/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/edh/.local/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 677, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
  File "/home/edh/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/edh/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/edh/.local/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 578, in forward
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/edh/.local/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 219, in apply_rotary_pos_emb
    q_embed = (q * cos) + (rotate_half(q) * sin)
               ~~^~~~~
RuntimeError: The size of tensor a (513) must match the size of tensor b (512) at non-singleton dimension 2
Expected behavior
When it runs, it simply solves the task (with a wrong solution compared to the full 405B model), but it finishes the job and prints the resulting text. Maybe the core of the problem lies in AirLLM, but this exact error is raised inside the Transformers package.
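The RuntimeError in the traceback is an ordinary broadcasting failure in the elementwise product q * cos. The sketch below re-implements the broadcasting rule in plain Python to show why 513 against 512 fails at dimension 2. The function is a hypothetical emulation, not PyTorch code, and the shapes (1 batch, 32 heads, head size 128) are assumed for illustration; only the 513 and 512 come from the error message.

```python
def broadcast_shape(a: tuple, b: tuple) -> tuple:
    """Emulate the elementwise broadcasting rule for two tensor shapes.

    Sizes must be equal, or one of them must be 1, in every dimension.
    """
    ndim = max(len(a), len(b))
    # Left-pad the shorter shape with 1s so both have the same rank.
    a = (1,) * (ndim - len(a)) + tuple(a)
    b = (1,) * (ndim - len(b)) + tuple(b)
    out = []
    for dim, (x, y) in enumerate(zip(a, b)):
        if x == y or y == 1:
            out.append(x)
        elif x == 1:
            out.append(y)
        else:
            raise ValueError(
                f"The size of tensor a ({x}) must match the size of tensor b ({y}) "
                f"at non-singleton dimension {dim}"
            )
    return tuple(out)


# Assumed layout: q is (batch, heads, seq_len, head_dim), cos is (batch, 1, seq_len, head_dim).
q_shape = (1, 32, 513, 128)    # one token past the limit
cos_shape = (1, 1, 512, 128)   # rotary table covering only 512 positions

try:
    broadcast_shape(q_shape, cos_shape)
except ValueError as err:
    print(err)
```

The heads dimension broadcasts because cos has size 1 there, but the sequence dimension has two different non-1 sizes, so the product cannot be formed.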