edulov71 commented 1 month ago

System Info

transformers version: 4.43.4
Platform: Linux-6.5.0-35-generic-x86_64-with-glibc2.35
Python version: 3.12.4
Huggingface_hub version: 0.24.5
Safetensors version: 0.4.3
Accelerate version: 0.33.0
Accelerate config: not found
PyTorch version (GPU?): 2.3.0+cu121 (True)
Tensorflow version (GPU?): 2.16.1 (True)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?:
Using GPU in script?:
GPU type: NVIDIA GeForce RTX 2060 with Max-Q Desig

Who can help?

@Narsil @zucchini-nlp

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)

Reproduction

Just execute this slightly modified AirLLM code to get the error listed below. Earlier I ran the same code with MAX_LENGTH = 1024 and max_new_tokens=128, and it worked fine.

from airllm import AutoModel

MAX_LENGTH = 4096

# could use hugging face model repo id:

model = AutoModel.from_pretrained("/home/edh/Downloads/llama3_1/Meta-Llama-3.1-8B-Instruct")

input_text = [ 'Alice has 2 kids and at least one of them is a girl. What is the probability that the other child is also a girl?\nYou can assume that there are an equal number of males and females in the world.\nA) 0.5\nB) 0.25\nC) 0.333\nD) 0.75' ]

input_tokens = model.tokenizer(input_text, return_tensors="pt", return_attention_mask=False, truncation=True, max_length=MAX_LENGTH, padding=False)

generation_output = model.generate( input_tokens['input_ids'].cuda(), max_new_tokens=512, use_cache=True, return_dict_in_generate=True, temperature=0.9)

output = model.tokenizer.decode(generation_output.sequences[0])

print(output)

Traceback (most recent call last):
  File "/home/edh/Downloads/airllm-example.py", line 24, in <module>
    generation_output = model.generate(
                        ^^^^^^^^^^^^^^^
  File "/home/edh/.local/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/edh/.local/lib/python3.12/site-packages/transformers/generation/utils.py", line 1989, in generate
    result = self._sample(
             ^^^^^^^^^^^^^
  File "/home/edh/.local/lib/python3.12/site-packages/transformers/generation/utils.py", line 2932, in _sample
    outputs = self(**model_inputs, return_dict=True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/edh/.local/lib/python3.12/site-packages/airllm/airllm_base.py", line 364, in __call__
    return self.forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/edh/.local/lib/python3.12/site-packages/airllm/airllm_base.py", line 564, in forward
    new_seq = layer(seq, **kwargs)[0]
              ^^^^^^^^^^^^^^^^^^^^
  File "/home/edh/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/edh/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/edh/.local/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 677, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
                                                          ^^^^^^^^^^^^^^^
  File "/home/edh/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/edh/.local/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/edh/.local/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 578, in forward
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/edh/.local/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 219, in apply_rotary_pos_emb
    q_embed = (q * cos) + (rotate_half(q) * sin)
               ~~^~~~~
RuntimeError: The size of tensor a (513) must match the size of tensor b (512) at non-singleton dimension 2

Expected behavior

When it runs, it just solves the task (with a wrong solution compared to the full 405B model) and finishes the job by printing the resulting text. Maybe the core of the problem lies in AirLLM, but this exact error is produced inside the Transformers package.

zucchini-nlp commented 1 month ago

Hey @edulov71 !

The error message is related to a position ids shape mismatch, and after closer inspection it seems that AirLLM limits the max length to 512 tokens.

https://github.com/lyogavin/airllm/blob/b1e6311cc2ce7a6653d08578dcf5741fa5226fcd/air_llm/airllm/airllm_llama_mlx.py#L210

You can try loading the model with max_seq_length=512 passed to from_pretrained, though I'm not sure the model will use it. I recommend opening an issue in the AirLLM repo if it doesn't.
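The mismatch described in this comment can be illustrated without loading any model. The following is a minimal sketch, assuming the rotary cos/sin table covers a fixed number of positions (the cap) and that the sequence passed through the layers grows by one token per generation step. `first_failing_length` is a hypothetical helper, not part of AirLLM or Transformers, and the 70-token prompt length used in the note after the block is an illustrative guess.

```python
from typing import Optional


def first_failing_length(prompt_len: int, max_new_tokens: int, max_seq_len: int) -> Optional[int]:
    """Return the sequence length at which generation would exceed the cap, or None.

    Assumption: positions are only available for `max_seq_len` tokens, and the
    full sequence grows by one token on every generation step.
    """
    seq_len = prompt_len
    for _ in range(max_new_tokens):
        # One forward pass over `seq_len` tokens; it fails past the cap.
        if seq_len > max_seq_len:
            return seq_len
        seq_len += 1  # the sampled token is appended for the next step
    return None
```

Under these assumptions, `first_failing_length(70, 512, 512)` returns 513, matching the "513 vs 512" in the traceback, while `first_failing_length(70, 128, 512)` returns `None`, which would explain why the earlier run with max_new_tokens=128 finished.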

edulov71 commented 1 month ago

1. My GPU is not the fastest, so the checks took time. I was able to reproduce the issue with the same parameter, max_seq_len=512.
2. Setting max_seq_len=4096 in from_pretrained(), i.e. a large value equal to max_length in the tokenizer, worked. The result was absolutely wrong, but there was no crash.
3. Afterwards I set both params to 400, and it crashed again (for two different temperatures) with the same size issue: 401 vs 400 in the indicated function.

So it could be an AirLLM fault, but before reporting it there I want to know why it happens inside your function after many layers have executed: how and why does one of the tensors grow in length by 1? Or could it be a fault inside the Llama 3.1-8B model?

Regards,

zucchini-nlp commented 1 month ago

> how/why one of the tensors increases its length by 1.

When you generate text, one new token is added at every step, so the total length can grow up to input_length + max_new_tokens tokens. You should set the max length at loading with this in mind.
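The arithmetic in this reply can be written down as a small budget check. This is only a sketch: `required_max_seq_len` and `clamp_new_tokens` are hypothetical helper names, and the prompt length would in practice come from the tokenized input (for example, the second dimension of `input_ids`).

```python
def required_max_seq_len(prompt_len: int, max_new_tokens: int) -> int:
    """Upper bound on the total sequence length reached during generation."""
    if prompt_len < 0 or max_new_tokens < 0:
        raise ValueError("lengths must be non-negative")
    return prompt_len + max_new_tokens


def clamp_new_tokens(prompt_len: int, max_new_tokens: int, max_seq_len: int) -> int:
    """Largest number of new tokens that keeps the total within max_seq_len."""
    return max(0, min(max_new_tokens, max_seq_len - prompt_len))
```

For an assumed 70-token prompt and max_new_tokens=512, the bound is 582, so a cap of 512 is too small. With that cap, at most 442 new tokens would fit.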

Regarding the garbage output, Llama models should be able to generate long sequences, so I recommend trying the model directly from transformers (I haven't seen this issue before) or opening an issue on the AirLLM repo 🤗

edulov71 commented 1 month ago

Thanks. One, simpler part of the riddle is solved. But only you, as part of the developer team, can say whether adding more tokens beyond the limit is an internal issue of the Llama 3.1 model (a forgotten check in the code), i.e. the result of some automatic action, or whether it was forced somehow from outside (in which case the corresponding checks are still missing inside).

zucchini-nlp commented 1 month ago

No, Llama technically has no limit on max length thanks to RoPE scaling, so this limit is set manually by AirLLM (via max_seq_length at loading).
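To make the point about rotary embeddings concrete, here is a minimal sketch of the standard rotary angle formula, written with the Python standard library only. It is not the Transformers implementation: `rope_angles` and `rotate_pair` are hypothetical names, the base of 10000 is an assumed default, and any frequency scaling a specific model applies is left out. The formula accepts any position index, so a cap such as 512 would come from how many positions are precomputed rather than from the formula itself.

```python
import math
from typing import List, Tuple


def rope_angles(position: int, head_dim: int, base: float = 10000.0) -> List[float]:
    """Rotation angles for one position: position * base**(-2i / head_dim)."""
    return [position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def rotate_pair(x0: float, x1: float, angle: float) -> Tuple[float, float]:
    """Rotate one 2-D feature pair by `angle`; the vector norm is preserved."""
    c, s = math.cos(angle), math.sin(angle)
    return (x0 * c - x1 * s, x0 * s + x1 * c)


# Angles exist for a position beyond any precomputed table size.
angles_at_513 = rope_angles(513, head_dim=8)
rotated = rotate_pair(1.0, 0.0, angles_at_513[0])
```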

edulov71 commented 1 month ago

Thank you, gonna report to AirLLM.

huggingface / transformers

apply_rotary_pos_emb() Tensor size mismatch #32582
