nguyen-viet-hung closed this issue 10 months ago
Hello,
I found that the issue comes from the Llama implementation in transformers; after updating to version 4.33.0 it no longer occurs. I don't know why installing AirLLM causes this error.
For anyone facing this issue, you can reinstall the transformers package with pip install -U transformers==4.33.0,
or keep your working version from before installing AirLLM.
But now I have a new issue: the inference process keeps looping for a long time...
can you try setting max_new_tokens? Maybe try setting it to 2?
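For reference, a minimal sketch of where that parameter goes, following the usage pattern in the AirLLM README (the model path and prompt are placeholders, not taken from this thread). Note that AirLLM reloads every layer shard for each generated token, so one full 35/35 progress bar per token is expected:

```python
# Sketch only: assumes the AirLLMLlama2 entry point documented in the AirLLM README.
from airllm import AirLLMLlama2

MODEL_PATH = "SeaLLMs/SeaLLM-7B-chat"  # placeholder: use your own Llama-2-based checkpoint

model = AirLLMLlama2(MODEL_PATH)
input_ids = model.tokenizer(
    "Hello, how are you?",  # placeholder prompt
    return_tensors="pt",
    truncation=True,
    max_length=128,
).input_ids

generation_output = model.generate(
    input_ids.cuda(),
    max_new_tokens=2,  # each new token costs one full pass over all layer shards
    use_cache=True,
    return_dict_in_generate=True,
)
print(model.tokenizer.decode(generation_output.sequences[0]))
```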
After I set max_new_tokens to 2, it runs two loops, then stops and gives back my prompt:
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
cuda:0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:20<00:00, 1.68it/s]
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
cuda:0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:20<00:00, 1.71it/s]
Trả lời (Answer): <s><s>[INST] <<SYS>>
You are a multilingual, helpful, respectful and honest assistant. Please always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share incorrect information
Nhập một câu truy vấn (Enter a query):
Can you try the latest version, airllm 2.6.2?
I tried it here: https://github.com/lyogavin/Anima/blob/main/air_llm/tests/test_notebooks/test_sealllm.ipynb
It works.
Hello,
I tried your latest version. With max_new_tokens = 2, it runs two loops and returns my prompt. If I set it to 30 or higher, it runs several loops and then fails with this error:
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device): 100%|██████████████████████████████████████████████████████████████████████| 35/35 [00:14<00:00, 2.47it/s]
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device): 100%|██████████████████████████████████████████████████████████████████████| 35/35 [00:13<00:00, 2.53it/s]
[... the same three log lines repeat 24 more times, one full 35/35 pass per generated token ...]
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device): 3%|██ | 1/35 [00:00<00:23, 1.42it/s]
Traceback (most recent call last):
File "/home/coreai/hungnv/chatbot-llm/air_seallm_extract.py", line 83, in <module>
generation_output = model.generate(
File "/home/coreai/anaconda3/envs/llm-lvd/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/coreai/anaconda3/envs/llm-lvd/lib/python3.10/site-packages/transformers/generation/utils.py", line 1718, in generate
return self.greedy_search(
File "/home/coreai/anaconda3/envs/llm-lvd/lib/python3.10/site-packages/transformers/generation/utils.py", line 2579, in greedy_search
outputs = self(
File "/home/coreai/anaconda3/envs/llm-lvd/lib/python3.10/site-packages/airllm/airllm_base.py", line 340, in __call__
return self.forward(*args, **kwargs)
File "/home/coreai/anaconda3/envs/llm-lvd/lib/python3.10/site-packages/airllm/airllm_base.py", line 540, in forward
new_seq = layer(seq, **kwargs)[0]
File "/home/coreai/anaconda3/envs/llm-lvd/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/coreai/anaconda3/envs/llm-lvd/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/coreai/anaconda3/envs/llm-lvd/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 796, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/home/coreai/anaconda3/envs/llm-lvd/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/coreai/anaconda3/envs/llm-lvd/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/coreai/anaconda3/envs/llm-lvd/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 704, in forward
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
File "/home/coreai/anaconda3/envs/llm-lvd/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 234, in apply_rotary_pos_emb
q_embed = (q * cos) + (rotate_half(q) * sin)
RuntimeError: The size of tensor a (513) must match the size of tensor b (512) at non-singleton dimension 2
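(For anyone hitting the same RuntimeError: the 513 vs 512 mismatch suggests the sequence, prompt length plus generated tokens, has grown past a 512-position rotary-embedding cache. A possible workaround, sketched below under that assumption, is to cap max_new_tokens so the total stays within 512; `model` and `prompt` refer to the AirLLM setup above, and the names are placeholders, not from the original script.)

```python
# Assumption: the rotary-embedding cache covers 512 positions ("tensor b (512)"
# in the error), so prompt_len + max_new_tokens must stay <= 512.
MAX_POSITIONS = 512

prompt_ids = model.tokenizer(prompt, return_tensors="pt").input_ids
prompt_len = prompt_ids.shape[1]

# Leave room for generation so position indices never reach 512.
safe_new_tokens = max(1, MAX_POSITIONS - prompt_len)
generation_output = model.generate(
    prompt_ids.cuda(),
    max_new_tokens=min(30, safe_new_tokens),
    use_cache=True,
    return_dict_in_generate=True,
)
```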
Hello,
I am testing AirLLM with a model based on Llama-2. I successfully created the split model, but when I run inference I get an error. My code is below:
And here is the output with the error:
Please help me: what am I doing wrong?