lyogavin / airllm

AirLLM 70B inference with single 4GB GPU
Apache License 2.0

AirLLMLlama2 error: TypeError: llama_forward() got an unexpected keyword argument 'padding_mask' #67

Closed · nguyen-viet-hung closed this issue 10 months ago

nguyen-viet-hung commented 10 months ago

Hello,

I am testing AirLLM with a model based on Llama-2. I successfully created the split model, but when I run inference, I get an error. My code is below:

import torch
from airllm import AirLLMLlama2
from dbdc import build_dbdc
from langchain.embeddings import HuggingFaceEmbeddings

class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKCYAN = '\033[96m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

model_path = 'SeaLLMs/SeaLLM-7B-chat'
model = AirLLMLlama2('/home/coreai/.cache/huggingface/hub/models--SeaLLMs--SeaLLM-7B-chat/snapshots/515af2338223985d32ced3307c018899396a2967')

BOS_TOKEN = '<s>'
EOS_TOKEN = '</s>'

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

SYSTEM_PROMPT = """You are a multilingual, helpful, respectful and honest assistant. \
Please always answer as helpfully as possible, while being safe. Your \
answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure \
that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not \
correct. If you don't know the answer to a question, please don't share false information.

As a multilingual assistant, you must respond and follow instructions in the native language of the user by default, unless told otherwise. \
Your response should adapt to the norms and customs of the respective language and culture.

Only use the following pieces of context to answer the question at the end.

{context}

"""

TEXT = """CỘNG HÒA XÃ HỘI CHỦ NGHĨA VIỆT NAM . Độc lập - Tự do - Hạnh phúc . SOCIALIST REPUBLIC OF VIET NAM . Independence - Freedom - Happiness . CĂN CƯỚC CÔNG DÂN . Citizen Identity Card . Số / No. : 095426014597 . Họ và tên / Full name : THẠCH SANG . Ngày sinh / Date of birth : 20 / 10 / 1987 . Giới tính / Sex : Nam Quốc tịch / Nationality : Việt Nam . Quê quán / Place of origin : Thủy Liễu , Gò Quao , Kiên Giang . Nơi thường trú / Place of residence : Khu Phố Minh An . TT.Minh Lương , Châu Thành , Kiên Giang . Date of expiry . Có giá trị đến : 20 / 10 / 2027"""

while True:
    query = input("\nNhập một câu truy vấn: ")
    if query in ["exit", 'thoát']:
        break
    if query.strip() == "":
        continue

    print(f"{bcolors.WARNING}Thông tin dùng để trả lời: {TEXT}{bcolors.ENDC}")

    int_prompt = SYSTEM_PROMPT.format_map({'context': TEXT})

    input_prompt = f"{BOS_TOKEN}{B_INST} {B_SYS} {int_prompt} {E_SYS} Question: {query} {E_INST}\nAnswer:"

    input_ids = model.tokenizer([input_prompt],
        return_tensors="pt", 
        return_attention_mask=True, 
        truncation=True, 
        max_length=128, 
        padding=False)

    generation_output = model.generate(
        input_ids['input_ids'].cuda(), 
        max_new_tokens=1024,
        use_cache=True,
        return_dict_in_generate=True)

    output = model.tokenizer.decode(generation_output.sequences[0])
    print(f"{bcolors.OKCYAN}Trả lời: {output}{bcolors.ENDC}")

And here is the output with the error:

The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
cuda:0:   3%|██▊                                                                                                 | 1/35 [00:01<00:37,  1.12s/it]
Traceback (most recent call last):
  File "/home/coreai/hungnv/chatbot-llm/air_seallm_extract.py", line 89, in <module>
    generation_output = model.generate(
  File "/home/coreai/anaconda3/envs/chatbot-llm/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/coreai/anaconda3/envs/chatbot-llm/lib/python3.10/site-packages/transformers/generation/utils.py", line 1606, in generate
    return self.greedy_search(
  File "/home/coreai/anaconda3/envs/chatbot-llm/lib/python3.10/site-packages/transformers/generation/utils.py", line 2454, in greedy_search
    outputs = self(
  File "/home/coreai/anaconda3/envs/chatbot-llm/lib/python3.10/site-packages/airllm/airllm.py", line 193, in __call__
    return self.forward(*args, **kwargs)
  File "/home/coreai/anaconda3/envs/chatbot-llm/lib/python3.10/site-packages/airllm/airllm.py", line 315, in forward
    new_seq, (k_cache, v_cache) = layer(seq,
  File "/home/coreai/anaconda3/envs/chatbot-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/coreai/anaconda3/envs/chatbot-llm/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 635, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/coreai/anaconda3/envs/chatbot-llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/coreai/anaconda3/envs/chatbot-llm/lib/python3.10/site-packages/optimum/bettertransformer/models/decoder_models.py", line 426, in forward
    return llama_forward(self, *args, **kwargs)
TypeError: llama_forward() got an unexpected keyword argument 'padding_mask'
(chatbot-llm) [coreai@ai-workergpudev02 chatbot-llm]$ 

Please help me figure out what I am doing wrong.

nguyen-viet-hung commented 10 months ago

Hello,

I found that the issue comes from the Llama implementation in transformers. After updating to version 4.33.0 it can be bypassed. I don't know why installing AirLLM triggers this error.

For anyone facing this issue: you can reinstall the transformers package with pip install -U transformers==4.33.0, or keep whichever version was working before you installed AirLLM.
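
For context: newer transformers releases (around 4.34) began passing a padding_mask keyword into each attention layer's forward, which optimum's BetterTransformer override llama_forward does not accept, hence the TypeError. If downgrading is not an option, a hypothetical (untested) stopgap is to strip the unsupported keyword before it reaches the patched layers:

import functools

def drop_padding_mask(forward_fn):
    # Hypothetical helper (untested): discard the 'padding_mask' kwarg that
    # newer transformers versions pass to attention layers but that optimum's
    # BetterTransformer llama_forward does not accept.
    @functools.wraps(forward_fn)
    def wrapped(*args, **kwargs):
        kwargs.pop('padding_mask', None)
        return forward_fn(*args, **kwargs)
    return wrapped

# With a plain transformers LlamaForCausalLM you could wrap every attention
# module as below; where to hook this in AirLLM depends on its lazy,
# layer-by-layer loading, so treat this as a sketch only.
# for layer in hf_model.model.layers:
#     layer.self_attn.forward = drop_padding_mask(layer.self_attn.forward)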

But now I have a new issue: the inference process keeps looping for a long time...

lyogavin commented 10 months ago

But now I have a new issue: the inference process keeps looping for a long time...

can you try setting max_new_tokens? Maybe try setting it to 2?
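
For example, adapted from the snippet above:

    generation_output = model.generate(
        input_ids['input_ids'].cuda(),
        max_new_tokens=2,  # AirLLM streams all 35 layer shards once per new token,
        use_cache=True,    # so long runs may be slow-by-design rather than hung
        return_dict_in_generate=True)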

nguyen-viet-hung commented 10 months ago

After I set max_new_tokens to 2, it runs 2 loops, stops, and gives back my prompt:

The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
cuda:0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:20<00:00,  1.68it/s]
The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.
cuda:0: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:20<00:00,  1.71it/s]
Trả lời: <s><s>[INST] <<SYS>>
 You are a multilingual, helpful, respectful and honest assistant. Please always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share incorrect information

Nhập một câu truy vấn: 
lyogavin commented 10 months ago

can you try the latest version: airllm-2.6.2?

I tried it here: https://github.com/lyogavin/Anima/blob/main/air_llm/tests/test_notebooks/test_sealllm.ipynb

it works.
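
(For reference, the upgrade is just pip install -U airllm==2.6.2.)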

nguyen-viet-hung commented 10 months ago

Hello,

I tried your latest version. With max_new_tokens = 2, it runs 2 loops and returns my prompt. If I set it to 30 or higher, it runs several loops and then errors out:

new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device): 100%|██████████████████████████████████████████████████████████████████████| 35/35 [00:14<00:00,  2.47it/s]
[... the same three lines repeat once per generated token: 25 more full 35/35 passes at roughly 13-14 s each ...]
new version of transfomer, no need to use BetterTransformer, try setting attn impl to sdpa...
attn imp: <class 'transformers.models.llama.modeling_llama.LlamaSdpaAttention'>
running layers(self.running_device):   3%|██                                                                     | 1/35 [00:00<00:23,  1.42it/s]
Traceback (most recent call last):
  File "/home/coreai/hungnv/chatbot-llm/air_seallm_extract.py", line 83, in <module>
    generation_output = model.generate(
  File "/home/coreai/anaconda3/envs/llm-lvd/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/coreai/anaconda3/envs/llm-lvd/lib/python3.10/site-packages/transformers/generation/utils.py", line 1718, in generate
    return self.greedy_search(
  File "/home/coreai/anaconda3/envs/llm-lvd/lib/python3.10/site-packages/transformers/generation/utils.py", line 2579, in greedy_search
    outputs = self(
  File "/home/coreai/anaconda3/envs/llm-lvd/lib/python3.10/site-packages/airllm/airllm_base.py", line 340, in __call__
    return self.forward(*args, **kwargs)
  File "/home/coreai/anaconda3/envs/llm-lvd/lib/python3.10/site-packages/airllm/airllm_base.py", line 540, in forward
    new_seq = layer(seq, **kwargs)[0]
  File "/home/coreai/anaconda3/envs/llm-lvd/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/coreai/anaconda3/envs/llm-lvd/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/coreai/anaconda3/envs/llm-lvd/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 796, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/coreai/anaconda3/envs/llm-lvd/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/coreai/anaconda3/envs/llm-lvd/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/coreai/anaconda3/envs/llm-lvd/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 704, in forward
    query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
  File "/home/coreai/anaconda3/envs/llm-lvd/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 234, in apply_rotary_pos_emb
    q_embed = (q * cos) + (rotate_half(q) * sin)
RuntimeError: The size of tensor a (513) must match the size of tensor b (512) at non-singleton dimension 2
(llm-lvd) [coreai@ai-workergpudev02 chatbot-llm]$
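
The 513-vs-512 mismatch suggests the running sequence length (prompt tokens plus already-generated tokens) outgrew a cache sized for 512 positions. A minimal sketch of one thing to try, assuming this AirLLM release exposes a max_seq_len constructor argument whose 512 default would explain the limit (verify against airllm_base.py in your install):

model = AirLLMLlama2(
    '/home/coreai/.cache/huggingface/hub/models--SeaLLMs--SeaLLM-7B-chat/snapshots/515af2338223985d32ced3307c018899396a2967',
    max_seq_len=2048)  # assumed kwarg; must cover prompt tokens + max_new_tokens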