intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc
Apache License 2.0

Run llama2-chat-hf with transformers 4.38.1 failed #10249

Open qiuxin2012 opened 8 months ago

qiuxin2012 commented 8 months ago

I get the error below:

<class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'>
Keyword arguments {'add_special_tokens': False} not recognized.
Keyword arguments {'add_special_tokens': False} not recognized.
/home/arda/xin/BigDL-xin/python/llm/dev/benchmark/all-in-one/../benchmark_util.py:1295: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
  warnings.warn(
2024-02-27 10:27:43,904 - ERROR - 

****************************Usage Error************************
Attention mask should be of size (1, 1, 33, 33), but is torch.Size([1, 1, 4096, 4096])
2024-02-27 10:27:43,904 - ERROR - 

****************************Call Stack*************************
Exception in thread Thread-4:
Traceback (most recent call last):
  File "/home/arda/anaconda3/envs/xin-llm/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/arda/anaconda3/envs/xin-llm/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/arda/xin/BigDL-xin/python/llm/dev/benchmark/all-in-one/run.py", line 62, in run_model_in_thread
    output_ids = model.generate(input_ids, do_sample=False, max_new_tokens=out_len,
  File "/home/arda/anaconda3/envs/xin-llm/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/arda/xin/BigDL-xin/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 1563, in generate
    return self.greedy_search(
  File "/home/arda/xin/BigDL-xin/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 2385, in greedy_search
    outputs = self(
  File "/home/arda/xin/BigDL-xin/python/llm/dev/benchmark/all-in-one/../benchmark_util.py", line 533, in __call__
    return self.model(*args, **kwargs)
  File "/home/arda/anaconda3/envs/xin-llm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/arda/anaconda3/envs/xin-llm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/arda/anaconda3/envs/xin-llm/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 1168, in forward
    outputs = self.model(
  File "/home/arda/anaconda3/envs/xin-llm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/arda/anaconda3/envs/xin-llm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/arda/anaconda3/envs/xin-llm/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 1008, in forward
    layer_outputs = decoder_layer(
  File "/home/arda/anaconda3/envs/xin-llm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/arda/anaconda3/envs/xin-llm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/arda/anaconda3/envs/xin-llm/lib/python3.9/site-packages/bigdl/llm/transformers/models/llama.py", line 190, in llama_decoder_forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/arda/anaconda3/envs/xin-llm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/arda/anaconda3/envs/xin-llm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/arda/anaconda3/envs/xin-llm/lib/python3.9/site-packages/bigdl/llm/transformers/models/llama.py", line 1047, in llama_attention_forward_4_36
    attn_output, attn_weights = native_sdp(query_states, key_states, value_states,
  File "/home/arda/anaconda3/envs/xin-llm/lib/python3.9/site-packages/bigdl/llm/transformers/models/llama.py", line 1090, in native_sdp
    invalidInputError(False,
  File "/home/arda/anaconda3/envs/xin-llm/lib/python3.9/site-packages/bigdl/llm/utils/common/log4Error.py", line 32, in invalidInputError
    raise RuntimeError(errMsg)
RuntimeError: Attention mask should be of size (1, 1, 33, 33), but is torch.Size([1, 1, 4096, 4096])
qiuxin2012 commented 8 months ago

Caused by this line in transformers 4.38.1: https://github.com/huggingface/transformers/blob/a0857740c0e6127485c11476650314df3accc2b6/src/transformers/models/llama/modeling_llama.py#L369
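
Transformers 4.38 builds a full-size causal mask (4096 x 4096 for Llama-2, per max_position_embeddings) and only slices it down inside the stock attention forward, while bigdl-llm's patched llama_attention_forward_4_36 still expects a mask of shape (bsz, 1, q_len, kv_seq_len) and rejects anything else in native_sdp. A minimal sketch of that failing check, using the shapes from the log above (illustrative only, not the library's exact code):

```python
import torch

# Shapes taken from the error above: a 33-token prompt, Llama-2's 4096 max positions.
bsz, q_len, kv_seq_len = 1, 33, 33
expected_shape = (bsz, 1, q_len, kv_seq_len)      # what the patched attention expects
attention_mask = torch.zeros(1, 1, 4096, 4096)    # what transformers 4.38.1 hands down

# Illustrative reproduction of the size check that raises the RuntimeError in native_sdp.
if attention_mask.size() != expected_shape:
    raise RuntimeError(
        f"Attention mask should be of size {expected_shape}, "
        f"but is {attention_mask.size()}"
    )
```

Until the patched attention handles the 4.38 mask layout, pinning an earlier transformers release (for example transformers==4.37.0; the exact version is an assumption) should avoid the mismatch.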