ChenYuYeh opened this issue 1 month ago
From the log I see that the error is
TypeError: LlamaAttention.forward() got an unexpected keyword argument 'position_embeddings'
Already fixed by https://github.com/intel/intel-npu-acceleration-library/commit/2c8997bc99ddfc65022c215952679620c788564a
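For context, newer transformers releases compute the rotary (cos, sin) pair once per forward pass and hand it down to every attention layer as `position_embeddings`, so any patched attention forward has to at least accept that keyword. Below is a minimal compatibility-shim sketch, purely illustrative and not the library's actual patch (the wrapper name is hypothetical):

```python
import functools

def accept_position_embeddings(forward_fn):
    """Wrap an older patched forward() so it tolerates the extra kwarg."""
    @functools.wraps(forward_fn)
    def wrapper(*args, position_embeddings=None, **kwargs):
        # The precomputed (cos, sin) tuple is simply dropped here; an older
        # forward recomputes the rotary embeddings from position_ids itself.
        return forward_fn(*args, **kwargs)
    return wrapper

# Hypothetical usage, assuming the library has monkey-patched LlamaAttention:
# LlamaAttention.forward = accept_position_embeddings(LlamaAttention.forward)
```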
Sorry, it won't work.
This commit is already there:

```
commit 2c8997bc99ddfc65022c215952679620c788564a
Author: Nagico2 <nagico@qq.com>
Date:   Thu Jul 25 17:55:24 2024 +0800

    Add the position_imbeddings param to LlamaAttention.forward (#105)
```
The new log is simpler:

```
Run inference with Llama3.1 on NPU
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
terminate called after throwing an instance of 'ov::Exception'
  what():  Exception from src/inference/src/cpp/compiled_model.cpp:128:
Exception from src/plugins/intel_npu/src/backend/src/zero_infer_request.cpp:279:
Unsupported tensor precision: i4! Supported precisions: FP32, FP16, U8, I8, U16, I16, U32, I32, U64, I64
```
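A side note on the first two warnings only (they are not what raises the i4 exception): they come from calling `generate()` without an explicit attention mask or pad token. A minimal sketch of how to silence them, assuming a model and tokenizer already loaded as in examples/llama3.py; the prompt and token count are placeholders:

```python
# Sketch only: assumes `model` and `tokenizer` from examples/llama3.py are in scope.
# Pass attention_mask and pad_token_id explicitly so transformers does not have
# to guess them (Llama 3 reuses eos, id 128009, as pad).
prompt = "What is an NPU?"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # explicit mask
    pad_token_id=tokenizer.eos_token_id,      # explicit pad token
    max_new_tokens=128,                       # placeholder length
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```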
Note: the same error occurs when running llama3.py.
@alessandropalla the issue can be temporarily worked around by using int8, but I would like to use INT4 for faster response time. Could you help advise on the root cause? Thanks. Here is the change I tested:
```diff
--- a/examples/llama3.py
+++ b/examples/llama3.py
@@ -4,12 +4,12 @@
 from transformers import AutoTokenizer, TextStreamer
-from intel_npu_acceleration_library import NPUModelForCausalLM, int4
+from intel_npu_acceleration_library import NPUModelForCausalLM, int8
 from intel_npu_acceleration_library.compiler import CompilerConfig
 
 model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
-compiler_conf = CompilerConfig(dtype=int4)
+compiler_conf = CompilerConfig(dtype=int8)
 model = NPUModelForCausalLM.from_pretrained(
```
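For completeness, a sketch of the whole int8 workaround applied to the example; the `use_cache` and `config` keyword names are assumed from the library's other examples rather than copied from this thread:

```python
from transformers import AutoTokenizer, TextStreamer
from intel_npu_acceleration_library import NPUModelForCausalLM, int8
from intel_npu_acceleration_library.compiler import CompilerConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
# int4 currently triggers the "Unsupported tensor precision: i4" exception on NPU,
# so fall back to int8 until that is resolved.
compiler_conf = CompilerConfig(dtype=int8)
model = NPUModelForCausalLM.from_pretrained(
    model_id, use_cache=True, config=compiler_conf  # assumed keyword names
).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)
streamer = TextStreamer(tokenizer, skip_special_tokens=True)
# Generation then proceeds as in the original example (or the sketch above).
```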
I used commit 2c8997b but it doesn't work. I still get this error: "TypeError: LlamaAttention.forward() got an unexpected keyword argument 'position_embeddings'". Is there a new solution now?
It won't work for me either. Have you verified INT8?
> I used commit 2c8997b but it doesn't work. I still get this error: "TypeError: LlamaAttention.forward() got an unexpected keyword argument 'position_embeddings'". Is there a new solution now?
Sorry, I don't know how to verify INT8. I just want to test the feature shown in this video: https://www.youtube.com/watch?v=BNUDvscfrgc and I hit "TypeError: LlamaAttention.forward() got an unexpected keyword argument 'position_embeddings'" :(
Describe the bug
After modifying the llama3 example to use Llama 3.1 ("meta-llama/Meta-Llama-3.1-8B-Instruct"), the model downloads successfully, but an error is raised when the input is sent.
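For clarity, the only change made to examples/llama3.py was the checkpoint id (a minimal sketch):

```python
# Swap the Llama 3 instruct checkpoint for the Llama 3.1 one.
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # was "meta-llama/Meta-Llama-3-8B-Instruct"
```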
Desktop (please complete the following information):
error.log