Intel® NPU Acceleration Library (intel/intel-npu-acceleration-library), Apache License 2.0

[Llama3.1 8B] Need to pass input's `attention_mask` to obtain reliable results #109

Open ChenYuYeh opened 1 month ago

ChenYuYeh commented 1 month ago

Describe the bug

After switching the llama3 example from Llama 3 to Llama 3.1 ("meta-llama/Meta-Llama-3.1-8B-Instruct"), the model downloads fine. However, it reports an error when the input is sent:

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.

Desktop (please complete the following information):

error.log
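For reference, the warning itself comes from transformers' `generate()` and goes away once the mask is passed explicitly. A minimal sketch with plain transformers (model id per the report; the prompt string is illustrative):

```python
# Minimal sketch: tokenize with return_tensors="pt" to get both input_ids
# and attention_mask, then pass the mask (and an explicit pad_token_id,
# since pad == eos for this model) to generate().
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

inputs = tokenizer("What is an NPU?", return_tensors="pt")
output = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # explicit mask, nothing inferred
    pad_token_id=tokenizer.eos_token_id,      # pad == eos, stated explicitly
    max_new_tokens=64,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```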

alessandropalla commented 1 month ago

From the log I see that the error is `TypeError: LlamaAttention.forward() got an unexpected keyword argument 'position_embeddings'`

Already fixed by https://github.com/intel/intel-npu-acceleration-library/commit/2c8997bc99ddfc65022c215952679620c788564a
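For context on where the TypeError comes from: newer transformers releases pass a `position_embeddings` keyword (the precomputed rotary cos/sin pair) down into the attention `forward()`, so any replacement forward must accept that argument. A toy illustration of the failure mode, not the library's actual code:

```python
# Toy demonstration: a forward() with the older signature rejects the
# position_embeddings kwarg that newer transformers versions pass in,
# producing exactly the TypeError reported above.

def old_forward(hidden_states, attention_mask=None, position_ids=None):
    # pre-fix signature: no position_embeddings parameter
    return hidden_states

def patched_forward(hidden_states, attention_mask=None, position_ids=None,
                    position_embeddings=None, **kwargs):
    # post-fix signature: accepts the (cos, sin) pair from the caller
    return hidden_states

try:
    old_forward("h", position_embeddings=("cos", "sin"))
except TypeError as e:
    print(e)  # ... got an unexpected keyword argument 'position_embeddings'

patched_forward("h", position_embeddings=("cos", "sin"))  # works
```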

ChenYuYeh commented 1 month ago

Sorry, it doesn't work:

1. That commit is already in my tree:

   ```
   commit 2c8997bc99ddfc65022c215952679620c788564a
   Author: Nagico2 <nagico@qq.com>
   Date:   Thu Jul 25 17:55:24 2024 +0800

       Add the position_imbeddings param to LlamaAttention.forward (#105)
   ```

2. The new log is simpler:

```
Run inference with Llama3.1 on NPU
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
terminate called after throwing an instance of 'ov::Exception'
  what():  Exception from src/inference/src/cpp/compiled_model.cpp:128:
Exception from src/plugins/intel_npu/src/backend/src/zero_infer_request.cpp:279:
Unsupported tensor precision: i4! Supported precisions: FP32, FP16, U8, I8, U16, I16, U32, I32, U64, I64
```


Note: the same error occurs when running llama3.py.

ChenYuYeh commented 1 month ago

@alessandropalla the issue can be temporarily worked around by using int8, but I would like to use int4 for faster response times. Could you help identify the root cause? Thanks.

```diff
--- a/examples/llama3.py
+++ b/examples/llama3.py
@@ -4,12 +4,12 @@
 from transformers import AutoTokenizer, TextStreamer
-from intel_npu_acceleration_library import NPUModelForCausalLM, int4
+from intel_npu_acceleration_library import NPUModelForCausalLM, int8
 from intel_npu_acceleration_library.compiler import CompilerConfig

 model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

-compiler_conf = CompilerConfig(dtype=int4)
+compiler_conf = CompilerConfig(dtype=int8)
 model = NPUModelForCausalLM.from_pretrained(
```
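The workaround lines up with the exception text: the supported-precision list in the `zero_infer_request` error above includes I8 but not i4, which suggests the int4 path hands the NPU runtime a 4-bit tensor it cannot consume, while the int8 path stays within supported precisions.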

game202019 commented 3 weeks ago

I applied 2c8997b but it doesn't work. I still get this error: "TypeError: LlamaAttention.forward() got an unexpected keyword argument 'position_embeddings'". Is there a new solution now?

ChenYuYeh commented 3 weeks ago

> I applied 2c8997b but it doesn't work. I still get this error: "TypeError: LlamaAttention.forward() got an unexpected keyword argument 'position_embeddings'". Is there a new solution now?

It won't work for me either. Have you verified INT8?

game202019 commented 2 weeks ago

Sorry, I don't know how to verify INT8. I just want to test the functionality shown here: https://www.youtube.com/watch?v=BNUDvscfrgc, and I hit "TypeError: LlamaAttention.forward() got an unexpected keyword argument 'position_embeddings'" :(
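For anyone else landing here, "verifying INT8" just means running the example with the dtype switched, as in the diff earlier in the thread. A minimal sketch, assuming the stock examples/llama3.py structure (the `from_pretrained` keywords follow the repository's example):

```python
# Sketch of verifying the int8 path: examples/llama3.py with dtype=int8
# instead of int4. If this runs, int8 works on the NPU and the remaining
# failure is specific to the int4 (i4) path.
from transformers import AutoTokenizer, TextStreamer
from intel_npu_acceleration_library import NPUModelForCausalLM, int8
from intel_npu_acceleration_library.compiler import CompilerConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
compiler_conf = CompilerConfig(dtype=int8)
model = NPUModelForCausalLM.from_pretrained(
    model_id, use_cache=True, config=compiler_conf
).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id)
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

inputs = tokenizer("What is an NPU?", return_tensors="pt")
model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # explicit mask per the warning above
    max_new_tokens=128,
    streamer=streamer,
)
```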