sujikarNStarx opened 5 months ago
I can reproduce your issue. The point is that the Llama 3 language-model head is so big that it takes a long time to compile, which completely degrades the user experience. We are working to improve the compilation time of such large kernels. I'll keep you posted.
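For anyone who wants to verify locally that the compilation step (rather than inference) is where the time goes, here is a minimal timing sketch; it assumes the same `intel_npu_acceleration_library.compile(model, dtype=torch.int8)` call used in the snippet further down and simply wraps it with wall-clock timing:

```python
import time

import torch
from transformers import AutoModelForCausalLM
import intel_npu_acceleration_library

# Load the model on CPU first, then time only the NPU compilation step.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct", use_cache=True
).eval()

start = time.perf_counter()
with torch.no_grad():
    model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)
print(f"NPU compilation took {time.perf_counter() - start:.1f} s")
```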
Noted, thanks for the swift response. I will wait for new updates.
I'm seeing a similar issue even with llama2-7b. I can run with sequence length 128, but seq_len=2048 hangs and never returns.
I was able to run this same example successfully about a month ago with a previous driver version.
```
python profile_llm.py --model meta-llama/Llama-2-7b-hf --context-size 2048 --dtype int8
```
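If it helps triage, a rough way to find the largest context size that still completes is to sweep the same command with a timeout. This is only a sketch around the `profile_llm.py` invocation above; the 30-minute cap per run is an arbitrary choice:

```python
import subprocess

# Sweep --context-size to find where profile_llm.py stops returning.
for ctx in (128, 256, 512, 1024, 2048):
    cmd = [
        "python", "profile_llm.py",
        "--model", "meta-llama/Llama-2-7b-hf",
        "--context-size", str(ctx),
        "--dtype", "int8",
    ]
    try:
        subprocess.run(cmd, timeout=1800, check=True)  # 30-minute cap per run
        print(f"context-size {ctx}: completed")
    except subprocess.TimeoutExpired:
        print(f"context-size {ctx}: timed out (hang or very long compile)")
        break
    except subprocess.CalledProcessError as err:
        print(f"context-size {ctx}: exited with code {err.returncode}")
        break
```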
We are able to generate output for shorter prompts using the NPU-compiled Llama 3 model, but if the prompt is large the model doesn't generate any output. Needed support: is there a fixed maximum prompt size?
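As a sanity check, it is worth measuring how many tokens the long prompt actually produces and comparing that against the model's context window. A small sketch (the `"..."` placeholder stands for the long system prompt used in the script below):

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

long_prompt = "..."  # paste the long conference-agenda system prompt here
messages = [{"role": "system", "content": long_prompt}]

prompt_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
print("prompt tokens:", prompt_ids.shape[-1])
print("model context window:", config.max_position_embeddings)  # 8192 for Llama 3 8B
```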
Code used:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import intel_npu_acceleration_library
import torch
import os

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
dtype = "int4"

PATH = os.path.join("models", model_id, dtype)
filename = os.path.join(PATH, "model.pth")
os.makedirs(PATH, exist_ok=True)

if not os.path.exists(filename):
    print("Compile model for the NPU")
    model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
    torch_dtype = torch.int8 if dtype == "int4" else torch.float16
    with torch.no_grad():
        model = intel_npu_acceleration_library.compile(model, dtype=torch_dtype)
    torch.save(model, filename)
    del model

print(f"Loading model from {filename}")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = torch.load(filename).eval()
streamer = TextStreamer(tokenizer, skip_special_tokens=True, skip_prompt=True)

print("Run inference with Llama3 on NPU\n")

query = input("> ")

DEFAULT_RAG_PROMPT = """You are a helpful, respectful, and honest Hostess for the EUC Conference 2024.
Conference Agenda: DAY 1 AGENDA: Tuesday, June 11, 2024 Knox and Ridge facilities Tour: Transfer slot 1: 8:30 am; Transfer slot 2: 9:30 am; Transfer slot 3:10:20 am; Transfer slot 4:11:20 am; Transfer slot 5:12:00 pm; Transfer slot 6:1:00 pm; Transfer slot 7:1:15 pm. ● Transfers from Delta Marriott to Knox and Ridge Facilities for those who opted to join the tour. ● Transportation from the Delta Marriott to the Knox and Ridge Facilities will be 30 minutes prior to tour start time Location: Delta Marriott Lobb Knox and Ridge tour registration timings : 7:15 am. to 3:30 pm. ● Registration ● Knox and Ridge tour Registration Location: Delta Marriott Lobby Day 1 Breakfast Timings : 7:30 am. to 9:30 am. ● Continental Breakfast Day 1 Breakfast Location: Delta Marriott Knox & Ridge Tour Timings : 8:30 am to 3:40 pm. Knox & Ridge Integration Center Tours to Optional (You must be pre-registered. Please review your personalized appointment for your specific tour time)
"""

messages = [
    {"role": "system", "content": DEFAULT_RAG_PROMPT},
    # Assumption: the user turn was presumably part of the original script,
    # since `query` is read above but never used in the posted snippet.
    {"role": "user", "content": query},
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, max_length=7200, return_tensors="pt"
).to(model.device)

terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]

outputs = model.generate(
    input_ids,
    max_length=7200,
    max_new_tokens=256,
    eos_token_id=terminators,  # assumption: terminators is defined but unused in the posted snippet
    streamer=streamer,         # assumption: streamer is defined but unused in the posted snippet
)
```
Error: No error shows up, but the response is also not generated.
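One way to tell whether `generate()` produced anything at all, or only echoed the prompt, is to decode just the newly generated tokens. A small sketch, meant to be appended after the `model.generate(...)` call in the script above (it reuses the `outputs`, `input_ids`, and `tokenizer` variables from there):

```python
# Slice off the prompt tokens and decode only what generate() added.
new_tokens = outputs[0][input_ids.shape[-1]:]
print("new tokens generated:", new_tokens.shape[-1])
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```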