intel / intel-npu-acceleration-library

Intel® NPU Acceleration Library

NPU compiled Llama3 (int4) model not working if prompt size is large #40

Open sujikarNStarx opened 3 months ago

sujikarNStarx commented 3 months ago

We are able to generate output for shorter prompts using the NPU-compiled Llama3 model, but if the prompt is large, the model does not generate any output. Support needed: is there a fixed maximum prompt size?

Code used:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import intel_npu_acceleration_library
import torch
import os

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
dtype = "int4"

PATH = os.path.join("models", model_id, dtype)
filename = os.path.join(PATH, "model.pth")
os.makedirs(PATH, exist_ok=True)

if not os.path.exists(filename):
    print("Compile model for the NPU")
    model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
    torch_dtype = torch.int8 if dtype == "int4" else torch.float16
    with torch.no_grad():
        model = intel_npu_acceleration_library.compile(model, dtype=torch_dtype)
    torch.save(model, filename)
    del model

print(f"Loading model from {filename}")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = torch.load(filename).eval()
streamer = TextStreamer(tokenizer, skip_special_tokens=True, skip_prompt=True)

print("Run inference with Llama3 on NPU\n")

query = input("> ")

DEFAULT_normal_PROMPT = """You are a helpful, respectful, and honest Hostess for the EUC Conference 2024.

1. Your main role is to provide information and support to attendees, ensuring clarity, accuracy, and a welcoming tone.

Conference Agenda:

DAY 1 AGENDA: Tuesday, June 11, 2024

Knox and Ridge facilities Tour:
Transfer slot 1: 8:30 am; Transfer slot 2: 9:30 am; Transfer slot 3: 10:20 am;
Transfer slot 4: 11:20 am; Transfer slot 5: 12:00 pm; Transfer slot 6: 1:00 pm;
Transfer slot 7: 1:15 pm.
● Transfers from the Delta Marriott to the Knox and Ridge Facilities for those who opted to join the tour.
● Transportation from the Delta Marriott to the Knox and Ridge Facilities will be 30 minutes prior to tour start time.
Location: Delta Marriott Lobby

Knox and Ridge tour registration timings: 7:15 am to 3:30 pm
● Registration
● Knox and Ridge tour registration
Location: Delta Marriott Lobby

Day 1 Breakfast timings: 7:30 am to 9:30 am
● Continental Breakfast
Day 1 Breakfast location: Delta Marriott

Knox & Ridge Tour timings: 8:30 am to 3:40 pm
Knox & Ridge Integration Center Tours are optional. (You must be pre-registered. Please review your personalized appointment for your specific tour time.)

1. Tour 1: 8:30 am to 10:55 am
2. Tour 2: 9:30 am to 11:55 am
3. Tour 3: 10:20 am to 12:35 pm
4. Tour 4: 11:20 am to 1:35 pm
5. Tour 5: 12:00 pm to 2:25 pm
6. Tour 6: 1:00 pm to 3:25 pm
7. Tour 7: 1:15 pm to 3:40 pm

Knox: Discover why the world's largest organizations work with SHI to accelerate time-to-value for end-user computing investments. Experience the thrill and see first-hand how we help organizations improve employee services, streamline the supply chain, and achieve financial, operational, and sustainability goals. Get your popcorn ready and enjoy the show as SHI shows off how we can create a world of services, just for you!

Ridge: Home to the SHI Hardware Lifecycle Management and Integrated Data Center Solutions offerings, SHI Ridge is a 400,000 sq ft facility for organizations that require end-to-end lifecycle services.

Location: Knox & Ridge Integration Center Tours
"""

messages = [
    {
        "role": "system",
        "content": DEFAULT_normal_PROMPT,
    },
    {"role": "user", "content": query},
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, max_length=7200, return_tensors="pt"
).to(model.device)

terminators = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")]

outputs = model.generate(
    input_ids,
    max_length=7200,
    max_new_tokens=256,  # note: max_new_tokens takes precedence over max_length
    eos_token_id=terminators,
    do_sample=False,
    streamer=streamer,
)
```

Error: No error is shown, but no response is generated either.
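
One way to narrow this down (a minimal diagnostic sketch, reusing the `tokenizer` and `messages` from the script above) is to print the prompt's token count before generating, so failures can be correlated with prompt length:

```python
# Minimal diagnostic sketch: reuse the tokenizer and messages defined in the
# script above and print how many tokens the chat template produces, so a
# missing response can be correlated with prompt length.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
print(f"Prompt length: {input_ids.shape[-1]} tokens")
```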

alessandropalla commented 3 months ago

I can reproduce your issue. The point is that the Llama3 language model head is so big that it takes a long time to compile, which completely degrades the user experience... We are working to improve the compilation time of such large kernels. I'll keep you posted.
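
Since a long one-time compile can look like a hang, one way to tell them apart (a sketch reusing `model`, `input_ids`, and `terminators` from the script above, not an official diagnostic) is to time two identical generate calls; a very slow first call followed by a fast second call points to kernel compilation rather than a deadlock:

```python
import time

# Hedged sketch: time two identical generate calls using the model, input_ids,
# and terminators from the script above. If the first call is very slow but the
# second is fast, the delay is one-time kernel compilation, not a true hang.
for attempt in range(2):
    start = time.perf_counter()
    model.generate(
        input_ids,
        max_new_tokens=8,  # keep generation short; we only measure startup cost
        eos_token_id=terminators,
        do_sample=False,
    )
    print(f"Call {attempt + 1}: {time.perf_counter() - start:.1f} s")
```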

sujikarNStarx commented 3 months ago

Noted. Thanks for the swift response; I will wait for further updates.

rradjabi commented 2 months ago

I'm seeing a similar issue even with Llama2-7b. I can run with a sequence length of 128, but seq_len=2048 hangs and never returns.

I was able to run this same example successfully about a month ago with a previous driver version.

Steps to reproduce

```
python profile_llm.py --model meta-llama/Llama-2-7b-hf --context-size 2048 --dtype int8
```
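
To locate where the hang starts, one option (a sketch that assumes `profile_llm.py` accepts the same flags as the command above) is to sweep context sizes between the working 128 and the failing 2048 with a timeout:

```python
import subprocess

# Hedged sketch: sweep context sizes between the working (128) and failing
# (2048) values to locate where the hang begins. Assumes profile_llm.py
# accepts the flags shown above; the 10-minute timeout is an arbitrary choice.
for context_size in (128, 256, 512, 1024, 2048):
    cmd = [
        "python", "profile_llm.py",
        "--model", "meta-llama/Llama-2-7b-hf",
        "--context-size", str(context_size),
        "--dtype", "int8",
    ]
    try:
        subprocess.run(cmd, timeout=600, check=True)
        print(f"context-size {context_size}: OK")
    except subprocess.TimeoutExpired:
        print(f"context-size {context_size}: hung (timed out)")
    except subprocess.CalledProcessError:
        print(f"context-size {context_size}: exited with an error")
```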