BigYellowTiger opened 3 weeks ago
I just tested it, and when running qwen1.5b on the NPU, the inference speed is approximately 0.26 seconds per token. However, when running qwen1.5b purely on the CPU, the inference speed is 0.13 seconds per token. So, does this mean that NPU inference is actually slower than pure CPU inference?
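For reference, a minimal sketch of how per-token numbers like these can be measured (assuming `model` and `tokenizer` are already set up as in the code further down; the prompt and greedy-decoding settings here are just illustrative choices for a stable measurement, not part of the original script):

```python
import time

import torch

# Rough per-token latency measurement (assumes `model` and `tokenizer` already exist).
prompt_ids = tokenizer("Hello, how are you?", return_tensors="pt")["input_ids"]

with torch.no_grad():
    start = time.perf_counter()
    output = model.generate(input_ids=prompt_ids, max_new_tokens=64, do_sample=False)
    elapsed = time.perf_counter() - start

new_tokens = output.shape[-1] - prompt_ids.shape[-1]
print(f"{elapsed / new_tokens:.3f} s/token over {new_tokens} new tokens")
```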
Describe the bug
My CPU is an Intel Core Ultra 7 258V and the OS is Windows 11 Home 24H2. I just tried running the qwen2.5-7b-instruct model for the first time using your example code. However, Task Manager shows that the model does not seem to be loaded into NPU memory (both NPU memory and GPU memory utilization stay at 0%); instead it is loaded into system RAM. The subsequent inference is also very slow, roughly 1 second per token. My code and Task Manager screenshots are below. Is this the expected behavior, or is there something in the example code that needs to be modified?
```python
from torch.profiler import profile, ProfilerActivity
from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
from threading import Thread
import intel_npu_acceleration_library
import torch

model_id = "C:/all_project/all_llm_model/qwen2.5_7b_instruct/"

model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, use_default_system_prompt=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

print("Compile model for the NPU")
model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

print("Run inference")

query = input("user: ")
prefix = tokenizer(query, return_tensors="pt")["input_ids"]

generation_kwargs = dict(
    max_new_tokens=1000,
    input_ids=prefix,
    streamer=streamer,
    do_sample=True,
    top_k=50,
    top_p=0.9,
)
_ = model.generate(**generation_kwargs)
```
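A diagnostic sketch that may help narrow this down: after `compile()`, count which Python packages the model's submodules come from, to see whether any layers were actually swapped for NPU-backed implementations. This only uses standard PyTorch introspection and does not assume any particular class names inside intel_npu_acceleration_library:

```python
# Sketch (assumes `model` is the compiled model from the snippet above): count
# submodules by the top-level package that defines their class. If nothing under
# "intel_npu_acceleration_library" shows up, the layers were not replaced and
# generation would still run through plain torch.nn modules on the CPU.
from collections import Counter

origins = Counter(
    type(module).__module__.split(".")[0]
    for _, module in model.named_modules()
)
print(origins)
```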
Screenshots
(Task Manager screenshots showing NPU and GPU memory utilization at 0% while the model occupies system RAM; not reproduced here.)
Desktop (please complete the following information):
- OS: Windows 11 Home 24H2
- CPU: Intel Core Ultra 7 258V
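One additional check worth trying (note: this uses the separate OpenVINO Python package, not intel_npu_acceleration_library itself, so it only confirms that the NPU driver is visible to the system, not that the library is actually using it):

```python
# Sketch: list the compute devices the OpenVINO runtime can see. On a machine with a
# working NPU driver the list typically includes "NPU" alongside "CPU" and "GPU".
# Requires `pip install openvino`; independent of intel_npu_acceleration_library.
from openvino import Core

print(Core().available_devices)
```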