
Intel® NPU Acceleration Library

It seems that the model is not loaded on NPU Memory #138

[Open] BigYellowTiger opened this issue 21 hours ago

BigYellowTiger commented 21 hours ago

Describe the bug
My CPU is an Intel Core Ultra 7 258V and the system is Windows 11 Home 24H2. I just tried running qwen2.5-7b-instruct with your example code for the first time. However, Task Manager shows that the model does not appear to be loaded into NPU memory (both NPU and GPU memory utilization stay at 0%); instead, it is loaded into system RAM. The subsequent inference is also very slow, roughly 1 second per token. My code and a Task Manager screenshot are below. Is this the expected behavior, or does something in the example code need to be changed?

```python
from torch.profiler import profile, ProfilerActivity
from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
from threading import Thread
import intel_npu_acceleration_library
import torch

model_id = "C:/all_project/all_llm_model/qwen2.5_7b_instruct/"

model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, use_default_system_prompt=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

print("Compile model for the NPU")
model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

print("Run inference")

query = input("user: ")
prefix = tokenizer(query, return_tensors="pt")["input_ids"]

generation_kwargs = dict(
    max_new_tokens=1000,
    input_ids=prefix,
    streamer=streamer,
    do_sample=True,
    top_k=50,
    top_p=0.9,  # note: the original paste had "topp"; the correct kwarg is "top_p"
)
_ = model.generate(**generation_kwargs)
```
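One quick way to sanity-check whether `compile()` actually replaced the model's layers (and therefore whether the NPU can be used at all) is to look at which packages the compiled submodules come from. The snippet below is only a diagnostic sketch that applies standard torch introspection to the `model` object from the code above; the assumption that NPU-backed layers are defined under the `intel_npu_acceleration_library` namespace should be checked against the output you actually see.

```python
import collections

# Tally submodule classes by the top-level package that defines them.
# If intel_npu_acceleration_library.compile() rewrote the linear layers,
# some modules should report its package as their origin; if everything
# still comes from torch/transformers, the model is effectively running
# as a plain CPU model and the NPU will show 0% utilization.
# (Diagnostic sketch only -- the exact class namespace is an assumption.)
origins = collections.Counter(
    type(module).__module__.split(".")[0]
    for _, module in model.named_modules()
)
print(origins)
```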

Screenshots: Task Manager screenshot attached (NPU and GPU memory utilization at 0%).

Desktop (please complete the following information):

- CPU: Intel Core Ultra 7 258V
- OS: Windows 11 Home 24H2

BigYellowTiger commented 20 hours ago

I just tested it: when running qwen1.5b on the NPU, inference takes approximately 0.26 seconds per token, whereas running the same model purely on the CPU takes about 0.13 seconds per token. So does this mean NPU inference is actually slower than pure CPU inference?
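For a fair comparison of those two numbers, it helps to time both configurations with the same prompt, a fixed number of generated tokens, and sampling disabled. Below is a minimal timing sketch that uses only standard transformers/torch APIs; the prompt text, token count, and the `npu_model`/`cpu_model` names are placeholders for illustration, not part of the library.

```python
import time
import torch

def seconds_per_token(model, tokenizer, prompt="Hello, how are you?", new_tokens=64):
    # Minimal latency probe: generate a fixed number of tokens greedily
    # and report the average wall-clock time per generated token.
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        start = time.perf_counter()
        out = model.generate(
            input_ids=input_ids,
            max_new_tokens=new_tokens,
            min_new_tokens=new_tokens,  # force a fixed-length generation
            do_sample=False,
        )
        elapsed = time.perf_counter() - start
    generated = out.shape[-1] - input_ids.shape[-1]
    return elapsed / max(generated, 1)

# Example usage (placeholder model objects):
# print(seconds_per_token(npu_model, tokenizer))   # compiled with intel_npu_acceleration_library
# print(seconds_per_token(cpu_model, tokenizer))   # plain CPU model
```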