
Intel® NPU Acceleration Library

It seems that the model is not loaded on NPU Memory #138

[Open] BigYellowTiger opened this issue 21 hours ago

BigYellowTiger commented 21 hours ago

Describe the bug
My CPU is an Intel Core Ultra 7 258V and the system is Windows 11 Home 24H2. I just tried running qwen2.5-7b-instruct with your example code for the first time. However, Task Manager shows that the model does not appear to be loaded into NPU memory (both NPU and GPU memory utilization stay at 0%); instead, it is loaded into system RAM. The subsequent inference is also very slow, roughly 1 second per token. My code and a Task Manager screenshot are below. Is this the expected behavior, or does something in the example code need to be changed?

```python
from torch.profiler import profile, ProfilerActivity
from transformers import AutoTokenizer, TextStreamer, AutoModelForCausalLM
from threading import Thread
import intel_npu_acceleration_library
import torch

model_id = "C:/all_project/all_llm_model/qwen2.5_7b_instruct/"

model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, use_default_system_prompt=True)
tokenizer.pad_token_id = tokenizer.eos_token_id
streamer = TextStreamer(tokenizer, skip_special_tokens=True)

print("Compile model for the NPU")
model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

print("Run inference")

query = input("user: ")
prefix = tokenizer(query, return_tensors="pt")["input_ids"]

generation_kwargs = dict(
    max_new_tokens=1000,
    input_ids=prefix,
    streamer=streamer,
    do_sample=True,
    top_k=50,
    top_p=0.9,  # note: the original paste had "topp"; the correct kwarg is "top_p"
)
_ = model.generate(**generation_kwargs)
```
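One quick way to sanity-check whether `compile()` actually replaced the model's layers (and therefore whether the NPU can be used at all) is to look at which packages the compiled submodules come from. The snippet below is only a diagnostic sketch that applies standard torch introspection to the `model` object from the code above; the assumption that NPU-backed layers are defined under the `intel_npu_acceleration_library` namespace should be checked against the output you actually see.

```python
import collections

# Tally submodule classes by the top-level package that defines them.
# If intel_npu_acceleration_library.compile() rewrote the linear layers,
# some modules should report its package as their origin; if everything
# still comes from torch/transformers, the model is effectively running
# as a plain CPU model and the NPU will show 0% utilization.
# (Diagnostic sketch only -- the exact class namespace is an assumption.)
origins = collections.Counter(
    type(module).__module__.split(".")[0]
    for _, module in model.named_modules()
)
print(origins)
```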

Screenshots: Task Manager screenshot attached (NPU and GPU memory utilization at 0%).

Desktop (please complete the following information):

- CPU: Intel Core Ultra 7 258V
- OS: Windows 11 Home 24H2

BigYellowTiger commented 20 hours ago

I just tested it: when running qwen1.5b on the NPU, inference takes approximately 0.26 seconds per token, whereas running the same model purely on the CPU takes about 0.13 seconds per token. So does this mean NPU inference is actually slower than pure CPU inference?
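For a fair comparison of those two numbers, it helps to time both configurations with the same prompt, a fixed number of generated tokens, and sampling disabled. Below is a minimal timing sketch that uses only standard transformers/torch APIs; the prompt text, token count, and the `npu_model`/`cpu_model` names are placeholders for illustration, not part of the library.

```python
import time
import torch

def seconds_per_token(model, tokenizer, prompt="Hello, how are you?", new_tokens=64):
    # Minimal latency probe: generate a fixed number of tokens greedily
    # and report the average wall-clock time per generated token.
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        start = time.perf_counter()
        out = model.generate(
            input_ids=input_ids,
            max_new_tokens=new_tokens,
            min_new_tokens=new_tokens,  # force a fixed-length generation
            do_sample=False,
        )
        elapsed = time.perf_counter() - start
    generated = out.shape[-1] - input_ids.shape[-1]
    return elapsed / max(generated, 1)

# Example usage (placeholder model objects):
# print(seconds_per_token(npu_model, tokenizer))   # compiled with intel_npu_acceleration_library
# print(seconds_per_token(cpu_model, tokenizer))   # plain CPU model
```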