Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
I used the all-in-one benchmark to test on the NPU of an Intel Core Ultra 9 185H. The model is Qwen/Qwen2-7B.
I'm confused about the result: the image in this repo shows 19.6 tokens/s at 32 input tokens on an Intel Core Ultra 7 165H.
My result in the CSV file is:
My questions are:
My config is:
When running the benchmark, I can see in Task Manager that the NPU is being used.
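For reference, this is roughly what a `config.yaml` for ipex-llm's all-in-one benchmark looks like. This is a minimal sketch rather than my exact file; the paths are placeholders and the NPU `test_api` value is an assumption that may differ between releases:

```yaml
# Minimal sketch of an all-in-one benchmark config.yaml.
# Field names follow the benchmark's documented config; the
# 'transformers_int4_npu_win' test_api value is an assumption.
repo_id:
  - 'Qwen/Qwen2-7B'
local_model_hub: 'D:\llm\models'      # hypothetical local model path
warm_up: 1                            # warm-up runs excluded from results
num_trials: 3                         # measured runs averaged into the CSV
num_beams: 1                          # greedy search
low_bit: 'sym_int4'                   # symmetric int4 quantization
batch_size: 1
in_out_pairs:
  - '32-32'                           # 32 input tokens, 32 output tokens
  - '1024-128'
test_api:
  - 'transformers_int4_npu_win'       # assumed NPU backend name
cpu_embedding: False
```

As far as I understand, the benchmark is then launched with `python run.py` from the all-in-one directory, and it writes the results (including first-token latency and subsequent tokens/s) to a timestamped CSV.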