intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Question about benchmark result #11664

Open xeasonx opened 1 month ago

xeasonx commented 1 month ago

I used the all-in-one benchmark to test on the NPU of an Intel Core Ultra 9 185H. The model is Qwen/Qwen2-7B. I'm confused about the result: the chart in this repo shows 19.6 tokens/s for the 32-token input case on an Intel Core Ultra 7 165H, but my result in the CSV file is:

1st token average latency (ms): 617.64
2+ avg latency (ms/token): 340.45
encoder time (ms): 0
input/output tokens: 32-32
batch_size: 1
actual input/output tokens: 32-32
num_beams: 1
low_bit: sym_int4
cpu_embedding: False
model loading time (s): 88.19
peak mem (GB): N/A
streaming: False
use_fp16_torch_dtype: N/A
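
For context, the 2+ token latency above works out to roughly 3 tokens/s, far below the figure in the repo's chart. A quick conversion (plain arithmetic on the numbers above, not part of the benchmark output):

```python
# Convert the reported decode latency from the CSV above into throughput.
first_token_latency_ms = 617.64   # 1st token average latency (ms)
next_token_latency_ms = 340.45    # 2+ avg latency (ms/token)

decode_tokens_per_second = 1000 / next_token_latency_ms
print(f"decode throughput: {decode_tokens_per_second:.1f} tokens/s")  # ~2.9 tokens/s
```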

My question is: why is my throughput so much lower than the number shown in the repo's chart?

My config is:

repo_id:
  - 'Qwen/Qwen2-7B'
local_model_hub: 'path/to/local/model'
warm_up: 1
num_trials: 3
num_beams: 1
low_bit: 'sym_int4'
batch_size: 1
in_out_pairs:
  - '32-32'
  - '1024-128'
test_api:
  - "transformers_int4_npu_win" 
cpu_embedding: False # whether to put the embedding layer on CPU
streaming: False
task: 'continuation'
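
As a sanity check before launching the harness, a minimal sketch for loading and echoing this config (the file name config.yaml and the keys simply mirror the snippet above; this is not an official validation step of the all-in-one benchmark):

```python
import yaml  # pip install pyyaml

# Load the benchmark config shown above (the path is an assumption for illustration).
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# Echo the fields most relevant to this issue: which backend is exercised,
# which quantization is requested, and which prompt lengths are tested.
print("test_api:     ", cfg["test_api"])
print("low_bit:      ", cfg["low_bit"])
print("in_out_pairs: ", cfg["in_out_pairs"])
print("cpu_embedding:", cfg["cpu_embedding"])
```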

While the benchmark is running, I can see NPU utilization in Task Manager.

jason-dai commented 1 month ago

It uses the iGPU; as mentioned in the README, please refer to [2][3][4] for more details.

grandxin commented 4 weeks ago

Have you solved this problem? I also ran the Qwen2-7B (int4) example on the NPU, and inference is too slow: only 2-3 tokens/s.
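
For reference, the Qwen2 NPU example loads the model along these lines (a simplified sketch based on the repo's NPU examples; the model path, prompt, and generation settings are illustrative, and exact arguments may vary between ipex-llm versions):

```python
# Simplified sketch of int4 NPU inference with ipex-llm (arguments may vary by version).
from ipex_llm.transformers.npu_model import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2-7B"  # placeholder: point this at your local model

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit="sym_int4",  # same quantization requested in the config above
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

inputs = tokenizer("AI is a", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```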