intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

Question about benchmark result #11664

Open xeasonx opened 1 month ago

xeasonx commented 1 month ago

I used the all-in-one benchmark to test on the NPU of an Intel Core Ultra 9 185H. The model is Qwen/Qwen2-7B. I'm confused about the result: the chart in this repo shows 19.6 tokens/s for the 32-token input case on an Intel Core Ultra 7 165H, but my result in the CSV file is:

1st token average latency (ms): 617.64
2+ avg latency (ms/token): 340.45
encoder time (ms): 0
input/output tokens: 32-32
batch_size: 1
actual input/output tokens: 32-32
num_beams: 1
low_bit: sym_int4
cpu_embedding: False
model loading time (s): 88.19
peak mem (GB): N/A
streaming: False
use_fp16_torch_dtype: N/A
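
For context, the 2+ token latency above works out to roughly 3 tokens/s, far below the figure in the repo's chart. A quick conversion (plain arithmetic on the numbers above, not part of the benchmark output):

```python
# Convert the reported decode latency from the CSV above into throughput.
first_token_latency_ms = 617.64   # 1st token average latency (ms)
next_token_latency_ms = 340.45    # 2+ avg latency (ms/token)

decode_tokens_per_second = 1000 / next_token_latency_ms
print(f"decode throughput: {decode_tokens_per_second:.1f} tokens/s")  # ~2.9 tokens/s
```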

My question is: why is my throughput so much lower than the number shown in the repo's chart?

My config is:

repo_id:
  - 'Qwen/Qwen2-7B'
local_model_hub: 'path/to/local/model'
warm_up: 1
num_trials: 3
num_beams: 1
low_bit: 'sym_int4'
batch_size: 1
in_out_pairs:
  - '32-32'
  - '1024-128'
test_api:
  - "transformers_int4_npu_win" 
cpu_embedding: False # whether to put the embedding layer on CPU
streaming: False
task: 'continuation'
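
As a sanity check before launching the harness, a minimal sketch for loading and echoing this config (the file name config.yaml and the keys simply mirror the snippet above; this is not an official validation step of the all-in-one benchmark):

```python
import yaml  # pip install pyyaml

# Load the benchmark config shown above (the path is an assumption for illustration).
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# Echo the fields most relevant to this issue: which backend is exercised,
# which quantization is requested, and which prompt lengths are tested.
print("test_api:     ", cfg["test_api"])
print("low_bit:      ", cfg["low_bit"])
print("in_out_pairs: ", cfg["in_out_pairs"])
print("cpu_embedding:", cfg["cpu_embedding"])
```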

While the benchmark is running, I can see NPU utilization in Task Manager.

jason-dai commented 1 month ago

It uses the iGPU; as mentioned in the README, please refer to [2][3][4] for more details.

grandxin commented 4 weeks ago

Have you solved this problem? I also ran the Qwen2-7B (int4) example on the NPU, and inference is too slow: only 2-3 tokens/s.
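
For reference, the Qwen2 NPU example loads the model along these lines (a simplified sketch based on the repo's NPU examples; the model path, prompt, and generation settings are illustrative, and exact arguments may vary between ipex-llm versions):

```python
# Simplified sketch of int4 NPU inference with ipex-llm (arguments may vary by version).
from ipex_llm.transformers.npu_model import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2-7B"  # placeholder: point this at your local model

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit="sym_int4",  # same quantization requested in the config above
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

inputs = tokenizer("AI is a", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```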