WeiguangHan opened this issue 2 months ago
Hi @WeiguangHan , we will take a look at this issue and try to reproduce it first. We'll let you know if there's any progress.
Hi @WeiguangHan, we cannot reproduce the issue on an Ultra 5 125H CPU.
The CPU usage when running the qwen1.5 example script turned out pretty normal: given that the initial usage is about 9 GB, the peak CPU memory usage for loading the Qwen1.5-14B model (converted to int4 using `save.py`) is about 10 GB. The inference speed is 9.2 tokens/sec when `n-predict` is set to the default 32.
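For context, the conversion-and-reload flow that `save.py` performs maps onto ipex-llm's low-bit API roughly as follows. This is an illustrative sketch rather than the exact example script; the model path and save directory are placeholders.

```python
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "Qwen/Qwen1.5-14B-Chat"  # placeholder: local path or HF repo id
save_path = "./qwen1.5-14b-int4"      # placeholder: output directory

# One-time conversion: load the checkpoint in symmetric int4
# and persist the converted weights to disk.
model = AutoModelForCausalLM.from_pretrained(
    model_path, load_in_low_bit="sym_int4", trust_remote_code=True
)
model.save_low_bit(save_path)

# Subsequent runs: reload the already-converted model, which keeps
# peak CPU memory close to the int4 model size (~10 GB here).
model = AutoModelForCausalLM.load_low_bit(save_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

with torch.inference_mode():
    input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids
    output = model.generate(input_ids, max_new_tokens=32)  # n-predict default
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```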
Also, please note that it is recommended to run performance evaluation with the all-in-one benchmark (https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/dev/benchmark/all-in-one).
Reference config:
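The original config attachment is not reproduced here; a `config.yaml` along the lines below, mirroring the settings visible in the demo output (sym_int4, 1024-128, batch size 1, cpu_embedding off), should be close. Treat the paths and trial counts as placeholders.

```yaml
repo_id:
  - 'Qwen/Qwen1.5-14B-Chat'
local_model_hub: '/path/to/local/models'  # placeholder
warm_up: 1                                # placeholder trial counts
num_trials: 3
num_beams: 1          # greedy search, matching num_beams=1 in the output
low_bit: 'sym_int4'   # symmetric int4, matching the demo output
batch_size: 1
in_out_pairs:
  - '1024-128'        # 1024 input tokens / 128 output tokens
test_api:
  - 'transformer_int4'  # CPU path of the all-in-one benchmark
cpu_embedding: False
```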
Below is the demo output on our machine:
```
,model,1st token avg latency (ms),2+ avg latency (ms/token),encoder time (ms),input/output tokens,batch_size,actual input/output tokens,num_beams,low_bit,cpu_embedding,model loading time (s),peak mem (GB),streaming,use_fp16_torch_dtype
0,/Qwen1.5-14B-Chat,4517.94,96.96,0.0,1024-128,1,1024-128,1,sym_int4,False,16.18,9.94921875,False,N/A
```
Thanks a lot. The CPU of my computer is an Ultra 7 155H, which should theoretically perform better. I will try it again following your instructions.
I have tested the inference speed and memory usage of Qwen1.5-14B on my machine using the example in ipex-llm. The peak CPU usage while loading Qwen1.5-14B in 4-bit is about 24 GB, the peak GPU usage is about 10 GB, and the inference speed is about 4~5 tokens/s. I exported the environment variables `set SYCL_CACHE_PERSISTENT=1` and `set BIGDL_LLM_XMX_DISABLED=1`. Do the inference speed and CPU/GPU memory usage meet expectations? I think the peak CPU usage is too high and the speed is a little slow.

device: Intel(R) Core(TM) Ultra 7 155H 3.80 GHz, 32.0 GB RAM (31.6 GB usable)
env: intel-extension-for-pytorch 2.1.10+xpu, torch 2.1.0a0+cxx11.abi, transformers 4.44.2
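To compare peak CPU memory figures consistently, one option is to bracket the load step with psutil, as in the sketch below; `load_low_bit` and the save directory are the same assumptions as in the earlier sketch, and `peak_wset` is a Windows-specific field.

```python
# Sketch: measure process memory around the low-bit model load with psutil.
import os
import psutil

proc = psutil.Process(os.getpid())
print(f"RSS before load: {proc.memory_info().rss / 2**30:.2f} GB")

from ipex_llm.transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.load_low_bit(
    "./qwen1.5-14b-int4", trust_remote_code=True  # placeholder path
)

mem = proc.memory_info()
print(f"RSS after load:  {mem.rss / 2**30:.2f} GB")
# On Windows, memory_info() also exposes the process peak working set:
print(f"Peak working set: {mem.peak_wset / 2**30:.2f} GB")
```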