intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc.
Apache License 2.0

Inference speed and memory usage of Qwen1.5-14b #12015

Open WeiguangHan opened 2 months ago

WeiguangHan commented 2 months ago

I have tested the inference speed and memory usage of Qwen1.5-14b on my machine using the example in ipex-llm. The peak CPU usage while loading Qwen1.5-14b in 4-bit is about 24GB, and the peak GPU usage is about 10GB. The inference speed is about 4-5 tokens/s. I set the environment variables (`set SYCL_CACHE_PERSISTENT=1` and `set BIGDL_LLM_XMX_DISABLED=1`). Do the inference speed and CPU/GPU memory usage meet expectations? The peak CPU usage seems too high to me, and the speed is a little slow.
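For reference, the test followed the ipex-llm example pattern, roughly like the sketch below (a minimal sketch assuming ipex-llm's transformers-style API; the model path and prompt are placeholders):

```python
# Minimal sketch of 4-bit Qwen1.5-14B inference with ipex-llm on an Intel iGPU.
# Assumes ipex-llm's transformers-style API; model path and prompt are placeholders.
import os
import torch

# Environment variables mentioned above (on Windows these were set with `set ...`
# before launching the script).
os.environ["SYCL_CACHE_PERSISTENT"] = "1"
os.environ["BIGDL_LLM_XMX_DISABLED"] = "1"

from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen1.5-14B-Chat"  # placeholder
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_4bit=True,          # quantize weights to int4 while loading
    trust_remote_code=True,
)
model = model.to("xpu")         # move the quantized model to the Intel GPU

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids.to("xpu")

with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```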

Device: Intel(R) Core(TM) Ultra 7 155H, 3.80 GHz, 32.0 GB RAM (31.6 GB available)

Environment:

- intel-extension-for-pytorch 2.1.10+xpu
- torch 2.1.0a0+cxx11.abi
- transformers 4.44.2

JinheTang commented 2 months ago

Hi @WeiguangHan , we will take a look at this issue and try to reproduce it first. We'll let you know if there's any progress.

JinheTang commented 2 months ago

Hi @WeiguangHan , we cannot reproduce the issue on an Ultra 5 125H CPU.

The CPU usage when running the qwen1.5 example script turned out to be normal: given that the initial usage is about 9GB, the peak CPU memory usage while loading the Qwen1.5-14B model (converted to int4 using save.py) is about 10GB. The inference speed is 9.2 tokens/s when n-predict is set to the default of 32.
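For context, a save.py-style conversion persists the int4 weights once, so later loads skip the full-precision pass; this is likely why the peak CPU usage here stays far below the ~24GB reported above. A hedged sketch of that pattern using ipex-llm's save_low_bit/load_low_bit API (paths are placeholders):

```python
# Hedged sketch of low-bit conversion and reloading with ipex-llm
# (what a save.py-style script typically does); paths are placeholders.
from ipex_llm.transformers import AutoModelForCausalLM

# One-time conversion: load in int4, then persist the quantized weights.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-14B-Chat", load_in_4bit=True, trust_remote_code=True
)
model.save_low_bit("./qwen1.5-14b-int4")

# Later runs: reload the already-quantized checkpoint, which avoids
# materializing the full-precision weights in CPU memory.
model = AutoModelForCausalLM.load_low_bit(
    "./qwen1.5-14b-int4", trust_remote_code=True
).to("xpu")
```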

Also, please note that it is recommended to run performance evaluation with the all-in-one benchmark (https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/dev/benchmark/all-in-one).
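Reference config: roughly the following (a hedged sketch of the benchmark's config.yaml; quantization, beams, batch size and token lengths are inferred from the output row below, while the remaining values and paths are assumptions):

```yaml
# Hedged sketch of the all-in-one benchmark's config.yaml; some values are
# inferred from the demo output below, the rest are labeled assumptions.
repo_id:
  - 'Qwen/Qwen1.5-14B-Chat'
local_model_hub: '/path/to/models'   # placeholder
warm_up: 1                           # assumption
num_trials: 3                        # assumption
num_beams: 1                         # matches num_beams=1 in the output
low_bit: 'sym_int4'                  # matches low_bit=sym_int4 in the output
batch_size: 1
in_out_pairs:
  - '1024-128'                       # matches input/output tokens in the output
test_api:
  - 'transformer_int4_gpu'           # assumption: int4 path on Intel GPU
cpu_embedding: False                 # matches cpu_embedding=False in the output
```

After editing the config, the benchmark is started with `python run.py` from that directory, and it writes a CSV report. Below is the demo output on our machine: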

```csv
,model,1st token avg latency (ms),2+ avg latency (ms/token),encoder time (ms),input/output tokens,batch_size,actual input/output tokens,num_beams,low_bit,cpu_embedding,model loading time (s),peak mem (GB),streaming,use_fp16_torch_dtype
0,/Qwen1.5-14B-Chat,4517.94,96.96,0.0,1024-128,1,1024-128,1,sym_int4,False,16.18,9.94921875,False,N/A
```

WeiguangHan commented 2 months ago

> Hi @WeiguangHan , we cannot reproduce the issue on an Ultra 5 125H CPU. […] The inference speed is 9.2 tokens/s when n-predict is set to the default of 32. […] it is recommended to run performance evaluation with the all-in-one benchmark. […]

Thanks a lot. The CPU of my computer is an Ultra 7 155H, which should theoretically perform better. I will try again following your instructions.