intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0
6.42k stars 1.23k forks

chatglm2-6b performance not good on Arc770 #10587

Open qing-xu-intel opened 4 months ago

qing-xu-intel commented 4 months ago

1) Python code: test.txt
2) pip requirements: requirements.txt
3) Model link: https://hf-mirror.com/THUDM/chatglm2-6b/tree/main
4) Other pip installs: pip install torch==2.0.1a0+cxx11.abi torchvision==0.15.2a0+cxx11.abi intel_extension_for_pytorch==2.0.110+xpu -f https://developer.intel.com/ipex-whl-stable-xpu
5) oneAPI version: l_BaseKit_p_2023.2.0.49397_offline.sh
6) Linux version:
(base) llm@llm-NUC13RNGi9:~$ uname -a
Linux llm-NUC13RNGi9 6.5.0-25-generic #25~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Feb 20 16:09:15 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
7) After the above installation, my environment is (Python 3.9.19):
(ces2024_python39) llm@llm-NUC13RNGi9:~$ pip list
bigdl-core-xe                2.5.0b20240324
bigdl-core-xe-esimd          2.5.0b20240324
bigdl-llm                    2.5.0b20240324
intel-extension-for-pytorch  2.0.110+xpu
intel-openmp                 2024.0.3
ipex-llm                     2.1.0b20240326
8) Low-level environment:
(ces2024_python39) llm@llm-NUC13RNGi9:~$ sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.6.0.22_223734]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K 3.0 [2023.16.6.0.22_223734]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.22.26516.34]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 770 3.0 [23.22.26516.34]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26516]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) UHD Graphics 770 1.3 [1.3.26516]

9) Run the application: python test.py. With ~10 input tokens and <100 output tokens, .generate() takes ~4s. (A minimal sketch of this kind of timing script follows the output below.)

-------------------- Output --------------------
Q: Please give a brief introduction to Shanghai.

A: Shanghai is one of China's largest cities, located on the country's eastern coast; it is a place of unique charm that combines a long history with a modern city. Shanghai is an important financial, commercial, and shipping center of China, and one of the most dynamic and attractive cities in the world.

Shanghai has many unique cultural attractions, cuisines, and sights. The Shanghai Museum, the Shanghai City History Museum, and the Shanghai Science and Technology Museum are great places to explore the city's past and future. The Bund is Shanghai's landmark architectural district, where you can admire the city's oldest buildings and busiest streets. Nanjing Road Pedestrian Street is a vibrant and fashionable shopping area where you can sample all kinds of food. Shanghai Disney Resort is one of Shanghai's most popular tourist attractions, offering a wide variety of
==============================3 4.142536640167236
==============================4 0.0010402202606201172
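For reference, below is a minimal sketch of the kind of timing script being described. The model path and prompt are placeholders, and the transformers-style ipex-llm API shown follows its published examples; check the exact class names and arguments against the installed version.

```python
import time
import torch
from transformers import AutoTokenizer

# ipex-llm's transformers-compatible wrapper loads the model in a low-bit
# (4-bit) format suitable for running on Arc GPUs.
from ipex_llm.transformers import AutoModel

model_path = "THUDM/chatglm2-6b"  # a local checkpoint path also works
model = AutoModel.from_pretrained(model_path, load_in_4bit=True,
                                  trust_remote_code=True).to("xpu")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

prompt = "请简单地介绍一下上海。"  # roughly 10 input tokens
inputs = tokenizer(prompt, return_tensors="pt").to("xpu")

with torch.inference_mode():
    # Warm-up run: the first call includes kernel compilation and is much
    # slower than steady-state generation, so it is excluded from timing.
    model.generate(**inputs, max_new_tokens=32)
    torch.xpu.synchronize()

    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=100)
    torch.xpu.synchronize()
    print(tokenizer.decode(output[0], skip_special_tokens=True))
    print(f".generate() took {time.perf_counter() - start:.2f}s")
```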

qing-xu-intel commented 4 months ago

After updating the whole environment, from the GPU driver to the ipex_llm package, performance improved: in my case chatglm inference time dropped from 3.6s to 1.4s. Two questions remain:

1. Is 1.4s expected on the dGPU Arc770?
2. I met another issue: in my application I chain whisper + chatglm, but I found that if chatglm runs after whisper, its .generate() duration increases from 1.4s to 2.1s. Is there any way to improve this?

Thanks!

qing-xu-intel commented 4 months ago

Also, I found that the config below does not improve chatglm performance; on the contrary, "SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1" makes it worse, increasing inference time from 1.4s to more than 3s.

export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1

hkvision commented 4 months ago

> Also, I found that the config below does not improve chatglm performance; on the contrary, "SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1" makes it worse, increasing inference time from 1.4s to more than 3s.
>
> export USE_XETLA=OFF
> export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
> export SYCL_CACHE_PERSISTENT=1

See https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/dev/benchmark/all-in-one/run-arc.sh: the SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS environment variable makes performance worse on Linux kernel 6.5, so we have removed it for kernel 6.5 in our script, please check it :)
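To illustrate the point, here is a rough sketch (not the project's run-arc.sh itself) of setting these variables conditionally from Python, under the assumption that they must be in the environment before torch / IPEX initializes the XPU backend:

```python
import os
import platform

# Kernel release looks like "6.5.0-25-generic"; on 6.5 the immediate
# command lists flag has been observed to hurt performance, so skip it there.
kernel = tuple(int(x) for x in platform.release().split("-")[0].split(".")[:2])

os.environ["USE_XETLA"] = "OFF"
os.environ["SYCL_CACHE_PERSISTENT"] = "1"
if kernel < (6, 5):
    os.environ["SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS"] = "1"

# Import torch / IPEX only after the environment is set, so the SYCL runtime
# picks the variables up when the XPU backend comes up.
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401
```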

hkvision commented 4 months ago

> Is 1.4s expected on the dGPU Arc770?

Could you check your exact input and output? If, as you mentioned, the input is around 10 tokens and the output is around 100 tokens, then a 1.4s generation time corresponds to an average latency of roughly 12-14ms per token, which should be reasonable I suppose?
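As a quick back-of-the-envelope check of that estimate (the token count below is the approximate figure from this thread, not a measured value):

```python
# Approximate figures from the discussion above.
total_generate_time_s = 1.4   # end-to-end .generate() time
output_tokens = 100           # roughly 100 generated tokens

# Average latency per generated token; the first-token (prefill) cost is
# folded in, which is why the quoted range dips slightly below this number.
avg_latency_ms = total_generate_time_s / output_tokens * 1000
print(f"~{avg_latency_ms:.0f} ms per token")  # ~14 ms per token
```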

hkvision commented 4 months ago

> I met another issue: in my application I chain whisper + chatglm, but I found that if chatglm runs after whisper, its .generate() duration increases from 1.4s to 2.1s. Is there any way to improve this?

Syncing offline for this issue.
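For context, here is a minimal sketch of the kind of whisper + chatglm chain being described, assuming both models are loaded through ipex-llm's transformers-style wrappers on the same XPU. The synchronize/empty_cache calls between the two stages are an assumption worth checking when the second model slows down after the first, not a confirmed fix; class names follow ipex-llm's published examples and should be verified against the installed version.

```python
import time
import torch
from transformers import AutoTokenizer, WhisperProcessor

# Hypothetical sketch of the whisper + chatglm chain described above.
from ipex_llm.transformers import AutoModel, AutoModelForSpeechSeq2Seq

whisper = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-small", load_in_4bit=True).to("xpu")
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

chatglm = AutoModel.from_pretrained(
    "THUDM/chatglm2-6b", load_in_4bit=True, trust_remote_code=True).to("xpu")
tokenizer = AutoTokenizer.from_pretrained(
    "THUDM/chatglm2-6b", trust_remote_code=True)

def transcribe(audio, sampling_rate=16000) -> str:
    features = processor(audio, sampling_rate=sampling_rate,
                         return_tensors="pt").input_features.to("xpu")
    ids = whisper.generate(features)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]

def answer(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to("xpu")
    out = chatglm.generate(**inputs, max_new_tokens=100)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# text = transcribe(audio_array)     # whisper stage (audio loading omitted)
torch.xpu.synchronize()              # let whisper's kernels finish first
torch.xpu.empty_cache()              # hypothetical mitigation: free whisper's cached memory

start = time.perf_counter()
print(answer("Please give a brief introduction to Shanghai."))
torch.xpu.synchronize()
print(f"chatglm .generate() stage: {time.perf_counter() - start:.2f}s")
```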