intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

ChatGLM2-6B multi-instance has unexpected performance on Arc A770 #9329

Open KiwiHana opened 10 months ago

KiwiHana commented 10 months ago

ChatGLM2-6B running multiple concurrent instances with bigdl-llm[xpu] 20231016 on an Arc A770 with a Xeon CPU.

For 32in/32out with instance=1, rest-token latency is 20.5 ms/token. With instance=2, it jumps to 224.5 ms/token, roughly 11x slower.

Test script: https://github.com/biyuehuang/LLM_Arc_UI/blob/main/gpu_benchmark/test_chatglm2.py
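For reference, here is a minimal sketch of how such a first-token / rest-token measurement typically looks with bigdl-llm on XPU. This is an assumption about what test_chatglm2.py does, not its actual code: the prompt is a placeholder, load_low_bit assumes the directory holds a pre-converted INT4 checkpoint, and the rest-token average is derived by subtracting one first-token cost from a full run.

import os
import time
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer

path = os.path.expanduser("~/llm/chatglm2-6b-int4")
# assumption: the directory is a saved low-bit model with tokenizer files alongside
model = AutoModel.load_low_bit(path, trust_remote_code=True).to("xpu")
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids.to("xpu")

with torch.inference_mode():
    model.generate(input_ids, max_new_tokens=32)   # warmup, not timed
    torch.xpu.synchronize()
    t0 = time.perf_counter()
    model.generate(input_ids, max_new_tokens=1)    # prefill + first token
    torch.xpu.synchronize()
    t1 = time.perf_counter()
    model.generate(input_ids, max_new_tokens=32)   # full 32-token run
    torch.xpu.synchronize()
    t2 = time.perf_counter()

first = t1 - t0
rest = (t2 - t1 - first) / 31  # subtract one first-token cost, average the rest
print(f"First token cost {first:.4f}s")
print(f"Rest tokens cost average {rest:.4f}s (31 tokens in all)")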

conda activate llm-test
source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export LD_PRELOAD=${LD_PRELOAD}:/home/adc-a770/miniconda3/envs/llm-test/lib/libtcmalloc.so
python test_chatglm2.py -m ~/llm/chatglm2-6b-int4 --input-tokens 32 --max-new-tokens 32 | grep -ie "First token" -ie "Rest tokens" &
python test_chatglm2.py -m ~/llm/chatglm2-6b-int4 --input-tokens 32 --max-new-tokens 32 | grep -ie "First token" -ie "Rest tokens"
=========First token cost 1.4945s=========
=========Rest tokens cost average 0.2556s (31 tokens in all)=========
=========First token cost 0.4987s=========
=========Rest tokens cost average 0.2404s (31 tokens in all)=========
=========First token cost 0.5385s=========
=========Rest tokens cost average 0.2404s (31 tokens in all)=========
=========First token cost 0.5683s=========
=========Rest tokens cost average 0.2242s (31 tokens in all)=========
=========First token cost 0.5682s=========
=========Rest tokens cost average 0.2248s (31 tokens in all)=========
=========First token cost 1.7827s=========
=========Rest tokens cost average 0.2822s (31 tokens in all)=========
=========First token cost 0.8250s=========
=========Rest tokens cost average 0.2424s (31 tokens in all)=========
=========First token cost 0.4940s=========
=========Rest tokens cost average 0.2239s (31 tokens in all)=========
=========First token cost 0.5682s=========
=========Rest tokens cost average 0.2242s (31 tokens in all)=========
=========First token cost 0.5782s=========
=========Rest tokens cost average 0.1865s (31 tokens in all)=========
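Note that with both processes writing to the same terminal, the result pairs above are interleaved and individual lines cannot be attributed to a particular instance. A small variation of the command that labels each instance's output (the [inst1]/[inst2] tags are arbitrary):

python test_chatglm2.py -m ~/llm/chatglm2-6b-int4 --input-tokens 32 --max-new-tokens 32 | grep -ie "First token" -ie "Rest tokens" | sed 's/^/[inst1] /' &
python test_chatglm2.py -m ~/llm/chatglm2-6b-int4 --input-tokens 32 --max-new-tokens 32 | grep -ie "First token" -ie "Rest tokens" | sed 's/^/[inst2] /'
wait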
liu-shaojun commented 10 months ago

Hi @rnwang04 could you help on this issue?

rnwang04 commented 10 months ago

Hi @KiwiHana, I tested this case on my local Arc machine and I can't reproduce your results: ChatGLM2-6B with bigdl-llm[xpu] 20231101 on an Arc A770 with an i9-12900K. Environment settings:

export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1

For 32in/32out with instance=1, rest-token latency is 20.1 ms/token. With instance=2, rest-token latency is 32.4 ms/token.

Would you mind testing it again without tcmalloc? Or maybe this mismatch is caused by a difference in CPU frequency?
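For what it's worth, a quick way to run both checks (these are standard commands and sysfs paths; adjust the cpu index as needed):

unset LD_PRELOAD   # or start a fresh shell so libtcmalloc is no longer preloaded
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # e.g. "powersave" vs "performance"
watch -n1 "grep MHz /proc/cpuinfo | head"   # optional: watch actual clocks during the run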