intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

ChatGLM2-6B multi-instance has unexpected performance on Arc A770 #9329

Open KiwiHana opened 10 months ago

KiwiHana commented 10 months ago

ChatGLM2-6B running multiple concurrent instances with bigdl-llm[xpu] 20231016 on an Arc A770 with a Xeon CPU.

For 32in/32out with instance=1, rest-token latency is 20.5 ms/token. With instance=2, it jumps to 224.5 ms/token, roughly 11x slower.

Test script: https://github.com/biyuehuang/LLM_Arc_UI/blob/main/gpu_benchmark/test_chatglm2.py
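For reference, here is a minimal sketch of how such a first-token / rest-token measurement typically looks with bigdl-llm on XPU. This is an assumption about what test_chatglm2.py does, not its actual code: the prompt is a placeholder, load_low_bit assumes the directory holds a pre-converted INT4 checkpoint, and the rest-token average is derived by subtracting one first-token cost from a full run.

import os
import time
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer

path = os.path.expanduser("~/llm/chatglm2-6b-int4")
# assumption: the directory is a saved low-bit model with tokenizer files alongside
model = AutoModel.load_low_bit(path, trust_remote_code=True).to("xpu")
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids.to("xpu")

with torch.inference_mode():
    model.generate(input_ids, max_new_tokens=32)   # warmup, not timed
    torch.xpu.synchronize()
    t0 = time.perf_counter()
    model.generate(input_ids, max_new_tokens=1)    # prefill + first token
    torch.xpu.synchronize()
    t1 = time.perf_counter()
    model.generate(input_ids, max_new_tokens=32)   # full 32-token run
    torch.xpu.synchronize()
    t2 = time.perf_counter()

first = t1 - t0
rest = (t2 - t1 - first) / 31  # subtract one first-token cost, average the rest
print(f"First token cost {first:.4f}s")
print(f"Rest tokens cost average {rest:.4f}s (31 tokens in all)")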

conda activate llm-test
source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export LD_PRELOAD=${LD_PRELOAD}:/home/adc-a770/miniconda3/envs/llm-test/lib/libtcmalloc.so
python test_chatglm2.py -m ~/llm/chatglm2-6b-int4 --input-tokens 32 --max-new-tokens 32 | grep -ie "First token" -ie "Rest tokens" &
python test_chatglm2.py -m ~/llm/chatglm2-6b-int4 --input-tokens 32 --max-new-tokens 32 | grep -ie "First token" -ie "Rest tokens"
=========First token cost 1.4945s=========
=========Rest tokens cost average 0.2556s (31 tokens in all)=========
=========First token cost 0.4987s=========
=========Rest tokens cost average 0.2404s (31 tokens in all)=========
=========First token cost 0.5385s=========
=========Rest tokens cost average 0.2404s (31 tokens in all)=========
=========First token cost 0.5683s=========
=========Rest tokens cost average 0.2242s (31 tokens in all)=========
=========First token cost 0.5682s=========
=========Rest tokens cost average 0.2248s (31 tokens in all)=========
=========First token cost 1.7827s=========
=========Rest tokens cost average 0.2822s (31 tokens in all)=========
=========First token cost 0.8250s=========
=========Rest tokens cost average 0.2424s (31 tokens in all)=========
=========First token cost 0.4940s=========
=========Rest tokens cost average 0.2239s (31 tokens in all)=========
=========First token cost 0.5682s=========
=========Rest tokens cost average 0.2242s (31 tokens in all)=========
=========First token cost 0.5782s=========
=========Rest tokens cost average 0.1865s (31 tokens in all)=========
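Note that with both processes writing to the same terminal, the result pairs above are interleaved and individual lines cannot be attributed to a particular instance. A small variation of the command that labels each instance's output (the [inst1]/[inst2] tags are arbitrary):

python test_chatglm2.py -m ~/llm/chatglm2-6b-int4 --input-tokens 32 --max-new-tokens 32 | grep -ie "First token" -ie "Rest tokens" | sed 's/^/[inst1] /' &
python test_chatglm2.py -m ~/llm/chatglm2-6b-int4 --input-tokens 32 --max-new-tokens 32 | grep -ie "First token" -ie "Rest tokens" | sed 's/^/[inst2] /'
wait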
liu-shaojun commented 10 months ago

Hi @rnwang04 could you help on this issue?

rnwang04 commented 10 months ago

Hi @KiwiHana, I tested this case on my local Arc machine and I can't reproduce your results: ChatGLM2-6B with bigdl-llm[xpu] 20231101 on an Arc A770 with an i9-12900K. Environment settings:

export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1

For 32in/32out with instance=1, rest-token latency is 20.1 ms/token. With instance=2, rest-token latency is 32.4 ms/token.

Would you mind testing it again without tcmalloc? Or maybe this mismatch is caused by a difference in CPU frequency?
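For what it's worth, a quick way to run both checks (these are standard commands and sysfs paths; adjust the cpu index as needed):

unset LD_PRELOAD   # or start a fresh shell so libtcmalloc is no longer preloaded
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor   # e.g. "powersave" vs "performance"
watch -n1 "grep MHz /proc/cpuinfo | head"   # optional: watch actual clocks during the run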