intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.

The performance gap between generate.py and all-in-one with batch 1 is too big #10465

Closed: Fred-cell closed this issue 6 months ago

Fred-cell commented 7 months ago

Benchmarked chatglm3-6b with generate.py and W4A16; the results are in the attached screenshot.

Benchmarked chatglm3-6b with the all-in-one benchmark and W4A16; the results are in the attached screenshot.
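
For reference, below is a minimal sketch of what a generate.py-style W4A16 (symmetric INT4) run on an Intel GPU looks like with ipex-llm. The model path, prompt, and token count are placeholders, not the exact settings used for the numbers above:

```python
import time
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401 -- registers the "xpu" device
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModel

MODEL_PATH = "THUDM/chatglm3-6b"  # placeholder: local path or HF repo id

# Load with 4-bit weights (sym_int4, i.e. W4A16-style) and move to the Intel GPU.
model = AutoModel.from_pretrained(MODEL_PATH,
                                  load_in_4bit=True,
                                  trust_remote_code=True)
model = model.to("xpu")
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

prompt = "What is AI?"
input_ids = tokenizer.encode(prompt, return_tensors="pt").to("xpu")

with torch.inference_mode():
    # Warm-up run so kernel compilation is not counted in the timing.
    model.generate(input_ids, max_new_tokens=32)
    torch.xpu.synchronize()
    start = time.perf_counter()
    output = model.generate(input_ids, max_new_tokens=32)
    torch.xpu.synchronize()
    end = time.perf_counter()

print(tokenizer.decode(output[0], skip_special_tokens=True))
print(f"{output.shape[1] - input_ids.shape[1]} new tokens in {end - start:.3f}s")
```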

chtanch commented 7 months ago

Tested on Arc A770; i9 13900K

I obtained similar 1st and 2nd token latencies for both run.py and generate.py.

For the all-in-one benchmark:
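
The all-in-one benchmark reports 1st-token and next-token latency separately. As a rough illustration only (the actual harness measures per-token timings internally), the same split can be approximated with two generate() calls; `measure_latencies` below is a hypothetical helper, not part of the benchmark:

```python
import time
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401 -- provides torch.xpu


def measure_latencies(model, input_ids, new_tokens=32):
    """Rough split of 1st-token vs. average next-token latency on XPU."""
    with torch.inference_mode():
        torch.xpu.synchronize()
        t0 = time.perf_counter()
        model.generate(input_ids, max_new_tokens=1)          # prefill + 1st token
        torch.xpu.synchronize()
        t1 = time.perf_counter()
        model.generate(input_ids, max_new_tokens=new_tokens)  # prefill + decode
        torch.xpu.synchronize()
        t2 = time.perf_counter()

    first_token = t1 - t0
    # Subtract the prefill/1st-token cost measured above, then average over
    # the remaining (new_tokens - 1) decoded tokens of the second run.
    next_token = ((t2 - t1) - first_token) / (new_tokens - 1)
    return first_token, next_token
```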

hkvision commented 7 months ago

Confirmed: this is due to running on kernel 6.5 with export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 set.

Fix run-arc.sh in https://github.com/intel-analytics/BigDL/pull/10498

Fred-cell commented 7 months ago

Qwen-7B-Chat has the same issue with version 2.5.0b20240322.

hkvision commented 6 months ago

You need to unset SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS (i.e., do not set it to 1).

hkvision commented 6 months ago

With https://github.com/intel-analytics/ipex-llm/pull/10566, the problematic environment variable is no longer set for kernel 6.5. Issue fixed.
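
The behavior described above can be pictured as the Python sketch below; the actual change lives in the benchmark shell scripts (e.g. run-arc.sh), and the helper name here is hypothetical:

```python
import os
import platform


def maybe_enable_immediate_command_lists():
    """Sketch of the described fix: leave the variable unset on kernel 6.5,
    where setting it causes the latency gap reported in this issue."""
    release = platform.release()  # e.g. "6.5.0-15-generic"
    major, minor = (int(x) for x in release.split(".")[:2])
    if (major, minor) == (6, 5):
        os.environ.pop("SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS", None)
    else:
        os.environ["SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS"] = "1"
```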