intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.

[Max1100/bigdl-llm] Easily hit OOM when running Deci/DeciLM-7B int4/fp8 multi-batch (bs=8) on bigdl-llm=2.5.0b20240124, while bigdl-llm=2.5.0b20240118 could support up to bs=150 #9994

Open Yanli2190 opened 5 months ago

Yanli2190 commented 5 months ago

When running Deci/DeciLM-7B int4/fp8 multi-batch on Max1100 and comparing bigdl-llm=2.5.0b20240124 against bigdl-llm=2.5.0b20240118, single-batch latency improved from 12.3 ms to 9.6 ms for 512/512, but multi-batch now hits OOM easily: bigdl-llm=2.5.0b20240118 supported up to bs=150, while bigdl-llm=2.5.0b20240124 runs out of memory even at bs=8.

HW: Max1100
OS: Ubuntu 22.04
SW: oneAPI 2024.0 / bigdl-llm 2.5.0b20240118 based on torch 2.1
GPU driver: https://dgpu-docs.intel.com/releases/stable_775_20_20231219.html

How to reproduce:

1. Create a conda env and install bigdl-llm via "pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu"
2. Run the attached run.sh on Max1100
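For reference, a minimal sketch of what a multi-batch int4 generation run with bigdl-llm on XPU looks like (this is not the attached benchmark script; the prompt, padding workaround, and generation settings are illustrative assumptions):

```python
# Hedged sketch of a multi-batch bigdl-llm int4 run on XPU (not the attached script).
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device
from transformers import AutoTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM

MODEL_ID = "Deci/DeciLM-7B"
BATCH_SIZE = 8        # the batch size that OOMs on 2.5.0b20240124
MAX_NEW_TOKENS = 512  # the 512/512 case from the report

# load_in_4bit=True gives int4 weights; an fp8 variant can be selected via load_in_low_bit.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, load_in_4bit=True, trust_remote_code=True
).to("xpu")

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
if tokenizer.pad_token is None:          # assumption: pad with EOS so batching works
    tokenizer.pad_token = tokenizer.eos_token

# Illustrative prompt; the real benchmark feeds 512 input tokens per sequence.
prompts = ["Once upon a time"] * BATCH_SIZE
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("xpu")

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS)
torch.xpu.synchronize()
print(tokenizer.batch_decode(output, skip_special_tokens=True)[0])
```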

Yanli2190 commented 5 months ago

Attachments: run.txt, benchmark_hf_model_bigdl.txt

qiuxin2012 commented 5 months ago

I have run your script (bs=8) on both 0118 and 0124, and it is indeed a problem: peak memory increased from 6.6 GB (0118) to 12.1 GB (0124). I will look into it.
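For anyone reproducing the comparison, peak device memory can be read from the XPU runtime; below is a small sketch, assuming the torch.xpu memory statistics exposed by intel_extension_for_pytorch are available in this environment:

```python
# Hedged sketch: measuring peak XPU memory around a workload, assuming the
# torch.xpu memory statistics exposed by intel_extension_for_pytorch.
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers torch.xpu)

def peak_xpu_memory_gb(run_workload) -> float:
    """Reset peak stats, run the workload, and return peak allocated XPU memory in GB."""
    torch.xpu.reset_peak_memory_stats()
    run_workload()
    torch.xpu.synchronize()
    return torch.xpu.max_memory_allocated() / (1024 ** 3)

# Usage with the generate() call from the sketch above (model/inputs assumed defined):
# peak = peak_xpu_memory_gb(lambda: model.generate(**inputs, max_new_tokens=512))
# print(f"peak XPU memory: {peak:.1f} GB")  # ~6.6 GB on 0118 vs ~12.1 GB on 0124 per this report
```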