Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0
6.25k stars, 1.22k forks
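For context, loading a HuggingFace model through bigdl-llm's int4 optimization on an Intel GPU (XPU) looks roughly like the sketch below. This is a minimal illustration of the library's Transformers-style API; the model path, prompt, and generation settings are placeholders rather than anything taken from this issue.

```python
# Minimal sketch (placeholders, not from this issue): load a HuggingFace model
# with bigdl-llm int4 weight quantization and run it on an Intel GPU (XPU).
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  # registers the 'xpu' device
from transformers import AutoTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM

model_path = "Deci/DeciLM-7B"  # placeholder; any HF causal LM should work

# load_in_4bit=True applies bigdl-llm's int4 weight-only quantization at load time
model = AutoModelForCausalLM.from_pretrained(
    model_path, load_in_4bit=True, trust_remote_code=True
).to("xpu")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

inputs = tokenizer("What is AI?", return_tensors="pt").to("xpu")
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```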
[Max1100/bigdl-llm] Met OOM easily when running Deci/DeciLM-7B int4/fp8 multi-batch (bs=8) on bigdl-llm=2.5.0b20240124, while it could support up to bs=150 on bigdl-llm=2.5.0b20240118 #9994
When running Deci/DeciLM-7B int4/fp8 multi-batch on Max1100 and comparing bigdl-llm=2.5.0b20240124 against bigdl-llm=2.5.0b20240118, single-batch latency for 512/512 improved from 12.3 ms to 9.6 ms, but multi-batch now runs out of memory easily: bigdl-llm=2.5.0b20240118 supports up to bs=150, while bigdl-llm=2.5.0b20240124 hits OOM even at bs=8.
HW: Max1100
OS: Ubuntu 22.04
SW: oneAPI 2024.0/bigdl-llm 2.5.0b20240118 based on torch 2.1
GPU driver: https://dgpu-docs.intel.com/releases/stable_775_20_20231219.html
How to reproduce:
Create a conda env, install bigdl-llm via "pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu", then run the attached run.sh on Max1100.

I have run your script (bs=8) on both 0118 and 0124, and it is indeed a problem: peak memory increased from 6.6 GB (0118) to 12.1 GB (0124).
I will look into it.
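For reference, a multi-batch latency and peak-memory check along the lines described above might look like the following sketch. The attached run.sh is not reproduced in the issue, so the batching, prompt construction, timing, and memory-measurement details here are assumptions, not the actual benchmark.

```python
# Hypothetical sketch of a bs=8, 512-in/512-out run on XPU with peak-memory
# reporting; the real run.sh attached to the issue may differ substantially.
import time
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  # registers the 'xpu' device
from transformers import AutoTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM

model_path = "Deci/DeciLM-7B"   # placeholder path
batch_size = 8                  # batch size that OOMs on 2.5.0b20240124
in_len, out_len = 512, 512      # the 512/512 configuration from the issue

model = AutoModelForCausalLM.from_pretrained(
    model_path, load_in_4bit=True, trust_remote_code=True
).to("xpu")
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Batch of identical ~512-token prompts; the prompt content is irrelevant for timing.
prompt = "hello " * in_len
inputs = tokenizer(
    [prompt] * batch_size, return_tensors="pt", truncation=True, max_length=in_len
).to("xpu")

torch.xpu.reset_peak_memory_stats()  # assumes IPEX exposes CUDA-like memory stats on 'xpu'
with torch.inference_mode():
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=out_len, min_new_tokens=out_len, do_sample=False)
    torch.xpu.synchronize()
    elapsed = time.perf_counter() - start

print(f"generate time for bs={batch_size}: {elapsed:.2f}s")
print(f"peak XPU memory: {torch.xpu.max_memory_allocated() / 1e9:.1f} GB")
```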