intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.

[Max1100/bigdl-llm] Easily hit OOM when running Deci/DeciLM-7B int4/fp8 multi-batch (bs=8) on bigdl-llm=2.5.0b20240124, while bigdl-llm=2.5.0b20240118 could support up to bs=150 #9994

Open Yanli2190 opened 5 months ago

Yanli2190 commented 5 months ago

When running Deci/DeciLM-7B int4/fp8 multi-batch on Max1100 and comparing bigdl-llm=2.5.0b20240124 against bigdl-llm=2.5.0b20240118, single-batch latency improved from 12.3 ms to 9.6 ms for 512/512, but multi-batch now hits OOM easily: bigdl-llm=2.5.0b20240118 supported up to bs=150, while bigdl-llm=2.5.0b20240124 runs out of memory even at bs=8.

HW: Max1100
OS: Ubuntu 22.04
SW: oneAPI 2024.0 / bigdl-llm 2.5.0b20240118 based on torch 2.1
GPU driver: https://dgpu-docs.intel.com/releases/stable_775_20_20231219.html

How to reproduce:

1. Create a conda env and install bigdl-llm via "pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu"
2. Run the attached run.sh on Max1100
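For reference, a minimal sketch of what a multi-batch int4 generation run with bigdl-llm on XPU looks like (this is not the attached benchmark script; the prompt, padding workaround, and generation settings are illustrative assumptions):

```python
# Hedged sketch of a multi-batch bigdl-llm int4 run on XPU (not the attached script).
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device
from transformers import AutoTokenizer
from bigdl.llm.transformers import AutoModelForCausalLM

MODEL_ID = "Deci/DeciLM-7B"
BATCH_SIZE = 8        # the batch size that OOMs on 2.5.0b20240124
MAX_NEW_TOKENS = 512  # the 512/512 case from the report

# load_in_4bit=True gives int4 weights; an fp8 variant can be selected via load_in_low_bit.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, load_in_4bit=True, trust_remote_code=True
).to("xpu")

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
if tokenizer.pad_token is None:          # assumption: pad with EOS so batching works
    tokenizer.pad_token = tokenizer.eos_token

# Illustrative prompt; the real benchmark feeds 512 input tokens per sequence.
prompts = ["Once upon a time"] * BATCH_SIZE
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("xpu")

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=MAX_NEW_TOKENS)
torch.xpu.synchronize()
print(tokenizer.batch_decode(output, skip_special_tokens=True)[0])
```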

Yanli2190 commented 5 months ago

Attachments: run.txt, benchmark_hf_model_bigdl.txt

qiuxin2012 commented 5 months ago

I have run your script (bs=8) on both 0118 and 0124, and it is indeed a problem: peak memory increased from 6.6 GB (0118) to 12.1 GB (0124). I will look into it.
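For anyone reproducing the comparison, peak device memory can be read from the XPU runtime; below is a small sketch, assuming the torch.xpu memory statistics exposed by intel_extension_for_pytorch are available in this environment:

```python
# Hedged sketch: measuring peak XPU memory around a workload, assuming the
# torch.xpu memory statistics exposed by intel_extension_for_pytorch.
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers torch.xpu)

def peak_xpu_memory_gb(run_workload) -> float:
    """Reset peak stats, run the workload, and return peak allocated XPU memory in GB."""
    torch.xpu.reset_peak_memory_stats()
    run_workload()
    torch.xpu.synchronize()
    return torch.xpu.max_memory_allocated() / (1024 ** 3)

# Usage with the generate() call from the sketch above (model/inputs assumed defined):
# peak = peak_xpu_memory_gb(lambda: model.generate(**inputs, max_new_tokens=512))
# print(f"peak XPU memory: {peak:.1f} GB")  # ~6.6 GB on 0118 vs ~12.1 GB on 0124 per this report
```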