Open AlphaINF opened 1 month ago
Indeed, you can see that llama.cpp loads and quantizes the model weights incrementally here:
This behavior is well documented at: https://docs.vllm.ai/en/latest/quantization/fp8.html .
Currently, we load the model at its original precision before quantizing down to 8 bits, so you need enough memory to load the whole model.
That said, processing the layers incrementally would be much preferable.
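To make the memory requirement concrete, here is a back-of-the-envelope sketch (the parameter counts are approximations, not figures from the vLLM docs): loading the weights at fp16 costs 2 bytes per parameter before FP8 quantization can even begin.

```python
# Rough arithmetic only -- assumed parameter count, not vLLM internals.
params = 72_000_000_000          # Qwen2-72B, approximate parameter count
fp16_bytes = 2 * params          # weights at original (fp16) precision
fp8_bytes = 1 * params           # weights after FP8 quantization

print(fp16_bytes / 1e9)  # 144.0 -- GB needed just to load; exceeds one 80 GB GPU
print(fp8_bytes / 1e9)   # 72.0  -- GB after quantization; would fit
```

So even though the quantized model fits on a single 80 GB card, the current load-then-quantize path never gets that far.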
Your current environment
```
(venv-vllm-54) (base) root@I1ba088648b009018e4:/hy-tmp# nvidia-smi
Tue Aug  6 10:29:16 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A800 80GB PCIe          Off | 00000000:6B:00.0 Off |                    0 |
| N/A   31C    P0              42W / 300W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
```
```
aiohappyeyeballs                  2.3.4
aiohttp                           3.10.1
aiosignal                         1.3.1
annotated-types                   0.7.0
anyio                             4.4.0
async-timeout                     4.0.3
attrs                             24.1.0
certifi                           2024.7.4
charset-normalizer                3.3.2
click                             8.1.7
cloudpickle                       3.0.0
cmake                             3.30.2
datasets                          2.20.0
dill                              0.3.8
diskcache                         5.6.3
distro                            1.9.0
exceptiongroup                    1.2.2
fastapi                           0.112.0
filelock                          3.15.4
frozenlist                        1.4.1
fsspec                            2024.5.0
h11                               0.14.0
httpcore                          1.0.5
httptools                         0.6.1
httpx                             0.27.0
huggingface-hub                   0.24.5
idna                              3.7
importlib-metadata                8.2.0
importlib-resources               6.4.0
interegular                       0.3.3
jinja2                            3.1.4
jsonschema                        4.23.0
jsonschema-specifications         2023.12.1
lark                              1.1.9
llvmlite                          0.41.1
lm-format-enforcer                0.10.3
MarkupSafe                        2.1.5
mpmath                            1.3.0
msgpack                           1.0.8
multidict                         6.0.5
multiprocess                      0.70.16
nest-asyncio                      1.6.0
networkx                          3.1
ninja                             1.11.1.1
numba                             0.58.1
numpy                             1.24.4
nvidia-cublas-cu12                12.1.3.1
nvidia-cuda-cupti-cu12            12.1.105
nvidia-cuda-nvrtc-cu12            12.1.105
nvidia-cuda-runtime-cu12          12.1.105
nvidia-cudnn-cu12                 9.1.0.70
nvidia-cufft-cu12                 11.0.2.54
nvidia-curand-cu12                10.3.2.106
nvidia-cusolver-cu12              11.4.5.107
nvidia-cusparse-cu12              12.1.0.106
nvidia-ml-py                      12.555.43
nvidia-nccl-cu12                  2.20.5
nvidia-nvjitlink-cu12             12.6.20
nvidia-nvtx-cu12                  12.1.105
openai                            1.39.0
outlines                          0.0.46
packaging                         24.1
pandas                            2.0.3
pillow                            10.4.0
pip                               21.1.1
pkgutil-resolve-name              1.3.10
prometheus-client                 0.20.0
prometheus-fastapi-instrumentator 7.0.0
protobuf                          5.27.3
psutil                            6.0.0
py-cpuinfo                        9.0.0
pyairports                        2.1.1
pyarrow                           17.0.0
pyarrow-hotfix                    0.6
pycountry                         24.6.1
pydantic                          2.8.2
pydantic-core                     2.20.1
python-dateutil                   2.9.0.post0
python-dotenv                     1.0.1
pytz                              2024.1
PyYAML                            6.0.1
pyzmq                             26.1.0
ray                               2.10.0
referencing                       0.35.1
regex                             2024.7.24
requests                          2.32.3
rpds-py                           0.19.1
safetensors                       0.4.4
sentencepiece                     0.2.0
setuptools                        56.0.0
six                               1.16.0
sniffio                           1.3.1
starlette                         0.37.2
sympy                             1.13.1
tiktoken                          0.7.0
tokenizers                        0.19.1
torch                             2.4.0
torchvision                       0.19.0
tqdm                              4.66.5
transformers                      4.43.4
triton                            3.0.0
typing-extensions                 4.12.2
tzdata                            2024.1
urllib3                           2.2.2
uvicorn                           0.30.5
uvloop                            0.19.0
vllm                              0.5.4
vllm-flash-attn                   2.6.1
watchfiles                        0.22.0
websockets                        12.0
xformers                          0.0.27.post2
xxhash                            3.4.1
yarl                              1.9.4
zipp                              3.19.2
```
🐛 Describe the bug
I downloaded a Qwen2-72B-Instruct model onto my machine with 1x A100-80G. I tried to use the following command to start the server, but it reports out of memory.
When I test loading an fp16 model with quantization set to fp8, I see VRAM usage increase at first; once the model has finished loading, VRAM consumption drops sharply to about half of the peak (e.g., when loading Qwen2-7B, VRAM rises to 14 GB at first and falls to 7 GB after the model is quantized to fp8).
In my opinion, this bug is caused by the loading-and-quantization strategy: if the system quantized the parameters layer by layer, or loaded the full set of parameters into CPU RAM before quantizing, the problem would be solved.
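The layer-by-layer idea can be sketched with simple peak-memory arithmetic (this is an illustration of the proposal, not vLLM's actual loader; the layer count and parameter split are assumed to roughly match the Qwen2-7B observation above):

```python
# Sketch: peak GPU memory of two loading strategies when quantizing
# fp16 weights (2 bytes/param) down to fp8 (1 byte/param).

def peak_load_all_then_quantize(params_per_layer: int, n_layers: int) -> int:
    """Current behavior: every layer is resident at fp16 before quantization.
    Peak = the full fp16 model."""
    return 2 * params_per_layer * n_layers

def peak_layer_by_layer(params_per_layer: int, n_layers: int) -> int:
    """Proposed behavior: quantize each layer as it is loaded.
    Peak = all-but-one layer already at fp8, plus one fp16 layer in flight."""
    return 1 * params_per_layer * (n_layers - 1) + 2 * params_per_layer

# Hypothetical numbers: ~7e9 params spread evenly over 28 layers.
p, n = 7_000_000_000 // 28, 28
print(peak_load_all_then_quantize(p, n) / 1e9)  # 14.0 -- matches the observed 14 GB peak
print(peak_layer_by_layer(p, n) / 1e9)          # 7.25 -- close to the post-quantization 7 GB
```

With layer-wise quantization the peak is only one fp16 layer above the final fp8 footprint, which is why the observed transient spike to twice the steady-state VRAM would mostly disappear.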