vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: --cpu-offload-gb has no effect #9339

Open Rane2021 opened 1 month ago

Rane2021 commented 1 month ago

Your current environment

The output of `python collect_env.py`

How would you like to use vllm

export MODEL_PATH=/data3/models/Qwen2.5-32B-Instruct

python3 -m vllm.entrypoints.openai.api_server --model $MODEL_PATH --seed 1100 --dtype bfloat16 --trust-remote-code --port 8001 --gpu-memory-utilization 0.9 --cpu-offload-gb 30

The GPU still runs out of memory; --cpu-offload-gb appears to have no effect.
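For comparison, here is a minimal sketch of the same settings through the offline Python API (assuming cpu_offload_gb is exposed on the LLM constructor in this vLLM version); running it can help tell whether the flag is only ignored in the OpenAI server path:

from vllm import LLM, SamplingParams

llm = LLM(
    model="/data3/models/Qwen2.5-32B-Instruct",
    dtype="bfloat16",
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
    cpu_offload_gb=30,  # assumed to mean the same thing as the --cpu-offload-gb CLI flag
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)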


Rane2021 commented 1 month ago

I run this on an A6000 with 48 GB and vllm==0.5.3.

15929482853 commented 1 month ago

It also happened to me. I want to deploy glm4-chat-9b on a container with 12 GB of GPU memory, using the command below:

vllm serve /glm4-9b-chat --port 25011 --served-model-name glm4-9b-chat --dtype half --gpu-memory-utilization 0.9 --trust-remote-code --enforce-eager --enable-prefix-caching --cpu-offload-gb 12 --max-model-len 2048

GPU memory + CPU offload = 24 GB, yet I still hit an OOM error.
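A rough back-of-the-envelope estimate (my own arithmetic, not from the vLLM docs) of why this OOM is consistent with the offload flag being ignored:

# glm-4-9b-chat has roughly 9.4B parameters; --dtype half stores 2 bytes per parameter.
params = 9.4e9
weights_gib = params * 2 / 1024**3   # ~17.5 GiB of fp16 weights
print(f"fp16 weights: ~{weights_gib:.1f} GiB")
# If --cpu-offload-gb 12 took effect, only ~5.5 GiB of weights would stay on the
# 12 GiB GPU, leaving room for the KV cache at --max-model-len 2048.
# If the flag is ignored, all ~17.5 GiB of weights are placed on the GPU,
# which cannot fit in 12 GiB and matches the OOM described above.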