vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: --cpu-offload-gb has no effect #9339

Open Rane2021 opened 1 month ago

Rane2021 commented 1 month ago

Your current environment

The output of `python collect_env.py`

How would you like to use vllm

export MODEL_PATH=/data3/models/Qwen2.5-32B-Instruct

python3 -m vllm.entrypoints.openai.api_server --model $MODEL_PATH --seed 1100 --dtype bfloat16 --trust-remote-code --port 8001 --gpu-memory-utilization 0.9 --cpu-offload-gb 30

The GPU still runs out of memory; --cpu-offload-gb appears to have no effect.
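For comparison, here is a minimal sketch of the same settings through the offline Python API (assuming cpu_offload_gb is exposed on the LLM constructor in this vLLM version); running it can help tell whether the flag is only ignored in the OpenAI server path:

from vllm import LLM, SamplingParams

llm = LLM(
    model="/data3/models/Qwen2.5-32B-Instruct",
    dtype="bfloat16",
    trust_remote_code=True,
    gpu_memory_utilization=0.9,
    cpu_offload_gb=30,  # assumed to mean the same thing as the --cpu-offload-gb CLI flag
)
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)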


Rane2021 commented 1 month ago

I run this on an A6000 with 48 GB and vllm==0.5.3.

15929482853 commented 1 month ago

It also happened to me. I want to deploy glm4-chat-9b on a container with 12 GB of GPU memory, using the command below:

vllm serve /glm4-9b-chat --port 25011 --served-model-name glm4-9b-chat --dtype half --gpu-memory-utilization 0.9 --trust-remote-code --enforce-eager --enable-prefix-caching --cpu-offload-gb 12 --max-model-len 2048

GPU memory + CPU offload = 24 GB, yet I still hit an OOM error.
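A rough back-of-the-envelope estimate (my own arithmetic, not from the vLLM docs) of why this OOM is consistent with the offload flag being ignored:

# glm-4-9b-chat has roughly 9.4B parameters; --dtype half stores 2 bytes per parameter.
params = 9.4e9
weights_gib = params * 2 / 1024**3   # ~17.5 GiB of fp16 weights
print(f"fp16 weights: ~{weights_gib:.1f} GiB")
# If --cpu-offload-gb 12 took effect, only ~5.5 GiB of weights would stay on the
# 12 GiB GPU, leaving room for the KV cache at --max-model-len 2048.
# If the flag is ignored, all ~17.5 GiB of weights are placed on the GPU,
# which cannot fit in 12 GiB and matches the OOM described above.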