Rane2021 opened this issue 1 month ago
I run this on an A6000 with 48 GB and vllm==0.5.3.
This also happened to me. I want to deploy glm4-9b-chat with the command below: "vllm serve /glm4-9b-chat --port 25011 --served-model-name glm4-9b-chat --dtype half --gpu-memory-utilization 0.9 --trust-remote-code --enforce-eager --enable-prefix-caching --cpu-offload-gb 12 --max-model-len 2048" in a container with 12 GB of GPU memory. GPU memory + CPU offload = 24 GB, yet I still hit an OOM error.
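A rough back-of-envelope sketch of why this configuration is tight (the ~9.4B parameter count for glm4-9b-chat and the assumption that vLLM's usable budget is roughly the GPU share plus --cpu-offload-gb are mine, not values from the issue):

```python
# Rough memory check for the glm4-9b-chat command above.
# Assumptions (not from the issue): ~9.4B parameters, 2 bytes/param for
# --dtype half, and a usable budget of roughly
# gpu_mem * gpu_memory_utilization + cpu_offload_gb for weights + KV cache.

GIB = 1024 ** 3

weights_gib = 9.4e9 * 2 / GIB   # ~17.5 GiB of fp16 weights (assumed size)
gpu_share_gib = 12 * 0.9        # 12 GB container GPU, --gpu-memory-utilization 0.9
offload_gib = 12                # --cpu-offload-gb 12

budget_gib = gpu_share_gib + offload_gib
print(f"weights           ~{weights_gib:.1f} GiB")
print(f"total budget      ~{budget_gib:.1f} GiB")
print(f"left for KV cache ~{budget_gib - weights_gib:.1f} GiB")
# Only a few GiB remain for the KV cache, activations and the CUDA context,
# and those must fit inside the 12 GB GPU share (offloaded weights are
# streamed back to the GPU during the forward pass), so an OOM is plausible.
```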
Your current environment
How would you like to use vllm
export MODEL_PATH=/data3/models/Qwen2.5-32B-Instruct
python3 -m vllm.entrypoints.openai.api_server --model $MODEL_PATH --seed 1100 --dtype bfloat16 --trust-remote-code --port 8001 --gpu-memory-utilization 0.9 --cpu-offload-gb 30
The GPU still runs out of memory; --cpu-offload-gb seems to have no effect.
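The same back-of-envelope arithmetic for the Qwen2.5-32B-Instruct command above (a sketch only; the ~32.5B parameter count and the assumption that offloaded weights free up an equal amount of GPU-resident space are mine, not measured values):

```python
# Rough memory split for the Qwen2.5-32B-Instruct command above.
# Assumptions: ~32.5B parameters, 2 bytes/param for --dtype bfloat16,
# offloaded weights live in CPU RAM (--cpu-offload-gb) while the rest,
# plus the KV cache and activations, must fit in the GPU share.

GIB = 1024 ** 3

weights_gib = 32.5e9 * 2 / GIB   # ~60.5 GiB of bf16 weights (assumed size)
gpu_share_gib = 48 * 0.9         # A6000, --gpu-memory-utilization 0.9
offload_gib = 30                 # --cpu-offload-gb 30

resident_weights_gib = weights_gib - offload_gib
headroom_gib = gpu_share_gib - resident_weights_gib

print(f"weights total        ~{weights_gib:.1f} GiB")
print(f"resident on GPU      ~{resident_weights_gib:.1f} GiB")
print(f"GPU left for KV etc. ~{headroom_gib:.1f} GiB")
# Roughly 12 GiB of GPU headroom has to cover the KV cache at the model's
# default max_model_len plus activations and the CUDA context, which is
# tight; lowering --max-model-len or adding --enforce-eager may avoid the
# OOM, but whether --cpu-offload-gb is applied as expected in 0.5.x is not
# confirmed here.
```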