vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: GPU memory usage keeps growing over time after startup #8413

Open lxb0425 opened 2 months ago

lxb0425 commented 2 months ago

Your current environment

Setup: 2×A100. Launch command: python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 7864 --max-model-len 8000 --served-model-name chat-v2.0 --model /workspace/sdata/checkpoint-140-merged --enforce-eager --tensor-parallel-size 2 --gpu-memory-utilization 0.95

Model Input Dumps


🐛 Describe the bug

After startup, GPU memory usage grows larger and larger as the server is used, and it eventually crashes.


DarkLight1337 commented 2 months ago

Does it keep increasing until OOM if you leave the server idle?

lxb0425 commented 2 months ago

It stays fine while idle; the memory only keeps growing under continuous requests. It has just increased again (screenshot).
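To confirm this pattern (flat while idle, growing under load), one way is to poll `nvidia-smi` while sending traffic and check whether used memory trends monotonically upward. A minimal sketch — the polling interval, the GPU index, and the `min_growth_mib` threshold are all assumptions, not anything from this thread:

```python
import subprocess
import time


def sample_gpu_memory_mib(gpu_index=0):
    """Return the currently used memory (MiB) for one GPU via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    )
    return int(out.strip())


def looks_like_leak(samples, min_growth_mib=100):
    """Heuristic: usage is trending up if no sample ever drops below the
    previous one and the total growth exceeds min_growth_mib (assumed
    threshold). A dip between samples suggests memory is being freed."""
    monotonic = all(b >= a for a, b in zip(samples, samples[1:]))
    return monotonic and (samples[-1] - samples[0]) >= min_growth_mib


if __name__ == "__main__":
    # Poll every 60 s while the server handles requests (interval assumed).
    samples = []
    for _ in range(10):
        samples.append(sample_gpu_memory_mib())
        time.sleep(60)
    print("possible leak:", looks_like_leak(samples))
```

Running this once while the server sits idle and once under sustained load would make the comparison above concrete.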

DarkLight1337 commented 2 months ago

@youkaichao @robertgshaw2-neuralmagic any idea about this?

gongjl123 commented 3 hours ago

I ran into the same problem. How did you solve it in the end? My launch command: python -m vllm.entrypoints.openai.api_server --model /home/fitech/qianwen2.5/qianwen2.5-14b-int4/qianwen2.5-14b-int4 --trust-remote-code --served-model-name Qwen2.5-14B-Instruct-GPTQ-INT4 --gpu-memory-utilization 0.6 --max-model-len=2048 --port 8788