vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: automatically release graphics card memory #9903

Open turkeymz opened 6 days ago

turkeymz commented 6 days ago

🚀 The feature, motivation and pitch

I use vllm.entrypoints.openai.api_server to start my large model; the exact command is as follows:

python3 -m vllm.entrypoints.openai.api_server  --model /data/bertmodel/Qwen/Qwen2.5-32B-Instruct  --served-model-name Yi-1.5-34B-Chat --max_model_len 20000 --enable-auto-tool-choice --tool-call-parser hermes
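For reference (not part of the original report), the reserved memory can be watched from a separate process while the server sits idle; a minimal sketch using nvidia-ml-py (pynvml), assuming the model lives on GPU 0:

```python
# Minimal sketch: poll GPU 0 every 10 seconds and print how much memory is in use.
# Assumes nvidia-ml-py (pynvml) is installed and the model occupies device 0.
import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    while True:
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU 0: {mem.used / 1024**3:.1f} GiB used / {mem.total / 1024**3:.1f} GiB total")
        time.sleep(10)
finally:
    pynvml.nvmlShutdown()
```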

However, I found that when no inference requests are being served, the service does not release GPU memory. I am worried this may lead to OOM problems later, so I hope code can be added that automatically releases GPU memory when the model is idle.

(Attached screenshot: 微信截图_20241101032652)
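For comparison, here is a rough sketch of the manual release that is sometimes used with offline inference (the vllm.LLM class); this is not the api_server code path, and how much memory actually comes back depends on the vLLM version:

```python
# Rough sketch of a manual workaround for *offline* inference, not api_server:
# drop every reference to the engine, then ask PyTorch to return cached CUDA
# memory to the driver. Model path and max_model_len mirror the command above.
import gc

import torch
from vllm import LLM

llm = LLM(model="/data/bertmodel/Qwen/Qwen2.5-32B-Instruct", max_model_len=20000)
# ... run inference with llm.generate(...) ...

del llm                   # drop the engine, its weights, and KV cache references
gc.collect()              # collect anything still holding CUDA tensors
torch.cuda.empty_cache()  # hand cached blocks back to the CUDA driver
```

The api_server process, by contrast, keeps its weights and pre-allocated KV cache for its whole lifetime, which is the behaviour this feature request asks to change.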

Alternatives

No response

Additional context

No response


DarkLight1337 commented 6 days ago

See #5716