vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: automatically release graphics card memory #9903

Open turkeymz opened 6 days ago

turkeymz commented 6 days ago

🚀 The feature, motivation and pitch

I use vllm.entrypoints.openai.api_server to start my large model; the exact command is as follows:

python3 -m vllm.entrypoints.openai.api_server  --model /data/bertmodel/Qwen/Qwen2.5-32B-Instruct  --served-model-name Yi-1.5-34B-Chat --max_model_len 20000 --enable-auto-tool-choice --tool-call-parser hermes
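For reference (not part of the original report), the reserved memory can be watched from a separate process while the server sits idle; a minimal sketch using nvidia-ml-py (pynvml), assuming the model lives on GPU 0:

```python
# Minimal sketch: poll GPU 0 every 10 seconds and print how much memory is in use.
# Assumes nvidia-ml-py (pynvml) is installed and the model occupies device 0.
import time

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    while True:
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU 0: {mem.used / 1024**3:.1f} GiB used / {mem.total / 1024**3:.1f} GiB total")
        time.sleep(10)
finally:
    pynvml.nvmlShutdown()
```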

However, I found that when no inference requests are being served, the service does not release GPU memory. I am worried this may lead to OOM problems later, so I hope code can be added that automatically releases GPU memory when the model is idle.

(Attached screenshot: 微信截图_20241101032652)
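For comparison, here is a rough sketch of the manual release that is sometimes used with offline inference (the vllm.LLM class); this is not the api_server code path, and how much memory actually comes back depends on the vLLM version:

```python
# Rough sketch of a manual workaround for *offline* inference, not api_server:
# drop every reference to the engine, then ask PyTorch to return cached CUDA
# memory to the driver. Model path and max_model_len mirror the command above.
import gc

import torch
from vllm import LLM

llm = LLM(model="/data/bertmodel/Qwen/Qwen2.5-32B-Instruct", max_model_len=20000)
# ... run inference with llm.generate(...) ...

del llm                   # drop the engine, its weights, and KV cache references
gc.collect()              # collect anything still holding CUDA tensors
torch.cuda.empty_cache()  # hand cached blocks back to the CUDA driver
```

The api_server process, by contrast, keeps its weights and pre-allocated KV cache for its whole lifetime, which is the behaviour this feature request asks to change.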

Alternatives

No response

Additional context

No response


DarkLight1337 commented 6 days ago

See #5716