🚀 The feature, motivation and pitch
I use vllm.entrypoints.openai.api_server to start my large model, and the specific command is as follows:
However, I found that when no inference requests are running, the service does not automatically release GPU memory. I am worried this may lead to OOM problems later, so I hope an option can be added to automatically release GPU memory while the model is idle.
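To make the request concrete, here is a minimal sketch of the kind of behavior I mean. It is not based on vLLM's internal API; it assumes a generic PyTorch-backed service, and the names IDLE_SECONDS, mark_request, and maybe_release_gpu_memory are hypothetical, for illustration only.

```python
import gc
import time

import torch

IDLE_SECONDS = 300  # hypothetical idle threshold before freeing memory
_last_request_time = time.monotonic()


def mark_request() -> None:
    """Call this whenever an inference request is served."""
    global _last_request_time
    _last_request_time = time.monotonic()


def maybe_release_gpu_memory() -> None:
    """Free cached GPU memory if the service has been idle long enough.

    This only returns cached allocator blocks to the driver; memory held by
    live tensors (model weights, KV cache) is not affected.
    """
    if time.monotonic() - _last_request_time < IDLE_SECONDS:
        return
    gc.collect()                  # drop unreachable Python objects first
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # release cached but unused GPU blocks
        print(f"reserved after release: {torch.cuda.memory_reserved()} bytes")
```

As far as I understand, much of the reserved memory is vLLM's pre-allocated KV cache (sized by --gpu-memory-utilization), so a cache flush like this may not be enough on its own; fully releasing memory while idle might require unloading and reloading the engine.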
Alternatives
No response
Additional context
No response
Before submitting a new issue...
[X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.