Open zhaotyer opened 5 months ago
Is anyone investigating this issue?
May be fixed by #5355.
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Your current environment
🐛 Describe the bug
The setup is Triton Inference Server + vLLM; the backend code is at https://github.com/triton-inference-server/vllm_backend/blob/main/src/model.py.

GPU memory usage after model loading: [screenshot]

The model runner already executes self.model_runner.profile_run() before block allocation and uses the result to calculate peak_memory. Why does GPU memory still OOM during inference?
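For context, here is a minimal sketch of the kind of budgeting that profile_run enables: run one worst-case forward pass, record the peak allocation, and hand whatever remains under the gpu_memory_utilization cap to KV-cache blocks. This is not vLLM's actual implementation; the helper name, its parameters, and cache_block_bytes are illustrative assumptions, and only the torch.cuda calls are real PyTorch APIs.

```python
import torch

def estimate_num_gpu_blocks(model_runner, cache_block_bytes,
                            gpu_memory_utilization=0.9):
    """Hypothetical sketch of profile-based KV-cache budgeting."""
    # Start from a clean slate so the measurement reflects this run only.
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    # Forward pass with worst-case batch and sequence shapes.
    model_runner.profile_run()
    torch.cuda.synchronize()

    # Total device memory and the peak seen by PyTorch's allocator.
    _, total_memory = torch.cuda.mem_get_info()
    peak_memory = torch.cuda.max_memory_allocated()

    # Everything left under the utilization cap goes to KV-cache blocks.
    available = total_memory * gpu_memory_utilization - peak_memory
    return max(int(available // cache_block_bytes), 0)
```

One caveat with this style of accounting: torch.cuda.max_memory_allocated() only tracks memory that goes through PyTorch's caching allocator, so allocations made outside it (for example, NCCL buffers or another process sharing the GPU) do not show up in peak_memory, and the headroom actually available at inference time can be smaller than the profiled budget.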