vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

GPU utilization decrease during long-term running #2556

Open · WrRan opened this issue 10 months ago

WrRan commented 10 months ago

When using vLLM for offline batch prediction, I see a significant decrease in GPU utilization over long runs. As shown in the graph below, utilization is around 60-70% at 00:00 but drops to 50-60% by 15:00. What could be the reason for this? Is there a plan to fix it?

(screenshot: GPU utilization over time, ~60-70% at 00:00 dropping to ~50-60% by 15:00)

Currently, I suspect GPU memory fragmentation is the cause, and I am trying to alleviate the problem by periodically restarting the system.
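
For readers trying to reproduce this, a minimal offline-batch setup along the lines described above is sketched here; the model name, prompts, and sampling settings are placeholders, not taken from this issue.

```python
# Minimal vLLM offline batch inference sketch (model, prompts, and sampling
# settings are placeholders).
from vllm import LLM, SamplingParams

prompts = [f"Summarize document {i}:" for i in range(10_000)]  # placeholder workload

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=256)
llm = LLM(
    model="facebook/opt-125m",      # placeholder model
    gpu_memory_utilization=0.9,     # fraction of GPU memory reserved for weights + KV cache
)

# One long-running generate() call over the whole batch; GPU utilization
# during this call is what the graph above is measuring.
outputs = llm.generate(prompts, sampling_params)
for out in outputs[:3]:
    print(out.outputs[0].text)
```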

su-park commented 9 months ago

@WrRan

Hello. I am reaching out because I am experiencing a similar situation. I am running long-term inference in a multi-GPU environment (8 GPUs). At some point, GPU utilization drops and inference is delayed, with 2 GPUs at 100% and the rest at 0%. If you ran into a similar issue, could you tell me how you resolved it?
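
Not a fix, but a small logging script like the following (assuming the `pynvml` package is installed; the polling interval and threshold are arbitrary) can record per-GPU utilization and timestamp when the "2 at 100%, rest at 0%" imbalance starts, which helps correlate it with the request log.

```python
# Log per-GPU utilization every 10 s and flag the imbalance pattern.
import time
import pynvml

pynvml.nvmlInit()
count = pynvml.nvmlDeviceGetCount()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(count)]
try:
    while True:
        utils = [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles]
        print(time.strftime("%H:%M:%S"), utils)
        if max(utils) > 90 and min(utils) == 0:
            print("possible stall: some GPUs pinned while others are idle")
        time.sleep(10)
finally:
    pynvml.nvmlShutdown()
```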

hillarysanders commented 8 months ago

Anyone understand why this is happening? cc @WrRan was your issue resolved somehow?

WrRan commented 6 months ago

@hillarysanders @su-park

I work around this problem by periodically restarting the system.
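
One way to automate the "periodically restart" workaround for offline batches, rather than restarting the whole job by hand, is to split the input into chunks and run each chunk in a fresh spawned process so the engine and its GPU memory are torn down and rebuilt between chunks. A rough sketch, where the chunk size, model, and output path are placeholders:

```python
# Run each chunk of prompts in a fresh process; CUDA memory is released
# when the child process exits.
import multiprocessing as mp

def run_chunk(prompts, results_path):
    # Import inside the worker so CUDA is only initialized in the child process.
    from vllm import LLM, SamplingParams

    llm = LLM(model="facebook/opt-125m")  # placeholder model
    outputs = llm.generate(prompts, SamplingParams(max_tokens=256))
    with open(results_path, "a") as f:
        for out in outputs:
            f.write(out.outputs[0].text.replace("\n", " ") + "\n")

if __name__ == "__main__":
    all_prompts = [f"Summarize document {i}:" for i in range(100_000)]  # placeholder
    chunk_size = 5_000  # restart the engine every 5k prompts (tuning knob)

    ctx = mp.get_context("spawn")
    for start in range(0, len(all_prompts), chunk_size):
        chunk = all_prompts[start:start + chunk_size]
        p = ctx.Process(target=run_chunk, args=(chunk, "results.txt"))
        p.start()
        p.join()
        if p.exitcode != 0:
            raise RuntimeError(f"chunk starting at {start} failed with exit code {p.exitcode}")
```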

wanghia commented 3 months ago

I am hitting the same problem. GPU utilization fluctuates a lot, averaging only about 50%, even though there are still pending requests.
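
For the serving case, a rough polling script like the one below can help correlate GPU utilization with the scheduler queue. It assumes the OpenAI-compatible server is running locally and exposes Prometheus metrics at /metrics; the host/port and the exact metric name may differ by vLLM version.

```python
# Poll nvidia-smi and the server's /metrics endpoint every 10 s.
import re
import subprocess
import time
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"  # placeholder host/port

def gpu_utilization():
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        text=True,
    )
    return [int(x) for x in out.split()]

def pending_requests(body):
    # Metric name assumed; check your server's /metrics output.
    m = re.search(r"vllm:num_requests_waiting\s+([0-9.]+)", body)
    return float(m.group(1)) if m else None

while True:
    body = urllib.request.urlopen(METRICS_URL).read().decode()
    print(time.strftime("%H:%M:%S"), "gpu%:", gpu_utilization(), "waiting:", pending_requests(body))
    time.sleep(10)
```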

github-actions[bot] commented 3 days ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!