WrRan opened this issue 10 months ago (Open)
@WrRan
Hello. I am reaching out because I am experiencing a similar situation. I am running long-term inference in a multi-GPU environment (8 GPUs). At some point the GPU utilization drops and inference slows down, with 2 GPUs at 100% and the rest at 0%. If you had a similar issue, could you please tell me how you resolved it?
Does anyone understand why this is happening? cc @WrRan, was your issue ever resolved?
@hillarysanders @su-park
I alleviate this problem by periodically restarting the system.
I am hitting the same problem. GPU utilization fluctuates greatly, with an average of only 50%, even though there are still pending requests.
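In case it helps anyone reproduce this, here is a minimal sketch of one way to log per-GPU utilization over time while the job runs (assumes nvidia-smi is on PATH; the 60 s interval and the gpu_util.csv path are arbitrary choices, not part of any vLLM setup):

```python
# Poll nvidia-smi periodically and append per-GPU utilization to a CSV file.
import csv
import subprocess
import time

def sample_utilization():
    """Return a list of per-GPU utilization percentages reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [int(line) for line in out.strip().splitlines()]

with open("gpu_util.csv", "w", newline="") as f:
    writer = csv.writer(f)
    while True:  # stop with Ctrl-C
        utils = sample_utilization()
        writer.writerow([time.time(), *utils])
        f.flush()
        time.sleep(60)
```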
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
When using vLLM for offline batch prediction, I see a significant decrease in GPU utilization during long-running jobs. As shown in the graph below, utilization is around 60-70% at 00:00 but drops to 50-60% by 15:00. What could be the reason for this? Is there a plan to fix it?
Currently, I suspect GPU memory fragmentation, and I am trying to alleviate the problem by periodically restarting the system.
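Roughly, the periodic-restart workaround looks like the sketch below: each chunk of prompts runs in a fresh process, so the vLLM engine and whatever GPU memory it holds are torn down between chunks. This is only an illustration; the model name, chunk size, tensor_parallel_size, and file paths are placeholder assumptions, not my exact configuration.

```python
# Sketch of offline batch prediction with a fresh vLLM engine per chunk.
import multiprocessing as mp

CHUNK_SIZE = 10_000  # prompts per engine lifetime (placeholder, tune for your workload)

def run_chunk(prompts, output_path):
    # Import inside the worker so the engine lives and dies with this process.
    from vllm import LLM, SamplingParams

    llm = LLM(model="meta-llama/Llama-2-7b-hf",  # placeholder model
              tensor_parallel_size=8)            # placeholder parallelism
    params = SamplingParams(temperature=0.0, max_tokens=256)
    outputs = llm.generate(prompts, params)
    with open(output_path, "a", encoding="utf-8") as f:
        for out in outputs:
            f.write(out.outputs[0].text.replace("\n", " ") + "\n")

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # CUDA requires spawn, not fork
    prompts = [line.strip() for line in open("prompts.txt", encoding="utf-8")]
    for start in range(0, len(prompts), CHUNK_SIZE):
        chunk = prompts[start:start + CHUNK_SIZE]
        p = ctx.Process(target=run_chunk, args=(chunk, "outputs.txt"))
        p.start()
        p.join()  # engine process exits here, releasing all GPU memory
        if p.exitcode != 0:
            raise RuntimeError(f"chunk starting at {start} failed")
```

The point of exiting the process (rather than deleting the LLM object in place) is that the CUDA context itself is destroyed, so any fragmentation accumulated during the chunk cannot carry over to the next one.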