MrD005 opened this issue 2 months ago
Same problem here on 2x L20 GPUs.
Can confirm this is happening. I'm not entirely sure whether it's caused by high load or by a poisoned request that makes it crash, but I have managed to reproduce it simply by bombarding the server with requests. After a point, the server stalls and refuses to accept new requests, even after all outstanding requests have been fulfilled.
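For reference, here is roughly how I generate the load, as a minimal sketch using the Python `tritonclient` HTTP API. The server address, model name ("my_model"), input name ("INPUT__0"), shape, and datatype are placeholders, not taken from this issue; adjust them to your deployment.

```python
# Minimal load-generation sketch. Assumptions: Triton HTTP endpoint at
# localhost:8000, a model "my_model" with one FP32 input "INPUT__0"
# of shape [1, 3, 224, 224]. All of these are placeholders.
import numpy as np
import tritonclient.http as httpclient

URL = "localhost:8000"
MODEL = "my_model"          # placeholder model name
NUM_REQUESTS = 5000
CONCURRENCY = 64

# One shared client with a connection pool sized for concurrent async_infer calls.
client = httpclient.InferenceServerClient(url=URL, concurrency=CONCURRENCY)

data = np.random.rand(1, 3, 224, 224).astype(np.float32)
inp = httpclient.InferInput("INPUT__0", list(data.shape), "FP32")
inp.set_data_from_numpy(data)

# Fire all requests without waiting, then collect results.
pending = [client.async_infer(MODEL, inputs=[inp]) for _ in range(NUM_REQUESTS)]
for i, req in enumerate(pending):
    req.get_result()
    if (i + 1) % 500 == 0:
        print(f"{i + 1}/{NUM_REQUESTS} responses received")
```

With enough sustained pressure, the later `get_result()` calls eventually stop returning, which matches the stall described above.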
Bug Description:
When Triton Inference Server experiences high traffic, it appears to freeze and stops processing incoming requests. During this time, GPU utilization reaches 100% and stays pinned there, but no further requests are processed. This leads to a bottleneck where the server no longer responds to requests until it is restarted or traffic decreases significantly.
Steps to Reproduce:
1. Deploy Triton Inference Server in a GPU-based environment.
2. Send a high volume of concurrent inference requests (e.g., thousands of requests per second).
3. Monitor GPU utilization and request processing (a polling sketch is included after this report).

Observed Behavior:
- GPU utilization spikes to 100% and remains stuck at that level.
- No new requests are processed after the spike.
- The server becomes unresponsive.

Expected Behavior:
- Triton Inference Server should continue processing requests and properly manage the traffic load without freezing.
- GPU utilization should fluctuate with the load but not lead to a total freeze.

Environment:
- GPU model: 2x H100
- CUDA version: 11.8
- TensorRT version: 0.10.0.dev2024043000
- OS: Ubuntu 22
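Not part of the original report, but here is one way to watch for the stall described in step 3: poll Triton's liveness/readiness endpoints and GPU utilization once a second, so the exact moment the server stops responding is easy to spot. The port and polling interval are assumptions.

```python
# Monitoring sketch. Assumptions: Triton HTTP endpoint at localhost:8000 and
# nvidia-smi available on the host. Prints health and GPU utilization each second.
import subprocess
import time

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

while True:
    try:
        live = client.is_server_live()
        ready = client.is_server_ready()
    except Exception as exc:
        live, ready = False, False
        print(f"health check failed: {exc}")

    # Query per-GPU utilization; nvidia-smi prints one line per GPU, e.g. "100 %".
    util = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader"],
        capture_output=True, text=True,
    ).stdout.strip().replace("\n", ", ")

    print(f"live={live} ready={ready} gpu_util=[{util}]")
    time.sleep(1)
```

When the freeze occurs, utilization stays at 100% on both GPUs while the readiness/liveness calls start timing out or returning errors.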