triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

Triton Inference Server Stops Processing Requests under High Traffic, GPU Utilization Stuck at 100% #574

Open MrD005 opened 2 months ago

MrD005 commented 2 months ago

Bug Description: When the Triton Inference Server experiences high traffic, it appears to freeze and stops processing incoming requests. During this time, the GPU utilization reaches 100% and stays stuck at that level, but no further requests are processed.

This issue leads to a bottleneck where the server no longer responds to requests until it is restarted or traffic decreases significantly.

Steps to Reproduce:

1. Deploy Triton Inference Server in a GPU-based environment.
2. Send a high volume of concurrent inference requests (e.g., thousands of requests per second); a load-generation sketch is included after the environment details below.
3. Monitor GPU utilization and request processing.

Observed Behavior:

- GPU utilization spikes to 100% and remains stuck at that level.
- No new requests are processed after the spike.
- The server becomes unresponsive.

Expected Behavior:

- Triton Inference Server should continue processing requests and manage the traffic load without freezing.
- GPU utilization should fluctuate with the load rather than lock up entirely.

Environment:

- GPU: 2x H100
- CUDA version: 11.8
- TensorRT-LLM version: 0.10.0.dev2024043000
- OS: Ubuntu 22
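For generating the traffic described in the steps above, here is a minimal load-generation sketch. It assumes the stock TensorRT-LLM `ensemble` model exposing `text_input`, `max_tokens`, and `text_output` tensors and Triton's HTTP endpoint on `localhost:8000`; the model name, tensor names, URL, and concurrency level are placeholders to adjust for the actual deployment.

```python
# load_test.py - drive Triton with many concurrent requests to reproduce the stall.
# Assumes the stock TensorRT-LLM "ensemble" model with text_input/max_tokens tensors;
# model name, tensor names, URL, and concurrency are placeholders for the real deployment.
import numpy as np
import tritonclient.http as httpclient

URL = "localhost:8000"   # Triton HTTP endpoint (assumed)
MODEL = "ensemble"       # TensorRT-LLM ensemble model name (assumed)
CONCURRENCY = 256        # number of requests kept in flight
TOTAL = 5000             # total requests to send


def build_inputs(i: int):
    # One prompt and a max_tokens cap per request, shaped [1, 1] as the ensemble expects.
    text = np.array([[f"Request {i}: write one short sentence."]], dtype=object)
    tokens = np.array([[64]], dtype=np.int32)
    text_in = httpclient.InferInput("text_input", [1, 1], "BYTES")
    text_in.set_data_from_numpy(text)
    tokens_in = httpclient.InferInput("max_tokens", [1, 1], "INT32")
    tokens_in.set_data_from_numpy(tokens)
    return [text_in, tokens_in]


if __name__ == "__main__":
    # "concurrency" sets the size of the client's connection pool used by async_infer.
    client = httpclient.InferenceServerClient(url=URL, concurrency=CONCURRENCY)
    ok = 0
    for start in range(0, TOTAL, CONCURRENCY):
        # Launch a wave of async requests, then wait for all of them to finish.
        pending = [client.async_infer(MODEL, build_inputs(i))
                   for i in range(start, min(start + CONCURRENCY, TOTAL))]
        for req in pending:
            try:
                result = req.get_result()
                if result.as_numpy("text_output") is not None:
                    ok += 1
            except Exception as exc:
                print(f"request failed: {exc}")
        print(f"completed {min(start + CONCURRENCY, TOTAL)}/{TOTAL}, ok={ok}")
    client.close()
```

Sustained runs of this kind of concurrent traffic are the conditions under which the freeze was observed; the exact concurrency needed will depend on the model and batching configuration.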

hcnhcn012 commented 2 months ago

Same problem on 2x L20 GPUs.

gabriel-peracio commented 1 month ago

Can confirm this is happening. I'm not entirely sure whether it is due to high load or a poisoned request that makes it crash, but I have managed to reproduce it simply by bombarding the server with requests.

After a point, the server stalls and refuses to accept new requests, even after all the other requests have been fulfilled.
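A quick way to confirm the stall from the outside is to poll Triton's readiness endpoint alongside GPU utilization while the load runs; the pattern reported here is utilization pinned at 100% while readiness checks stop succeeding. A minimal monitoring sketch, assuming `pynvml` is installed and Triton's HTTP port is 8000:

```python
# stall_watch.py - poll Triton readiness and GPU utilization to spot the freeze.
# Assumes Triton's HTTP endpoint on localhost:8000 and NVML (pynvml) available.
import time

import requests
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

while True:
    # /v2/health/ready returns 200 only while the server can accept inference requests.
    try:
        ready = requests.get("http://localhost:8000/v2/health/ready",
                             timeout=5).status_code == 200
    except requests.RequestException:
        ready = False

    # Per-GPU utilization as reported by NVML (same figure nvidia-smi shows).
    utils = [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles]
    print(f"{time.strftime('%H:%M:%S')} ready={ready} gpu_util={utils}")
    time.sleep(5)
```

When the server wedges, this should show readiness flipping to False (or timing out) while the utilization numbers stay at 100%, which matches the behavior described in this issue.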