siddhatiwari opened this issue 3 months ago (status: Open)
How are you serving the request? Are you launching using the official launch_triton_server.py file?
I am experiencing this too, @dhruvmullick. It started with the 0.13 dev version; all 0.12 versions work just fine. This is an issue with the server itself, not with the way it is launched: it probabilistically hangs with 100% GPU utilization even though no requests are in flight.
Are there any updates on this? I'm still experiencing it on the latest TRT-LLM and backend versions.
@dhruvmullick I'm launching using the tritonserver CLI and using .stream_infer() from the gRPC client library to send requests: https://github.com/triton-inference-server/client/blob/cb9ba08b3f88dff802485f0577b008cdbf41c529/src/python/library/tritonclient/grpc/aio/__init__.py#L688
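For reference, a minimal sketch of that call pattern with the asyncio gRPC client; the model name ("ensemble") and tensor names ("text_input"/"text_output") are placeholders I've assumed, not details taken from this issue:

```python
import asyncio

import numpy as np
import tritonclient.grpc.aio as grpcclient
from tritonclient.grpc import InferInput

URL = "localhost:8001"      # default Triton gRPC port
MODEL_NAME = "ensemble"     # placeholder model name, not from this issue


async def request_iterator():
    # stream_infer() consumes an async iterator; each yielded dict mirrors
    # the keyword arguments of a regular infer() call.
    text = np.array([["What is Triton?"]], dtype=object)
    text_input = InferInput("text_input", list(text.shape), "BYTES")  # placeholder tensor name
    text_input.set_data_from_numpy(text)
    yield {
        "model_name": MODEL_NAME,
        "inputs": [text_input],
        "request_id": "req-0",
    }


async def main():
    client = grpcclient.InferenceServerClient(url=URL)
    try:
        # In decoupled mode one request can produce several responses;
        # stream_infer() yields (result, error) pairs as they arrive.
        async for result, error in client.stream_infer(request_iterator()):
            if error is not None:
                raise error
            print(result.as_numpy("text_output"))  # placeholder output name
    finally:
        await client.close()


asyncio.run(main())
```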
These might also be the same issue: https://github.com/triton-inference-server/tensorrtllm_backend/issues/574 https://github.com/triton-inference-server/tensorrtllm_backend/issues/596
This now seems to be fixed as of the November 5th update to the main branch.
System Info
After roughly 30 seconds of inference requests, the inference server stalls and stops responding to requests. No errors or crashes are visible in the logs. The server is running in decoupled mode with dynamic_batching.
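For context, a minimal config.pbtxt sketch of that combination (decoupled transaction policy plus dynamic batching). It is illustrative only, not the reporter's actual configuration; the model name, backend, and numeric values are assumptions:

```protobuf
# Illustrative sketch only -- not the configuration from this report.
name: "tensorrt_llm"      # assumed model name
backend: "tensorrtllm"
max_batch_size: 8         # assumed value

# Decoupled mode: the model may send zero or more responses per request
# over the streaming connection.
model_transaction_policy {
  decoupled: true
}

# Dynamic batching: Triton groups incoming requests into batches server-side.
dynamic_batching {
  max_queue_delay_microseconds: 100   # assumed value
}
```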
These are the parameters for the engine used:
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
Inference server doesn't stall
Actual behavior
Inference server stalls
Additional notes
Initial requests complete successfully, so I'm not sure why it stalls afterwards.