appearancefnp opened this issue 1 month ago
Thanks @appearancefnp, I've created an internal ticket for the team to investigate. What types of models are you using, so that we can reproduce the issue easily?
@appearancefnp, if possible, could you please share models that reproduce the issue? Could you also try running on Triton Server 24.04 (before the TensorRT 10 upgrade) and confirm whether the issue still occurs? Thank you.
@oandreeva-nv I mostly use CNN models, plus 2 Transformer-based models; quite a lot of them in total. These issues were not present in Triton Server 24.01 (I haven't tested other versions). @pskiran1 How can we share the models securely?
These issues were not present in Triton Server 24.01 (haven't tested other versions).
@appearancefnp, thanks for letting us know. We believe this issue is similar to a known TensorRT issue introduced in 24.05. We will confirm by trying to reproduce it on our end.
How can we share the models securely?
I believe you can share it via Google Drive with no public access. We can request access, and once we notify you that we have downloaded the models, you can delete them from Google Drive to avoid further access requests. Please share the reproduction models, client code, and the necessary steps with us. Thank you.
Hello,
We have had a similar issue since Triton 24.05. It's reassuring to see we are not the only ones with this problem. We have already filed an NVBug, but I will post some details about our setup here in case they are of use to the author of this issue.
NVIDIA Bug ID: 4765405
Bug Description We are encountering an inference hang issue when deploying models using TensorRT with Triton Server. The key details are:
Affected Models:
Symptoms:
- After handling a few successful inference requests, GPU utilization spikes to 100% and the server becomes unresponsive.
- Inference requests remain in the PENDING/EXECUTING state without progressing.
- The issue is reproducible with Triton Server versions 24.05 and later (incorporating TensorRT 10.3 and 10.5), while earlier versions do not exhibit this problem.
- Deploying multiple instances of the same Python model on a single GPU worsens the issue.
- With Triton Server 24.07 and TensorRT 10.3, the problem manifests more quickly.
Reproducible Steps
Environment Setup:
Deployment Configuration:
Model Invocation:
Workaround Attempts:
Set CUDA_LAUNCH_BLOCKING=1: enables synchronous CUDA operations to help identify CUDA errors, at the cost of potentially degrading performance. As above, this leads to unacceptable latencies (a launch sketch is shown below).
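For reference, a minimal sketch of launching the Triton container with this variable set; the image tag, model repository path, and port mappings are placeholders, not the actual deployment configuration:

```python
# Sketch only: launch the Triton container with CUDA_LAUNCH_BLOCKING=1.
# The image tag, host model repository path, and port mappings below are
# placeholders and must be adjusted to the real setup.
import subprocess

subprocess.run(
    [
        "docker", "run", "--rm", "--gpus", "all",
        "-e", "CUDA_LAUNCH_BLOCKING=1",              # force synchronous CUDA kernel launches
        "-v", "/path/to/model_repository:/models",   # placeholder host path
        "-p", "8000:8000", "-p", "8001:8001", "-p", "8002:8002",
        "nvcr.io/nvidia/tritonserver:24.08-py3",
        "tritonserver", "--model-repository=/models",
    ],
    check=True,
)
```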
In our case, the issue seems to come from TensorRT, and we have been told that they are working on fixing the bug.
Description When running the latest Triton Inference Server, everything runs fine at first. It can run normally for multiple hours, but then the Triton Server suddenly stalls: GPU utilization sits at 100% and the pending request count grows until RAM is full and the server crashes.
Pending requests:
GPU Utilization:
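Since the hang only shows up after hours or days, a small watchdog polling the Prometheus metrics endpoint can flag the moment the pending count starts climbing. A minimal sketch, assuming the default metrics port 8002 and the nv_inference_pending_request_count gauge (the exact metric name can vary between Triton versions):

```python
# Watchdog sketch: poll Triton's metrics endpoint and warn when the total
# pending request count keeps growing. The endpoint URL, threshold, and metric
# name are assumptions to adapt to the actual deployment.
import re
import time
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"
PENDING_RE = re.compile(
    r"^nv_inference_pending_request_count\{[^}]*\}\s+([0-9.eE+-]+)", re.MULTILINE
)

def total_pending(metrics_text: str) -> float:
    # Sum the per-model pending-request gauges into one number.
    return sum(float(v) for v in PENDING_RE.findall(metrics_text))

previous = 0.0
while True:
    body = urllib.request.urlopen(METRICS_URL, timeout=5).read().decode()
    pending = total_pending(body)
    if pending > previous and pending > 100:   # arbitrary threshold
        print(f"possible hang: {pending:.0f} requests pending and growing")
    previous = pending
    time.sleep(30)
```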
Triton Information What version of Triton are you using? 24.08
Are you using the Triton container or did you build it yourself? Triton Container
To Reproduce Steps to reproduce the behavior: it is hard to reproduce because it takes multiple hours/days. It is just multiple models being served on the TensorRT backend. Possibly a race condition or a deadlock?
Describe the models: I have ~30 TensorRT models running. The inference requests come randomly. The server has 2 A5000 GPUs.
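A load-generation sketch along these lines may help others trying to reproduce: several threads fire requests at randomly chosen models. The model names, input name, shape, and dtype below are placeholders, not the actual models from this setup:

```python
# Reproduction-attempt sketch: worker threads send inference requests to
# randomly chosen models via the Triton HTTP client. Model names, input name,
# shape, and dtype are placeholders and must be replaced with real ones.
import random
import threading

import numpy as np
import tritonclient.http as httpclient

URL = "localhost:8000"
MODELS = [f"model_{i}" for i in range(30)]            # placeholder model names

def worker() -> None:
    client = httpclient.InferenceServerClient(url=URL)
    while True:
        model = random.choice(MODELS)
        data = np.random.rand(1, 3, 224, 224).astype(np.float32)        # placeholder shape
        inp = httpclient.InferInput("input", list(data.shape), "FP32")  # placeholder input name
        inp.set_data_from_numpy(data)
        client.infer(model, inputs=[inp])             # blocks until the response arrives

threads = [threading.Thread(target=worker, daemon=True) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```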
Here is a log from when the server hangs. In the log files (verbosity level 2), you can see that at one point the state no longer changes.
This is a normal scenario:
Here you can see that most inference requests just go to the INITIALIZED state and never reach EXECUTING.
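To spot such stuck requests in the verbose logs automatically, a rough scan like the one below can help. It assumes each request ID appears on the same log line as one of the state names quoted above; that line format is an assumption, not a documented contract, so the regex will likely need adjusting:

```python
# Rough log-scan sketch: report request IDs whose last observed state is
# INITIALIZED (i.e. they never reached EXECUTING). The line pattern below is
# an assumed log format and must be adapted to the actual verbose log output.
import re
import sys

# Assumed pattern: "... request <id> ... <STATE> ..." on one line.
LINE_RE = re.compile(r"request\s+(\S+).*?\b(INITIALIZED|PENDING|EXECUTING)\b")

last_state = {}   # request id -> last state seen in the log
with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
    for line in log:
        match = LINE_RE.search(line)
        if match:
            request_id, state = match.groups()
            last_state[request_id] = state

stuck = [rid for rid, state in last_state.items() if state == "INITIALIZED"]
print(f"{len(stuck)} requests never progressed past INITIALIZED")
```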
Expected behavior The server should keep serving inference requests without hanging; GPU utilization should not get stuck at 100%, and requests should not accumulate in the PENDING state.
NVIDIA - please fix.