namogg opened this issue 5 months ago
I got the same problem with inference time when using Triton server. My YOLOv8n model's TensorRT inference time is 0.0027 s (AGX Orin). When I run inference through Triton, the total time is 0.033 s, which is far slower than the original inference. On the AGX Orin, YOLOv8 preprocessing takes 0.014 s, inference (through the server) takes 0.0187 s, and postprocessing takes 0.000816 s. I think the time spent sending data to the server is quite significant. How can I decrease the server processing time?
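(One way to see where that 0.033 s goes is to compare the client-side wall time against Triton's per-model statistics, which report queue and compute time on the server; the difference is roughly transfer/serialization overhead. A minimal sketch using the gRPC Python client is below; the model name and endpoint are assumptions, not taken from this issue.)

```python
import tritonclient.grpc as grpcclient

# Assumed endpoint and model name; adjust to your deployment.
client = grpcclient.InferenceServerClient(url="localhost:8001")
stats = client.get_inference_statistics(model_name="yolov8n_trt")

for m in stats.model_stats:
    s = m.inference_stats
    count = max(s.success.count, 1)
    # ns fields are cumulative; divide by request count for per-request averages.
    print("avg queue          ms:", s.queue.ns / count / 1e6)
    print("avg compute_input  ms:", s.compute_input.ns / count / 1e6)
    print("avg compute_infer  ms:", s.compute_infer.ns / count / 1e6)
    print("avg compute_output ms:", s.compute_output.ns / count / 1e6)
```

If compute_infer matches the raw TensorRT time but the client-measured total is much larger, the gap is mostly in request transport and (de)serialization rather than in the model itself.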
In my testing, communication time is not that significant. Even though Triton's throughput is higher than plain TensorRT when handling multiple requests, the latency for each client is too high and unreliable.
Description
I'm using a simple client inference class based on the client example. My TensorRT inference with batch size 10 takes 150 ms, while my Triton deployment with the TensorRT backend takes 1100 ms. This is my client:
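(The client code itself is not reproduced here. A minimal sketch of a batched gRPC client modeled on the Triton Python client example might look like the following; the model name, input/output tensor names, and shapes are assumptions, not the actual client from this issue.)

```python
import numpy as np
import tritonclient.grpc as grpcclient


class TritonYoloClient:
    """Minimal batched gRPC client sketch (names and shapes are assumptions)."""

    def __init__(self, url="localhost:8001", model_name="yolov8n_trt"):
        self.client = grpcclient.InferenceServerClient(url=url)
        self.model_name = model_name

    def infer(self, batch):
        # batch: float32 array of shape (N, 3, 640, 640), N <= max_batch_size
        inp = grpcclient.InferInput("images", list(batch.shape), "FP32")
        inp.set_data_from_numpy(batch)
        out = grpcclient.InferRequestedOutput("output0")
        result = self.client.infer(self.model_name, inputs=[inp], outputs=[out])
        return result.as_numpy("output0")


if __name__ == "__main__":
    client = TritonYoloClient()
    dummy = np.zeros((10, 3, 640, 640), dtype=np.float32)  # batch size 10
    print(client.infer(dummy).shape)
```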
Triton Information
What version of Triton are you using? 2.42
Are you using the Triton container or did you build it yourself? Container.
To Reproduce
Model config:
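(The original config.pbtxt is not included in this excerpt. A plausible configuration for a TensorRT plan with max batch size 10 might look like the sketch below; every name, dimension, and count here is an assumption for illustration, not the poster's actual config.)

```
name: "yolov8n_trt"          # assumed model name
platform: "tensorrt_plan"
max_batch_size: 10
input [
  {
    name: "images"           # assumed input tensor name
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]
  }
]
output [
  {
    name: "output0"          # assumed output tensor name
    data_type: TYPE_FP32
    dims: [ 84, 8400 ]
  }
]
instance_group [
  {
    kind: KIND_GPU
    count: 1
  }
]
```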
Perf analyzer:
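(The perf_analyzer output is not included in this excerpt. A typical invocation to reproduce the batched measurement might look like the line below; the model name and endpoint are assumptions.)

```
perf_analyzer -m yolov8n_trt -b 10 -u localhost:8001 -i grpc --concurrency-range 1:4
```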
Expected behavior
Triton should be able to run at the same speed as plain TensorRT, and even faster with concurrent requests.