triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Inference Time in Triton Server Responses #6692

Open teith opened 11 months ago

teith commented 11 months ago

Is your feature request related to a problem? Please describe.
Yes. Currently, Triton Inference Server doesn't provide the per-request inference time in the HTTP/gRPC response. This makes real-time performance monitoring and analysis less efficient, as we need to rely on aggregated metrics or separate request tracing.

Describe the solution you'd like
I'd like the server to include the exact inference time for each individual request in the HTTP/gRPC response, either in a header or in the body. This would allow immediate and precise monitoring of model performance.

Describe alternatives you've considered
I've looked into Prometheus metrics and request tracing, but these provide aggregated data or require additional processing to recover individual request timings.

Additional context
Having the per-request inference time directly in the response would significantly enhance real-time monitoring and optimization for Triton users.
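
For reference, the aggregated timings mentioned under the alternatives above are what the server already exposes through its statistics extension. A minimal sketch of reading them with the Python HTTP client (the server URL and model name are placeholders, and the exact statistics schema may differ between Triton versions):

```python
# Sketch: read Triton's aggregated per-model timing statistics.
# Assumes tritonclient[http] is installed and a model named "my_model"
# is loaded on a server at localhost:8000 (both are placeholders).
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

stats = client.get_inference_statistics(model_name="my_model")
for model in stats.get("model_stats", []):
    infer = model["inference_stats"]
    count = infer["success"]["count"]
    if count:
        # Durations are cumulative nanoseconds across all requests, so only
        # an average per request can be derived, not the timing of any
        # individual request -- which is the gap this issue is about.
        avg_compute_ms = infer["compute_infer"]["ns"] / count / 1e6
        avg_queue_ms = infer["queue"]["ns"] / count / 1e6
        print(model["name"],
              f"avg compute {avg_compute_ms:.2f} ms, avg queue {avg_queue_ms:.2f} ms")
```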

denti commented 11 months ago

+1

d2avids commented 11 months ago

+1

D1-3105 commented 11 months ago

+1

atomofiron commented 11 months ago

+1

SkobelkinYaroslav commented 11 months ago

+1

Alexey-Sandor commented 11 months ago

+1

oandreeva-nv commented 11 months ago

Hi @teith, could you please clarify why tracing doesn't work for your case? Both OpenTelemetry and the Triton Trace APIs collect per-request timestamps for when a request arrived at the server, was queued, started execution, finished execution, and left the server.
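
For illustration, a rough sketch of pulling those per-request timestamps out of a Triton trace file. This assumes the server was started with trace collection enabled (e.g. `--trace-file=/tmp/trace.json --trace-rate=1 --trace-level=TIMESTAMPS` on older releases, or the equivalent `--trace-config` options on newer ones); the exact flags and JSON layout vary across versions:

```python
# Sketch: compute per-request durations from a Triton trace file.
# Assumes /tmp/trace.json (placeholder path) follows the Triton trace JSON
# layout, where records sharing an "id" carry either model metadata or
# "timestamps" entries such as REQUEST_START, QUEUE_START, COMPUTE_START,
# COMPUTE_END, and REQUEST_END.
import json
from collections import defaultdict

with open("/tmp/trace.json") as f:
    records = json.load(f)

timestamps = defaultdict(dict)
for record in records:
    for ts in record.get("timestamps", []):
        timestamps[record["id"]][ts["name"]] = ts["ns"]

for trace_id, ts in timestamps.items():
    if "COMPUTE_START" in ts and "COMPUTE_END" in ts:
        compute_ms = (ts["COMPUTE_END"] - ts["COMPUTE_START"]) / 1e6
        print(f"trace {trace_id}: compute {compute_ms:.3f} ms")
```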

oandreeva-nv commented 11 months ago

Feel free to tag me; that way I'll receive a notification and an email and will be able to respond more quickly.

teith commented 11 months ago

Hi, @oandreeva-nv! Thank you for your response. Tracing isn't ideal for our case because we use Triton within our service and need to bill each user for their inference requests. Using OpenTelemetry would generate a vast number of traces and overwhelm our resources. It would therefore be ideal if inference timings were included directly in the response to the inference request; that would make our billing process much more convenient and efficient.
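
One possible interim workaround is to time each call on the client side and record that per user. A minimal sketch with the Python HTTP client (model name, input name, and shape are placeholders); note that round-trip time includes network and (de)serialization overhead, so it overstates the pure inference time:

```python
# Sketch: time each inference call on the client side as a billing proxy.
# Model name, input name, and shape are placeholders.
import time
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(1, 3).astype(np.float32)
inp = httpclient.InferInput("INPUT0", data.shape, "FP32")
inp.set_data_from_numpy(data)

start = time.perf_counter()
result = client.infer(model_name="my_model", inputs=[inp])
elapsed_ms = (time.perf_counter() - start) * 1e3
print(f"round-trip latency: {elapsed_ms:.2f} ms")  # record per user for billing
```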

oandreeva-nv commented 11 months ago

Would Custom Metrics and the Triton C API for custom metrics be of any help to you?
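
As a rough sketch of what that could look like from a Python backend model using the custom metrics API (the metric name and labels are illustrative only; the C API provides the equivalent for other backends):

```python
# Sketch: a Python backend model that records its own per-request compute
# time as a custom metric. Metric name and labels are illustrative.
import time
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Counter metric family accumulating nanoseconds spent in execute().
        self.latency_family = pb_utils.MetricFamily(
            name="custom_request_compute_ns",
            description="Cumulative compute time per request (ns)",
            kind=pb_utils.MetricFamily.COUNTER,
        )
        self.latency_metric = self.latency_family.Metric(
            labels={"model": args["model_name"]}
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            start = time.perf_counter_ns()
            # ... run the actual model here and build real output tensors ...
            responses.append(pb_utils.InferenceResponse(output_tensors=[]))
            self.latency_metric.increment(time.perf_counter_ns() - start)
        return responses
```

Note that such a metric is still surfaced through the aggregated Prometheus `/metrics` endpoint rather than in the inference response itself, so it may not fully cover the per-response billing use case described above.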