triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Inference Time in Triton Server Responses #6692

Open teith opened 6 months ago

teith commented 6 months ago

Is your feature request related to a problem? Please describe.
Yes, currently Triton Inference Server doesn't provide the per-request inference time in the HTTP/gRPC response. This makes real-time performance monitoring and analysis less efficient, since we have to rely on aggregated metrics or separate request tracing.

Describe the solution you'd like
I'd like the server to include the exact inference time for each individual request in the HTTP/gRPC response, either in a header or in the body. This would allow immediate and precise monitoring of model performance.

Describe alternatives you've considered
I've looked into using Prometheus metrics and request tracing, but these provide aggregated data or require additional processing to obtain individual request timings.

Additional context
Having the per-request inference time directly in the response would significantly enhance real-time monitoring and optimization for Triton users.
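
For reference, the closest stop-gap we have found is measuring the round trip on the client, which includes network and serialization overhead and therefore only approximates the server-side inference time this request is about. A minimal sketch with the Python HTTP client (model name, tensor names, and shapes are placeholders):

```python
import time

import numpy as np
import tritonclient.http as httpclient

# Placeholder model/tensor names -- substitute your own model configuration.
client = httpclient.InferenceServerClient(url="localhost:8000")

inputs = [httpclient.InferInput("INPUT0", [1, 16], "FP32")]
inputs[0].set_data_from_numpy(np.random.rand(1, 16).astype(np.float32))

start = time.perf_counter()
result = client.infer(model_name="my_model", inputs=inputs)
elapsed_ms = (time.perf_counter() - start) * 1000.0

# Round-trip time only: it includes network, (de)serialization, and queueing,
# so it overestimates the pure inference time we would like the server to report.
print(f"client-observed latency: {elapsed_ms:.2f} ms")
```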

denti commented 6 months ago

+1

d2avids commented 6 months ago

+1

D1-3105 commented 6 months ago

+1

atomofiron commented 6 months ago

+1

SkobelkinYaroslav commented 6 months ago

+1

Alexey-Sandor commented 6 months ago

+1

oandreeva-nv commented 6 months ago

Hi @teith, could you please clarify why tracing doesn't work for your case? Both OpenTelemetry and the Triton Trace APIs collect per-request timestamps for when a request arrived at the server, was queued, started execution, finished execution, and left the server.
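
For illustration only (the exact trace-file layout can vary between releases, so treat this as a sketch and check the tracing docs for your version): with Triton-mode tracing enabled, those timestamps land in a JSON trace file that can be post-processed offline, e.g.:

```python
import json
from collections import defaultdict

# Sketch only: assumes the server was started with trace flags such as
#   --trace-file=trace.json --trace-level=TIMESTAMPS --trace-rate=1
# (newer releases configure the same thing via --trace-config), and that each
# trace entry carries an "id" plus an optional "timestamps" list of
# {"name": ..., "ns": ...} records. Verify the format for your Triton version.

def per_request_compute_ms(trace_path: str) -> dict:
    timestamps = defaultdict(dict)
    with open(trace_path) as f:
        for entry in json.load(f):
            for ts in entry.get("timestamps", []):
                timestamps[entry["id"]][ts["name"]] = ts["ns"]

    durations = {}
    for request_id, ts in timestamps.items():
        if "COMPUTE_START" in ts and "COMPUTE_END" in ts:
            durations[request_id] = (ts["COMPUTE_END"] - ts["COMPUTE_START"]) / 1e6
    return durations

if __name__ == "__main__":
    for request_id, ms in per_request_compute_ms("trace.json").items():
        print(f"request {request_id}: compute {ms:.2f} ms")
```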

oandreeva-nv commented 6 months ago

Feel free to tag me; that way I'll receive a notification and an email and will be able to respond more quickly.

teith commented 6 months ago

Hi, @oandreeva-nv! Thank you for your response. Tracing isn't ideal for our case because we use Triton within our service and need to bill each user's inference requests. Using OpenTelemetry would generate a vast number of traces and overwhelm our resources. It would therefore be ideal if the inference timing could be included directly in the response to the inference request; that would be extremely convenient and efficient for our billing process.
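
One workaround we are evaluating, assuming it is acceptable to put a Python-backend (BLS) wrapper in front of the served model: the wrapper times the nested inference and returns the duration as an extra output tensor, so the timing travels inside the normal response. All model and tensor names below are hypothetical, and the wrapper's config.pbtxt would need to declare the extra output:

```python
import time

import numpy as np
import triton_python_backend_utils as pb_utils  # available inside Triton's Python backend


class TritonPythonModel:
    """Hypothetical BLS wrapper that forwards requests to "wrapped_model"
    and appends the measured execution time as an extra output tensor."""

    def execute(self, requests):
        responses = []
        for request in requests:
            infer_request = pb_utils.InferenceRequest(
                model_name="wrapped_model",                        # hypothetical target model
                requested_output_names=["OUTPUT0"],                # hypothetical output name
                inputs=[pb_utils.get_input_tensor_by_name(request, "INPUT0")],
            )

            start = time.perf_counter()
            infer_response = infer_request.exec()                  # synchronous BLS call
            elapsed_ms = (time.perf_counter() - start) * 1000.0

            if infer_response.has_error():
                responses.append(pb_utils.InferenceResponse(
                    error=pb_utils.TritonError(infer_response.error().message())))
                continue

            output = pb_utils.get_output_tensor_by_name(infer_response, "OUTPUT0")
            timing = pb_utils.Tensor(
                "INFERENCE_TIME_MS", np.array([elapsed_ms], dtype=np.float32)
            )
            responses.append(pb_utils.InferenceResponse(output_tensors=[output, timing]))
        return responses
```

The measured value is the nested call as observed by the wrapper, so it includes the nested request's queueing inside Triton rather than pure compute time, but it would be good enough for billing.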

oandreeva-nv commented 6 months ago

Would Custom Metrics or the custom-metrics support in the Triton C API be of any help to you?
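
For example, here is a minimal sketch of what that could look like with the Python backend's custom-metrics API (metric and tensor names are placeholders, and the API is only available in recent releases). Note that these values are exposed on the /metrics endpoint, so they are scraped by Prometheus rather than returned per response:

```python
import time

import numpy as np
import triton_python_backend_utils as pb_utils  # Python backend only


class TritonPythonModel:
    """Sketch: report compute time through Triton's custom-metrics API.
    Metric and tensor names are placeholders."""

    def initialize(self, args):
        self.latency_family = pb_utils.MetricFamily(
            name="custom_inference_time_ms",
            description="Compute time observed inside the model",
            kind=pb_utils.MetricFamily.GAUGE,
        )
        self.latency = self.latency_family.Metric(labels={"model": args["model_name"]})

    def execute(self, requests):
        responses = []
        for request in requests:
            start = time.perf_counter()
            data = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            result = pb_utils.Tensor("OUTPUT0", data.astype(np.float32))  # placeholder "model"
            responses.append(pb_utils.InferenceResponse(output_tensors=[result]))
            # Exposed via /metrics: aggregated by the scraper, not returned per response.
            self.latency.set((time.perf_counter() - start) * 1000.0)
        return responses
```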