Open · teith opened this issue 11 months ago
+1
+1
+1
+1
+1
+1
Hi @teith, could you please clarify why tracing doesn't work for your case? Both the OpenTelemetry and Triton trace APIs collect per-request timestamps for when the request arrived at the server, was queued, started execution, finished execution, and left the server.
Feel free to tag me; that way I'll receive a notification and an email and will be able to respond more quickly.
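For reference, here is a minimal sketch of turning the timestamps from a collected trace file into per-request latencies. It assumes trace collection was enabled in Triton's JSON trace mode and that entries carry an `id` plus timestamp names such as `REQUEST_START`, `QUEUE_START`, `COMPUTE_START`, `COMPUTE_END`, and `REQUEST_END` in nanoseconds; the exact file layout and names can differ between releases, so treat this as a starting point rather than a reference implementation.

```python
import json
from collections import defaultdict

# Sketch only: the trace file layout (a JSON array of entries with "id" and
# "timestamps" fields) and the timestamp names below are assumptions and may
# vary between Triton releases.
TRACE_FILE = "trace.json"  # path used when trace collection was enabled

with open(TRACE_FILE) as f:
    entries = json.load(f)

# Timestamps for a single request can be spread across several entries that
# share the same id, so group them first.
per_request = defaultdict(dict)
for entry in entries:
    for ts in entry.get("timestamps", []):
        per_request[entry["id"]][ts["name"]] = ts["ns"]

for req_id, ts in sorted(per_request.items()):
    if "REQUEST_START" in ts and "REQUEST_END" in ts:
        total_ms = (ts["REQUEST_END"] - ts["REQUEST_START"]) / 1e6
        parts = [f"total={total_ms:.3f} ms"]
        if "QUEUE_START" in ts and "COMPUTE_START" in ts:
            parts.append(f"queue={(ts['COMPUTE_START'] - ts['QUEUE_START']) / 1e6:.3f} ms")
        if "COMPUTE_START" in ts and "COMPUTE_END" in ts:
            parts.append(f"compute={(ts['COMPUTE_END'] - ts['COMPUTE_START']) / 1e6:.3f} ms")
        print(f"request {req_id}: " + ", ".join(parts))
```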
Hi @oandreeva-nv! Thank you for your response. Tracing isn't ideal for our case because we run Triton inside our own service and need to bill each user for their inference requests. Using OpenTelemetry would generate a vast number of traces and overwhelm our resources. It would therefore be ideal if inference timings could be included directly in the response to each inference request; that would make our billing process much more convenient and efficient.
Would custom metrics (the custom metrics support in the TRITONSERVER C API) be of any help to you?
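As an illustration of that suggestion, here is a rough sketch of a Python-backend model that reports how long each request spent inside `execute()` as a custom metric. It assumes the Python backend's `pb_utils.MetricFamily` / `Metric` API (available in recent releases) and uses placeholder tensor names `INPUT0`/`OUTPUT0`.

```python
# model.py -- sketch of per-request timing exposed as a custom metric from a
# Python backend model. Assumes pb_utils.MetricFamily/Metric are available and
# that the model config defines INPUT0/OUTPUT0 (placeholders for illustration).
import time

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # One cumulative counter of nanoseconds spent handling requests.
        self.latency_family = pb_utils.MetricFamily(
            name="custom_request_compute_ns",
            description="Cumulative per-request compute time in nanoseconds",
            kind=pb_utils.MetricFamily.COUNTER,
        )
        self.latency_metric = self.latency_family.Metric(
            labels={"model": args["model_name"]}
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            start = time.perf_counter_ns()
            # Placeholder "inference": copy INPUT0 to OUTPUT0.
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            out0 = pb_utils.Tensor("OUTPUT0", in0.as_numpy())
            responses.append(pb_utils.InferenceResponse(output_tensors=[out0]))
            # Record the time this single request spent in execute().
            self.latency_metric.increment(time.perf_counter_ns() - start)
        return responses
```

Note that the counter is still scraped through the aggregated `/metrics` endpoint, so this is lighter-weight than full tracing but does not by itself return per-response timings to the caller.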
Is your feature request related to a problem? Please describe.
Yes, currently Triton Inference Server doesn't provide per-request inference time in the HTTP/gRPC response. This makes real-time performance monitoring and analysis less efficient, as we need to rely on aggregated metrics or separate request tracing.

Describe the solution you'd like
I'd like the server to include the exact inference time for each individual request in the HTTP/gRPC response, either in the header or the body. This would allow for immediate and precise monitoring of model performance.
Describe alternatives you've considered
I've looked into using Prometheus metrics and request tracing, but these provide aggregated data or require additional processing to recover individual request timings (see the sketch below for the aggregated route).

Additional context
Having per-request inference time directly in the response would significantly enhance real-time monitoring and optimization capabilities for Triton users.
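On the alternatives mentioned above: the aggregated timing data can also be pulled programmatically from the statistics endpoint via the client library. A rough sketch, assuming the `tritonclient` HTTP client, a server on `localhost:8000`, a placeholder model name `my_model`, and the statistics extension's usual field names (which may differ by version):

```python
# Sketch: read aggregated per-model timing from Triton's statistics API.
# These are cumulative counters (request count and total ns per stage),
# not per-request values, which is exactly the limitation discussed above.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
stats = client.get_inference_statistics(model_name="my_model")  # placeholder name

for model_stats in stats.get("model_stats", []):
    infer = model_stats.get("inference_stats", {})
    success = infer.get("success", {})
    compute = infer.get("compute_infer", {})
    if success.get("count"):
        avg_total_ms = success["ns"] / success["count"] / 1e6
        line = f'{model_stats.get("name")}: avg end-to-end {avg_total_ms:.3f} ms'
        if compute.get("count"):
            line += f', avg compute {compute["ns"] / compute["count"] / 1e6:.3f} ms'
        print(line + f' over {success["count"]} request(s)')
```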