What's the query to calculate Triton model latency per request? Is it nv_inference_request_duration_us / nv_inference_exec_count + nv_inference_queue_duration_us? #7692
We are benchmarking Triton with different backends, but we are unable to find the right metric to calculate the latency of each request (assume each request has a batch size of b).
Is request latency = rate(nv_inference_request_duration_us[1m]) / rate(nv_inference_exec_count[1m]) + nv_inference_queue_duration_us?
Does nv_inference_request_duration_us include the queuing duration as well? The documentation says it is cumulative; can anyone confirm?
Are the compute_input and compute_output durations also included in nv_inference_request_duration_us?
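For reference, this is a sketch of the per-request latency query we have been experimenting with. It assumes the cumulative duration should be divided by a per-request counter rather than nv_inference_exec_count (which counts batched executions, not requests); we are using nv_inference_request_success here on the assumption that it counts completed requests:

```promql
# Average end-to-end latency per request over the last minute, in microseconds.
# Assumes nv_inference_request_duration_us is cumulative and already includes
# queue time, and that nv_inference_request_success counts completed requests.
sum by (model) (rate(nv_inference_request_duration_us[1m]))
  /
sum by (model) (rate(nv_inference_request_success[1m]))
```

If the request duration already covers queue time, adding nv_inference_queue_duration_us on top (as in the title) would double-count it, which is what we are trying to confirm.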