triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

What's the query to calculate Triton model latency per request? Is it nv_inference_request_duration_us / nv_inference_exec_count + nv_inference_queue_duration_us? #7692

Open jayakommuru opened 3 days ago

jayakommuru commented 3 days ago

We are benchmarking Triton with different backends, but we are unable to work out which metrics to use to calculate the latency of each request (let's assume each request has a batch size of b).

  1. Is request latency = rate(nv_inference_request_duration_us[1m]) / rate(nv_inference_exec_count[1m]) + nv_inference_queue_duration_us? (Spelled out as a PromQL sketch after this list.)
  2. Does nv_inference_request_duration_us include the queuing duration as well? The documentation says it is cumulative; can anyone confirm?
  3. Are the compute_input and compute_output durations also included in nv_inference_request_duration_us?
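
To make question 1 concrete, here is a sketch of the kind of per-model PromQL expression being asked about. It assumes the metrics are scraped from Triton's Prometheus endpoint with the usual model/version labels ("my_model" is a placeholder), and that all of these metrics are cumulative counters, so averages are formed by dividing rates. Whether the queue term needs to be added on top, or is already included in nv_inference_request_duration_us, is exactly what questions 2 and 3 ask, so neither expression below is a confirmed formula.

```promql
# Candidate A: the expression from question 1, per model.
# nv_inference_queue_duration_us is also a cumulative counter, so it is given
# the same rate()/count treatment here; if the request duration already
# includes queueing (question 2), this would double-count queue time.
(
  rate(nv_inference_request_duration_us{model="my_model"}[1m])
  / rate(nv_inference_exec_count{model="my_model"}[1m])
)
+
(
  rate(nv_inference_queue_duration_us{model="my_model"}[1m])
  / rate(nv_inference_exec_count{model="my_model"}[1m])
)

# Candidate B: average end-to-end duration per request, dividing by the
# request count rather than the execution (batch) count, with no extra
# queue term.
rate(nv_inference_request_duration_us{model="my_model"}[1m])
/ rate(nv_inference_request_success{model="my_model"}[1m])
```

Note that nv_inference_exec_count counts execution batches while nv_inference_request_success counts individual requests, so with dynamic batching the two denominators can differ; which one gives "latency per request" is part of what needs confirming.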
jayakommuru commented 3 days ago

@oandreeva-nv can you help with this ?