triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Histogram Metric for multi-instance tail latency aggregation #7672

Open AshwinAmbal opened 1 month ago

AshwinAmbal commented 1 month ago

Is your feature request related to a problem? Please describe. This issue is similar to the one mentioned here: https://github.com/triton-inference-server/server/issues/7287. I'd like to file an issue for a histogram metric in Triton core. I remember this being mentioned as part of the backlog in the previous issue, but I'd like a dedicated issue for tracking purposes.

Currently it isn't possible to calculate true 95th or 99th percentile latencies when we deploy multiple Triton servers to host models. This also makes latency-based scaling decisions imprecise.

Describe the solution you'd like Expose latency metrics as Prometheus histograms rather than summaries, so that quantiles can be computed and aggregated on the server side with histogram_quantile: https://prometheus.io/docs/practices/histograms/#:~:text=The%20essential%20difference%20between%20summaries,the%20server%20side%20using%20the
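
For illustration, here is a rough sketch of the PromQL aggregation this would enable. The histogram metric name below is hypothetical; the actual name and labels would depend on how Triton core chooses to expose the histogram:

```
# Fleet-wide P95 per model: sum the per-instance bucket rates first,
# then estimate the quantile on the Prometheus server side.
# nv_inference_request_duration_us_bucket is an assumed metric name.
histogram_quantile(
  0.95,
  sum by (le, model) (
    rate(nv_inference_request_duration_us_bucket[5m])
  )
)
```

Because histogram buckets are plain counters, they can be summed across any number of replicas before the quantile is estimated, which is exactly what client-side summary quantiles cannot do.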

Describe alternatives you've considered I've tried the incorrect approach of averaging the per-server P95s across different Triton servers, but that is not the true aggregate P95 (see the sketch after the next paragraph).

I've also tried using the average queue time/latency (keeping those very low) instead, but that doesn't help with tail latencies during traffic spikes.
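
For reference, the "wrong way" mentioned above looks roughly like this with summary-based metrics. The metric and label names are assumptions modeled on Triton's summary latency option, not verified against a specific release:

```
# Averaging per-instance P95s that each summary computed client-side.
# Quantiles from different instances cannot be meaningfully averaged
# or otherwise re-aggregated.
avg(nv_inference_request_summary_us{quantile="0.95"})
```

Averaging gives equal weight to every instance regardless of how many requests it served, so a lightly loaded replica with a low P95 pulls the result down even when most traffic lands on a much slower replica.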

cc: @rmccorm4

rmccorm4 commented 1 month ago

Hi @AshwinAmbal, thanks for adding a ticket for tracking!

CC @yinggeh @harryskim @statiraju for visibility