Is your feature request related to a problem? Please describe.
This issue is similar to the one mentioned here: https://github.com/triton-inference-server/server/issues/7287. I'd like to file an issue for a histogram metric in Triton core. I remember this being mentioned as part of the backlog in the previous issue, but I'd like to have it here for tracking purposes.
Currently it isn't possible to calculate true 95th or 99th percentile latencies when we deploy multiple Triton servers to host models. This also makes scaling decisions imprecise.
Describe alternatives you've considered
I've tried the wrong way of doing it (averaging the p95 across different Triton servers), but that is not the true p95.
I've also tried keeping the average queue time/latency very low instead, but that doesn't help with tail latencies during spikes.
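To illustrate why averaging per-server p95s is wrong, here is a small sketch with simulated latencies (the latency distributions are illustrative assumptions, not real Triton data): when servers have different load profiles, the average of per-server p95s can differ substantially from the p95 of the pooled traffic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated request latencies (ms) from three servers with different
# load profiles; the exponential scales are illustrative assumptions.
servers = [
    rng.exponential(scale=s, size=10_000)
    for s in (10.0, 20.0, 80.0)
]

# Wrong: average the per-server p95s.
avg_of_p95 = np.mean([np.percentile(lat, 95) for lat in servers])

# Right: pool all samples and take a single p95.
true_p95 = np.percentile(np.concatenate(servers), 95)

print(f"avg of per-server p95: {avg_of_p95:.1f} ms")
print(f"true pooled p95:       {true_p95:.1f} ms")
```

With a slow server in the mix, the pooled p95 is dominated by its tail, so the averaged value understates the real tail latency.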
Describe the solution you'd like
Export latencies as Prometheus histograms instead of summaries, so that percentiles can be computed server-side with histogram_quantile: https://prometheus.io/docs/practices/histograms/#:~:text=The%20essential%20difference%20between%20summaries,the%20server%20side%20using%20the
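With bucketed histograms, a fleet-wide p95 can be computed by summing bucket rates across instances before applying histogram_quantile. A sketch of the PromQL (the metric name below is a hypothetical example, not an existing Triton metric):

```promql
# Aggregate bucket rates across all Triton instances, then take p95.
# nv_inference_request_duration_bucket is a hypothetical metric name.
histogram_quantile(
  0.95,
  sum by (le) (
    rate(nv_inference_request_duration_bucket[5m])
  )
)
```

This is exactly the aggregation that summary-based quantiles cannot support, since per-instance quantiles cannot be meaningfully averaged.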
cc: @rmccorm4