tikv / rust-prometheus

Prometheus instrumentation library for Rust applications
Apache License 2.0

Add MaximumOverIntervalGauge #498

Open kushudai opened 9 months ago

kushudai commented 9 months ago

This is mostly a reimplementation of https://github.com/tikv/rust-prometheus/pull/469 updated to address the main comment.

The previous use case was fine with i64 because it tracked queue sizes, which are integral. However, we've noticed noise in our latency histograms. Since a histogram_quantile of 1 is merely an approximation (or an extrapolation when there are not enough data points), we've seen "max" metrics skew heavily towards the largest bucket. We know these values are not real because the server-side timeouts are much smaller than the largest bucket but a bit bigger than the second-largest bucket. A similar but separate problem is not having enough 9s for systems that serve hundreds of thousands of RPS: P99 does not accurately reflect tail latencies for these, and adjusting the charts on a per-use-case basis is painful busywork.
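To illustrate the skew described above, here is a minimal sketch of bucket-based quantile estimation using linear interpolation. It is a simplified stand-in for what `histogram_quantile` does, not its actual implementation; the `Bucket` type and `estimate_quantile` function are hypothetical names used only for this example.

```rust
/// One cumulative histogram bucket: the number of observations
/// less than or equal to `upper_bound`. Hypothetical type for illustration.
#[derive(Debug, Clone, Copy)]
pub struct Bucket {
    pub upper_bound: f64,
    pub cumulative_count: u64,
}

/// Simplified quantile estimate: find the bucket containing the target rank
/// and interpolate linearly between its lower and upper bound.
/// `buckets` must be sorted by `upper_bound`; the first bucket is assumed
/// to start at 0.
pub fn estimate_quantile(q: f64, buckets: &[Bucket]) -> Option<f64> {
    let total = buckets.last()?.cumulative_count;
    if total == 0 || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let rank = q * total as f64;
    let mut lower_bound = 0.0;
    let mut prev_count = 0u64;
    for b in buckets {
        // Skip empty buckets; stop at the first one that reaches the rank.
        if b.cumulative_count as f64 >= rank && b.cumulative_count > prev_count {
            let in_bucket = (b.cumulative_count - prev_count) as f64;
            let fraction = (rank - prev_count as f64) / in_bucket;
            return Some(lower_bound + (b.upper_bound - lower_bound) * fraction);
        }
        lower_bound = b.upper_bound;
        prev_count = b.cumulative_count;
    }
    None
}
```

With bucket bounds of 0.1, 0.5, 1.0 and 10.0 seconds and a server-side timeout somewhere between 1.0 and 10.0, any request slower than 1.0 lands in the last bucket. Asking this sketch for quantile 1 then yields 10.0, the bucket's upper bound, even though no request could have taken that long.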

Instead of finely tuning buckets for each latency histogram metric, we'd like to report the maximum latency observed over a given time period (usually the scraping interval). This lets us put a cap on the maximum latency seen in server-side processing, which in turn lets us accurately attribute the remaining latency seen by clients to the network.
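As a rough illustration of the idea (not the code in this PR), a maximum-over-interval tracker could look like the sketch below. The `MaxOverInterval` type and its methods are hypothetical, it uses only the standard library rather than the `prometheus` crate, and it takes the current time as an argument so the behaviour is easy to follow. It is not thread-safe as written.

```rust
use std::time::{Duration, Instant};

/// Hypothetical sketch: tracks the maximum value observed in the current
/// window and exposes the maximum of the most recently completed window.
pub struct MaxOverInterval {
    interval: Duration,
    window_start: Instant,
    current_max: Option<f64>,
    previous_max: Option<f64>,
}

impl MaxOverInterval {
    pub fn new(interval: Duration, now: Instant) -> Self {
        assert!(!interval.is_zero(), "interval must be non-zero");
        Self { interval, window_start: now, current_max: None, previous_max: None }
    }

    /// Close the current window if `interval` has elapsed since it started.
    fn roll(&mut self, now: Instant) {
        let elapsed = now.saturating_duration_since(self.window_start);
        if elapsed < self.interval {
            return;
        }
        if elapsed >= self.interval.saturating_mul(2) {
            // At least one whole window passed with no activity, so the
            // most recently completed window is empty.
            self.previous_max = None;
            self.window_start = now;
        } else {
            self.previous_max = self.current_max.take();
            self.window_start += self.interval;
        }
        self.current_max = None;
    }

    /// Record one observation, e.g. a request latency in seconds.
    pub fn observe(&mut self, value: f64, now: Instant) {
        self.roll(now);
        self.current_max = Some(match self.current_max {
            Some(max) => max.max(value),
            None => value,
        });
    }

    /// Maximum of the last completed window, or `None` if it was empty.
    pub fn get(&mut self, now: Instant) -> Option<f64> {
        self.roll(now);
        self.previous_max
    }
}
```

Reporting the last completed window, rather than the window still in progress, is one possible design choice: a scrape then sees a stable value that does not depend on when within the interval it happens to land, at the cost of being up to one interval behind.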

kushudai commented 9 months ago

Hi @lucab, given that you reviewed the original PR, I was hoping you could take a look at this. Thank you!

kushudai commented 8 months ago

Hi @lucab, I apologize for the nag but I was hoping you could take a look at this one :)