proxy: tune latency histogram buckets

tobiasmuehl commented 2 years ago

What problem are you trying to solve?

The latency percentiles calculated by Linkerd Viz are highly inaccurate for my workload. I have deployed a demo service that produces a HTTP response after 5 seconds. In production, this service can take either 5000ms or 5001ms to respond. Because the closest buckets are 5000 and 10000, the p95 latency is calculated as 9750, however, not a single request takes longer than 5001.

I'm aware that this is an exact duplicate of another issue, but hopefully this adds more weight to the issue. It's something I'm facing on production on a daily basis.

How should the problem be solved?

Buckets should be sized according to expected latency. There are a lot of assumptions around latencies with developers saying things like "60 seconds ought to be enough", or "Prometheus maxes out at 10 seconds". Personally I think this is dangerous - Observability is a main pillar of linkerd and is arguable most interesting in context of services that have become slow. Using Prometheus inherently introduces limitations, so configuration should be possible by the end user.

For example, bucket configuration can be added to Service Profiles (ideally on a per route basis). It could be either a list of buckets, or, preferably, some parameters, for example:

equally sized buckets from 0 to 5 seconds with a maximum 5% measurement error
non-equally sized buckets from 0 to 5 seconds with a measurement error ranging from 1 to 10% with an expected latency of 800ms. This would result in smaller buckets around the 800ms mark.

Any alternatives you've considered?

I could enable HTTP access logs in Linkerd and try to extract the histograms out of those. It is significantly more effort as a completely separate log pipeline has to be built and is limited to HTTP requests. gRPC services won't be instrumented.

Implementing a cluster wide logging infrastructure for k8s for only this feature is a massive undertaking and I'd probably just add another reverse proxy like nginx in front of every application - adding yet another reverse proxy :(

A more convenient solution for the end user which requires significantly more engineering effort: The proxy could implement circllhist which is a merge-able data-structure for quantiles that does not need to be configured and uses very wide boundaries:

The largest representable number of the circllhist is 99 · 10127. This number is larger than the age of the universe measured in nano-seconds (13.8 billion years) The smallest representable positive number is 10 · 10−128. This number is smaller than the Plank time measured in years (5.39 · 10−44 s). We have found this value range to be sufficient for all practical purposes.

This data then needs to be scraped from the linkerd proxy and merged in a central place.

How would users interact with this feature?

No response

Would you like to work on this feature?

No response

alekhrycaiko commented 1 year ago

Hi, I'm curious if there's any intended movement on this? I've been facing a similar issue where bucket sizes are leading to inaccurate metrics being reported. I'd really love for an escape hatch to be implemented that allows for expanding the bucket sizes as per needed. Is something like that feasible?

wmorgan commented 1 year ago

Nothing is on the roadmap for this today. A Circllhist Rust library could be a great project for someone out there.

hawkw commented 1 year ago

Regarding circllhist, there's also a Rust HDRHistogram library, which is a similar dynamic histogram algorithm. However, I believe we chose not to use HDRHistogram because a dynamic histogram won't work with Prometheus, which requires the buckets of a histogram metric to remain the same over the entire lifetime of that metric. A dynamic histogram like HDRHistogram will resize buckets at runtime to ensure that the dynamic range is represented accurately, which is good...but cannot easily be exported as Prometheus metric.

Using a dynamic histogram would, I believe, mean that we would need to expose the data to Prometheus as a "summary" metric, rather than as "histogram" metrics. In a summary metric, the raw histogram bucket values are not exposed to Prometheus, and only specific quantiles are exposed (e.g., the 50th, 95th and 99th percentiles). This would permit the use of a dynamic histogram algorithm rather than a fixed one.

wmorgan commented 1 year ago

AIUI percentiles from individual proxies (whether from HDRHistogram or otherwise) cannot be merged in a statistically sound manner, hence the focus on the “mergeability” properties of circllhist algorithm. But IANAS, and I also did not evaluate the feasibility of aggregating circllhist output by e.g. Prometheus.

On Thu, Sep 7, 2023 at 8:11 PM Eliza Weisman @.***> wrote:

Regarding circllhist, there's also a Rust HDRHistogram library https://crates.io/crates/hdrhistogram, which is a similar dynamic histogram algorithm. However, I believe we chose not to use HDRHistogram because a dynamic histogram won't work with Prometheus, which requires the buckets of a histogram metric to remain the same over the entire lifetime of that metric. A dynamic histogram like HDRHistogram will resize buckets at runtime to ensure that the dynamic range is represented accurately, which is good...but cannot easily be exported as Prometheus metric.

Using a dynamic histogram would, I believe, mean that we would need to expose the data to Prometheus as a "summary" metric https://prometheus.io/docs/concepts/metric_types/#summary, rather than as "histogram" metrics https://prometheus.io/docs/concepts/metric_types/#summary. In a summary metric, the raw histogram bucket values are not exposed to Prometheus, and only specific quantiles are exposed (e.g., the 50th, 95th and 99th percentiles). This would permit the use of a dynamic histogram algorithm rather than a fixed one.

— Reply to this email directly, view it on GitHub https://github.com/linkerd/linkerd2/issues/9588#issuecomment-1710944487, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAA7OUHBA5SLC4BXR4UFHLXZJWFRANCNFSM6AAAAAARBBCTVE . You are receiving this because you commented.Message ID: @.***>

-- -William

hawkw commented 1 year ago

Ah, that's interesting, I just skimmed the abstract! I guess whether or not it's possible to aggregate it in Prometheus would be the big question.

linkerd / linkerd2