micrometer-metrics / micrometer

An application observability facade for the most popular observability tools. Think SLF4J, but for observability.
https://micrometer.io
Apache License 2.0

Micrometer's `Histogram` produces invalid Prometheus time series #120

Closed · yskopets closed this issue 6 years ago

yskopets commented 6 years ago

Issue: Prometheus defines Histogram as a composition of several time series (see https://prometheus.io/docs/concepts/metric_types/#histogram):

  1. cumulative counters for the observation buckets, exposed as _bucket{le="<upper inclusive bound>"}
  2. the total sum of all observed values, exposed as _sum
  3. the count of events that have been observed, exposed as _count (identical to _bucket{le="+Inf"} above)

Apparently, Micrometer fails to comply with requirement #1

As a result, Prometheus functions (such as https://prometheus.io/docs/querying/functions/#histogram_quantile) don't work as expected
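
(For reference, a typical query over these series looks like `histogram_quantile(0.95, rate(response_size_bytes_bucket[5m]))`, which interpolates a quantile estimate from the cumulative `_bucket` series; the 5m window is just an example.)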

Expected behaviour:

histogram = DistributionSummary.builder("response_size_bytes")
                .description("Distribution of response sizes")
                .histogram(Histogram.linear(0, 10, 20))
                .register(registry)

histogram.record(15)

would produce the following Prometheus time series:

response_size_bytes_bucket{le="0.0",} 0.0
response_size_bytes_bucket{le="10.0",} 0.0
response_size_bytes_bucket{le="20.0",} 1.0
response_size_bytes_bucket{le="30.0",} 1.0
...
response_size_bytes_bucket{le="190.0",} 1.0
response_size_bytes_bucket{le="+Inf",} 1.0
response_size_bytes_count 1.0
response_size_bytes_sum 15.0

Actual behaviour:

response_size_bytes_bucket{le="20.0",} 1.0
response_size_bytes_count 1.0
response_size_bytes_sum 15.0

Notice that only 1 bucket is present while the other 20 (including le="+Inf") are missing

Affected versions: The issue is reproducible on 0.11.0.RELEASE

checketts commented 6 years ago

@yskopets You mention the histogram_quantile isn't working as expected. Can you give some more details in that regard?

The behavior you see is an attempt to be dynamic with regard to histogram bucket creation. (Clearly that would need to be changed if it breaks Prometheus.)

jkschneider commented 6 years ago

I'd specifically like to see how histogram_quantile does not work as expected, because in my tests I found that I could add additional cumulative buckets as samples in that range were found, provided that any buckets added to the right of existing ones were initialized with counts seen prior.

What we are trying to achieve is an optimization around the number of time series created for a given histogram. This is especially important because Micrometer's special percentiles histogram collects up to 276 buckets that Netflix has empirically determined divide the real line in such a way that the error bound on the percentile approximation for many timers and distribution summaries is acceptable. It is important that we find a way to give developers server-side percentile approximations without requiring them to have a priori knowledge of the expected distribution in production (and ideally without really having to think about histograms at all).

What I think could be improved here: when you specify linear buckets explicitly, as you have, we could guarantee that those are always shipped.
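
To make the idea concrete, here is a minimal sketch of dynamic cumulative buckets (not Micrometer's actual implementation; class and method names are illustrative, and _count/_sum are omitted). A bucket is only materialized once a sample falls at or below its limit, and a bucket created to the right of existing ones starts from the cumulative count of the nearest existing bucket below it, which keeps the cumulative invariant:

import java.util.Map;
import java.util.TreeMap;

class DynamicCumulativeHistogram {
    private final double[] limits;                       // candidate bucket boundaries, ascending
    private final TreeMap<Double, Long> buckets = new TreeMap<>();

    DynamicCumulativeHistogram(double... limits) {
        this.limits = limits;
    }

    synchronized void record(double amount) {
        for (double limit : limits) {
            if (amount <= limit) {
                if (!buckets.containsKey(limit)) {
                    // a bucket materialized to the right of existing ones inherits the
                    // cumulative count of the nearest existing bucket below it
                    Map.Entry<Double, Long> lower = buckets.lowerEntry(limit);
                    buckets.put(limit, lower == null ? 0L : lower.getValue());
                }
                // bump this bucket and every existing bucket at or above it
                buckets.tailMap(limit, true).replaceAll((k, v) -> v + 1);
                break;
            }
        }
    }

    synchronized Map<Double, Long> snapshot() {
        return new TreeMap<>(buckets);                   // cumulative counts keyed by upper bound
    }
}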

jkschneider commented 6 years ago

@checketts I agree if there is a strong argument for breakage in Prometheus we will be forced to ship all the buckets to Prom. What frustrates me is why they chose cumulative buckets in the first place, since cumulative histograms can be trivially derived from normal histograms and vice versa.

I'm reassured by the fact that even if we lose the storage optimization now, we could make a strong argument for Prometheus to change to normal histograms in a later release.
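
To illustrate the "trivially derived" point above: given per-bucket counts in ascending bucket order, the two forms are just prefix sums and successive differences of each other (a small standalone snippet, not Micrometer code):

long[] perBucket  = {0, 1, 1, 0, 1};                     // non-cumulative ("normal") counts
long[] cumulative = new long[perBucket.length];
long running = 0;
for (int i = 0; i < perBucket.length; i++) {
    running += perBucket[i];
    cumulative[i] = running;                             // -> {0, 1, 2, 2, 3}
}
long[] recovered = new long[cumulative.length];
for (int i = 0; i < cumulative.length; i++) {
    recovered[i] = cumulative[i] - (i == 0 ? 0 : cumulative[i - 1]);  // back to perBucket
}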

checketts commented 6 years ago

I'm reassured by the fact that even if we lose the storage optimization now, we could make a strong argument for Prometheus to change to normal histograms in a later release.

I agree. It is ironic how many other items have been optimized, and yet the current histogram approach is a bit wasteful.

@jkschneider I haven't tried using the sparse histograms myself yet. Do they allow the user to specify their own buckets if they want to opt out of the behavior?

jkschneider commented 6 years ago

@checketts Currently no, specifying your own buckets just sets the limits on them. But it would be a trivial change to allow it. Could get sticky if we want something like what @yskopets did, manually specifying a linear range but also wanting the dynamism. What do we call the property that turns that behavior on and off?

checketts commented 6 years ago

What do we call the property that turns that behavior on and off?

I would recommend MeterRegistry.Config.useSparseHistogramBuckets(boolean) and have Prometheus default that to false.

Alternatively, Config.setHistogramBucketDensity(HistogramBucketDensity) could be a good option (to avoid booleans). HistogramBucketDensity could be an enum with values Sparse and Continuous.
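
A rough sketch of the enum-based variant; the names follow the suggestion above and are not part of any existing Micrometer API:

enum HistogramBucketDensity {
    SPARSE,     // ship only buckets that have actually observed samples
    CONTINUOUS  // always ship every configured bucket, as Prometheus expects
}

// hypothetical usage, assuming a registry config hook existed for it:
// registry.config().histogramBucketDensity(HistogramBucketDensity.CONTINUOUS);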

jkschneider commented 6 years ago

have Prometheus default that to false.

This may be a bit hasty. Let's see what the trouble is with histogram_quantile.

yskopets commented 6 years ago

@checketts A more complete example includes 3 observations:

histogram = DistributionSummary.builder("response_size_bytes")
                .description("Distribution of response sizes")
                .histogram(Histogram.linear(0, 10, 20))
                .register(registry)

histogram.record(15)
histogram.record(75)
histogram.record(5)

Expected behaviour (produced by Prometheus' simpleclient):

# HELP response_size_bytes Distribution of response sizes
# TYPE response_size_bytes histogram
response_size_bytes_bucket{le="0.0",} 0.0
response_size_bytes_bucket{le="10.0",} 1.0
response_size_bytes_bucket{le="20.0",} 2.0
response_size_bytes_bucket{le="30.0",} 2.0
response_size_bytes_bucket{le="40.0",} 2.0
response_size_bytes_bucket{le="50.0",} 2.0
response_size_bytes_bucket{le="60.0",} 2.0
response_size_bytes_bucket{le="70.0",} 2.0
response_size_bytes_bucket{le="80.0",} 3.0
response_size_bytes_bucket{le="90.0",} 3.0
response_size_bytes_bucket{le="100.0",} 3.0
response_size_bytes_bucket{le="110.0",} 3.0
response_size_bytes_bucket{le="120.0",} 3.0
response_size_bytes_bucket{le="130.0",} 3.0
response_size_bytes_bucket{le="140.0",} 3.0
response_size_bytes_bucket{le="150.0",} 3.0
response_size_bytes_bucket{le="160.0",} 3.0
response_size_bytes_bucket{le="170.0",} 3.0
response_size_bytes_bucket{le="180.0",} 3.0
response_size_bytes_bucket{le="190.0",} 3.0
response_size_bytes_bucket{le="+Inf",} 3.0
response_size_bytes_count 3.0
response_size_bytes_sum 95.0

Actual behaviour (produced by Micrometer):

# HELP response_size_bytes Distribution of response sizes
# TYPE response_size_bytes histogram
response_size_bytes_bucket{le="10.0",} 3.0
response_size_bytes_bucket{le="20.0",} 2.0
response_size_bytes_bucket{le="80.0",} 2.0
response_size_bytes_count 3.0
response_size_bytes_sum 95.0

So, a few questions (even if you intentionally omit empty buckets):

Here is another link to Prometheus documentation:

Writing Exporters (https://prometheus.io/docs/instrumenting/writing_exporters/):

Histograms are rare, if you come across one remember that the exposition format exposes cumulative values.

yskopets commented 6 years ago

Regarding histogram_quantile,

jkschneider commented 6 years ago

Looks like there is a bug on adding cumulative buckets to the left of existing samples. That will be fixed shortly. I'm afraid that the bug surfaced by this specific scenario may be providing a false impression of what histogram_quantile can do more generally. Let's retest your scenario after the fix.

Here is the scenario I've tested histogram_quantile with previously. The outputs look correct for sparsely populated cumulative histograms.

brian-brazil commented 6 years ago

I agree if there is a strong argument for breakage in Prometheus we will be forced to ship all the buckets to Prom.

You need to ship all buckets to Prometheus; to do otherwise is broken.

It's a general principle with Prometheus that time series should not appear and disappear, as that's very hard to deal with correctly. In addition, the buckets are cumulative so that, if there's an excessive number of buckets (276 is definitely excessive), the person running the Prometheus server can drop some of them without breaking anything. It's also required that the buckets be the same across all instances of a job.

jkschneider commented 6 years ago

the person running the Prometheus can drop some of them without breaking anything

Thanks for the reasoning @brian-brazil around choosing cumulative histograms. Makes sense.

I think a good compromise here will be to still provide the preselected percentiles buckets that were introduced in #72 but add a configuration option to clamp the bucket domain with a min and/or max. This effectively limits the number of buckets shipped. It satisfies Prometheus' requirement that we ship all buckets all the time cumulatively while also satisfying the separate principle that users don't want to/know how to select buckets. There was a brief discussion around clamping here. We can make this a registry-wide configuration option for Timers and DistributionSummarys separately and allow for the option to be overridden on an individual endpoint.

Also, pair this with @checketts suggestion:

I would recommend MeterRegistry.Config.useSparseHistogramBuckets(boolean) and have Prometheus default that to false.

Since the 276 buckets cover the entire positive real line, some values are of course absurdly large for timers. So we can set a sensible default maximum clamp for timers that coincides with the distinction we draw between short and long-task timers (e.g. if it's over 2 minutes, you should probably be using a LTT).

A user shipping percentile histograms to Atlas would have no reason to clamp (but it would do no harm to the approximation as long as samples fell in the clamped domain).
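
To make the proposal concrete, here is a minimal sketch of what a clamped percentile histogram could look like on a single timer; the builder method names are illustrative of the idea, not a released API:

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;
import java.time.Duration;

class ClampedTimerExample {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        Timer timer = Timer.builder("http.server.requests")
                .publishPercentileHistogram()                   // opt in to the preselected percentile buckets
                .minimumExpectedValue(Duration.ofMillis(1))     // clamp the bucket domain from below
                .maximumExpectedValue(Duration.ofMinutes(2))    // ...and above, e.g. at the short-timer/LTT boundary
                .register(registry);
        timer.record(Duration.ofMillis(42));
    }
}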

brian-brazil commented 6 years ago

add a configuration option to clamp the bucket domain with a min and/or max.

That sounds fine. There's no set-in-stone way for users to specify their buckets, so it'd be interesting to see if this is easier to work with.

allow for the option to be overridden on an individual endpoint.

Buckets need to be consistent across all of a given metric name; otherwise they can't be aggregated across all endpoints. I'm presuming "endpoint" is a label of some form here.

jkschneider commented 6 years ago

Buckets need to be consistent across all of a given metric name

I'll make a note of this in the Prometheus-specific docs. The same mechanism that would allow you to override the clamped domain (the @Timed annotation) also allows you to override the metric name and add additional tags.
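
A small illustration of that override mechanism, assuming the annotation is processed by a framework integration; the metric name and tags here are made up, and attribute names may differ by version:

import io.micrometer.core.annotation.Timed;

class CheckoutController {
    // overrides the default metric name and attaches an extra tag for this endpoint
    @Timed(value = "checkout.requests", extraTags = {"region", "us-east-1"})
    public String checkout() {
        return "ok";
    }
}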

MarkusKull commented 6 years ago

Prometheus-histograms are also described here: https://prometheus.io/docs/instrumenting/exposition_formats/ . There is another requirement which micrometer does not yet fulfil:

A histogram must have a bucket with {le="+Inf"}.

This requirement seems a bit strange, because the value is redundant to _count.

As for why Prometheus has cumulative histograms: probably because it allows calculating histograms over arbitrary (beforehand unknown) time spans by simply comparing two scraped values.
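
(For example, a query like `increase(response_size_bytes_bucket{le="20.0"}[1h])` estimates how many observations of at most 20 occurred over the last hour from the cumulative counter values.)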

brian-brazil commented 6 years ago

This requirement seems a bit strange, because the value is redundant to _count.

This is due to how PromQL works: you only pass in the _bucket time series.

As for why Prometheus has cumulative histograms: probably because it allows calculating histograms over arbitrary (beforehand unknown) time spans by simply comparing two scraped values.

That's why our counters are cumulative over time; histogram buckets being cumulative is a separate thing.

jkschneider commented 6 years ago

@MarkusKull In the in-progress work I have locally, +Inf is back in.