micrometer-metrics / micrometer

An application observability facade for the most popular observability tools. Think SLF4J, but for observability.
https://micrometer.io
Apache License 2.0

Bug: PrometheusTimer reports >0 metrics when fed only with =0 durations #3039

Closed: jakrawcz closed this issue 2 years ago

jakrawcz commented 2 years ago

Describe the bug

Histograms (and max/sum) reported by Micrometer to Prometheus indicate that some non-0 measurements have happened, even though all recorded measurements (made with void record(long amount, TimeUnit unit)) were exactly 0 nanoseconds.

Environment

  • Micrometer version: 1.0.6

To Reproduce

How to reproduce the bug:

  1. Create any Timer with a Prometheus-backed Micrometer.
  2. Report only exact 0s, from multiple threads, over a few hours (see the sketch after this list):
    • make sure to use timer.record(0, TimeUnit.NANOSECONDS)
    • for us, the effect is visible after getting ~millions of samples
  3. Look at the reported metrics, e.g. max.
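
A hedged sketch of what these steps look like in code. The class name, thread count, and iteration counts are illustrative, and the plain PrometheusMeterRegistry setup is an assumption; publishPercentileHistogram() is included because our timers export histogram buckets.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import io.micrometer.core.instrument.Timer;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

public class ZeroDurationTimerRepro {

    public static void main(String[] args) throws InterruptedException {
        // step 1: any Timer with a Prometheus-backed Micrometer
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
        Timer timer = Timer.builder("operations.timer")
                .tag("operation", "something_that_always_takes_zero_nanos")
                .publishPercentileHistogram()
                .register(registry);

        // step 2: report only exact 0s, from multiple threads
        int threads = 8;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                for (int j = 0; j < 1_000_000; j++) {
                    timer.record(0, TimeUnit.NANOSECONDS);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);

        // step 3: look at the reported metrics; max and sum should stay 0
        System.out.println(registry.scrape());
    }
}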

Expected behavior

The chart of the recorded max should be a constant 0.

Actual behavior

The chart has random bumps significantly above 0 (e.g. 2 seconds). Similarly to _max, the _sum should also be 0, but in reality it just grows monotonically.

checketts commented 2 years ago

Can you please include the queries you are using as well as the Prometheus scrape frequency?

jakrawcz commented 2 years ago

Of course:

We observe this even on the most basic "just get one specific raw timeseries" queries:

(screenshot of the raw timeseries query)

Our scraping frequency is 30s.

If we want to exclude the "scraping" aspect (or rather, the Prometheus server side altogether), we can go to the exposition endpoint directly:

operations_timer_seconds_bucket{operation="something_that_always_takes_zero_nanos",le="0.001",} 2564657.0
operations_timer_seconds_bucket{operation="something_that_always_takes_zero_nanos",le="0.001048576",} 2564657.0
...
operations_timer_seconds_bucket{operation="something_that_always_takes_zero_nanos",le="2.505397588",} 2568322.0
operations_timer_seconds_bucket{operation="something_that_always_takes_zero_nanos",le="2.863311529",} 2568326.0
...
operations_timer_seconds_bucket{operation="something_that_always_takes_zero_nanos",le="28.633115306",} 2568326.0
operations_timer_seconds_bucket{operation="something_that_always_takes_zero_nanos",le="30.0",} 2568326.0
operations_timer_seconds_bucket{operation="something_that_always_takes_zero_nanos",le="+Inf",} 2568326.0
operations_timer_seconds_count{operation="something_that_always_takes_zero_nanos",} 2568326.0
operations_timer_seconds_sum{operation="something_that_always_takes_zero_nanos",} 2106.810172127

It looks like some actual samples were invented or misinterpreted (e.g. the cumulative bucket counts imply 4 samples between le="2.505397588" and le="2.863311529": 2568326 - 2568322 = 4). And, as I said, we have verified from several angles (logs, an additional counter metric, an independent other histogram) that we only ever call timer.record(0, TimeUnit.NANOSECONDS).

checketts commented 2 years ago

Do you see this same behavior when not using histograms? I wonder if the HDR histogram logic is the problem. Unfortunately I don't have much experience with the histogram logic/recordings.
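
For concreteness, "not using histograms" would mean registering the timer without the percentile histogram, so only the count, sum, and max series are exported. A sketch, with the metric and tag names borrowed from the exposition output above and the registry assumed to be the same PrometheusMeterRegistry:

Timer plainTimer = Timer.builder("operations.timer")
        .tag("operation", "something_that_always_takes_zero_nanos")
        .register(registry);   // no publishPercentileHistogram(), so no _bucket series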

jonatan-ivanov commented 2 years ago

@jakrawcz Could you please create a minimal Java app that reproduces the issue? I tried to reproduce it with this registry:

PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

and this timer:

Timer timer = Timer.builder("test.timer")
        .tag("operation", "something_that_always_takes_zero_nanos")
        .publishPercentileHistogram()
        .register(registry);

doing this 100 million times:

timer.record(0, TimeUnit.NANOSECONDS);

and I got the expected result:

test_timer_seconds_max{operation="something_that_always_takes_zero_nanos",} 0.0
test_timer_seconds_count{operation="something_that_always_takes_zero_nanos",} 1.0E8
test_timer_seconds_sum{operation="something_that_always_takes_zero_nanos",} 0.0
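
For completeness, the exposition text above can be dumped straight from the registry after the loop; a small assumed addition to the snippets:

// after the 100 million record() calls, print the Prometheus exposition text
System.out.println(registry.scrape());
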
jakrawcz commented 2 years ago

and I got the expected result

Yes, unfortunately, in my local unit tests I could not reproduce it at all. I tried with multiple threads, and I tried while concurrently reading out the metric, and I still could not see the bug.

Even worse, we have not seen the problem in our test environment deployment (which is supposed to look exactly like the production deployment, but of course is not hammered with the same heavy QPS of incoming requests).

So: this makes the issue very specific to our application AND to our production usage patterns :(

@checketts

Do you see this same behavior when not using histograms?

We have already switched to a different Timer implementation (after seeing that it fixed our problem). Since the only environment in which we can reproduce this is production, we cannot really play with switching back to Micrometer for experiments (i.e. not without risk).

@jonatan-ivanov

I really regret that we were not able to deliver a minimal app reproducing the problem. I plan to try again, since I now see one more potential cause: an interaction with some other registered Timer that receives non-0-duration samples, which maybe sometimes get wrongly routed to the something_that_always_takes_zero_nanos timer. I am going to test that (see the sketch below).
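
To make the hypothesis concrete, roughly this kind of setup; the second timer and its tag value are hypothetical:

// the timer that only ever sees 0 ns
Timer zeroTimer = Timer.builder("operations.timer")
        .tag("operation", "something_that_always_takes_zero_nanos")
        .publishPercentileHistogram()
        .register(registry);

// a hypothetical sibling timer: same name, different tag, real durations
Timer otherTimer = Timer.builder("operations.timer")
        .tag("operation", "something_that_actually_takes_time")
        .publishPercentileHistogram()
        .register(registry);

otherTimer.record(2, TimeUnit.SECONDS);     // could samples like this ...
zeroTimer.record(0, TimeUnit.NANOSECONDS);  // ... ever end up in zeroTimer's buckets?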

This issue was created in the hope that "maybe some other user has already seen something similar", "maybe it rings a bell for some developer", or "maybe there is a common, known misconfiguration that causes this". I acknowledge that without a reliable way to reproduce it, no one can proceed with an actual investigation of a newly discovered bug.

Anyway, if I make any progress investigating on our side, I will definitely report back.

shakuzen commented 2 years ago

@jakrawcz did you have any luck reproducing the issue outside of your production environment? Could you share more details about how the timers are being registered/retrieved in your application? You aren't perhaps reusing a Timer.Builder instance somewhere?
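
To illustrate the question, a hedged sketch of the two patterns (the names here are made up):

// one-shot, method-chained builder use
Timer ok = Timer.builder("operations.timer")
        .tag("operation", "a")
        .register(registry);

// reusing a single Timer.Builder instance for several meters; the builder is
// mutable, so tags and configuration added for one meter carry over to the next
Timer.Builder shared = Timer.builder("operations.timer");
Timer first = shared.tag("operation", "a").register(registry);
Timer second = shared.tag("operation", "b").register(registry);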

jakrawcz commented 2 years ago

@shakuzen I've just re-checked: all our Timer.Builders are used exactly once (in the classic method-chain "builder" fashion).

I still wasn't able to repro this in a controlled way. I also don't have any new datapoints from our production env (because we already swapped to a different timer impl). It looks like the issue was very specific to our system, and we have "fixed" it, and no one else really has this problem - in which case, please feel free to just close this issue :+1:

shakuzen commented 2 years ago

Thank you for getting back to us, even though it's no longer a problem for you. I see in your original description that you were on version 1.0.6. That version predates our disabling the PauseDetector by default due to https://github.com/micrometer-metrics/micrometer/issues/844, and the PauseDetector likely explains the behavior you were seeing. If you did want to use Micrometer Timers again, the issue shouldn't happen in the latest, supported versions unless you explicitly configure a PauseDetector (see the sketch below). Sorry I didn't notice the old version sooner and think of this, but hopefully that explains the cause.
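
For reference, in current versions a pause detector only comes into play if you configure one explicitly, roughly like this; the ClockDriftPauseDetector values below are just example numbers, not a recommendation:

import java.time.Duration;

import io.micrometer.core.instrument.distribution.pause.ClockDriftPauseDetector;

// registry-wide: picked up by timers registered afterwards
registry.config().pauseDetector(
        new ClockDriftPauseDetector(Duration.ofMillis(100), Duration.ofMillis(100)));

// or per timer
Timer timer = Timer.builder("operations.timer")
        .pauseDetector(new ClockDriftPauseDetector(Duration.ofMillis(100), Duration.ofMillis(100)))
        .register(registry);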

jakrawcz commented 2 years ago

From https://github.com/micrometer-metrics/micrometer/issues/844#issuecomment-801033469:

Letting the pause detector (or anything else) add additional samples causes confusion to users looking at the counts and rates.

Wooooow, @shakuzen, that's it! The pause detector really could be the thing affecting the behavior in production (where significant pauses can occur, unlike in local "minimal repro" scripts)!

So, our primary problem was "we forgot to upgrade the micrometer dep version for 2 years" :) I am very glad that what we saw has a believable explanation :+1: