Closed: jakrawcz closed this issue 2 years ago
Can you please include the queries you are using as well as the Prometheus scrape frequency?
Of course:
We observe this even on the most basic "just get one specific raw timeseries" queries.
Our scraping frequency is 30s.
If we want to exclude the "scraping" aspect (or rather, the Prometheus server side altogether), we can go to the exposition endpoint directly:
operations_timer_seconds_bucket{operation="something_that_always_takes_zero_nanos",le="0.001",} 2564657.0
operations_timer_seconds_bucket{operation="something_that_always_takes_zero_nanos",le="0.001048576",} 2564657.0
...
operations_timer_seconds_bucket{operation="something_that_always_takes_zero_nanos",le="2.505397588",} 2568322.0
operations_timer_seconds_bucket{operation="something_that_always_takes_zero_nanos",le="2.863311529",} 2568326.0
...
operations_timer_seconds_bucket{operation="something_that_always_takes_zero_nanos",le="28.633115306",} 2568326.0
operations_timer_seconds_bucket{operation="something_that_always_takes_zero_nanos",le="30.0",} 2568326.0
operations_timer_seconds_bucket{operation="something_that_always_takes_zero_nanos",le="+Inf",} 2568326.0
operations_timer_seconds_count{operation="something_that_always_takes_zero_nanos",} 2568326.0
operations_timer_seconds_sum{operation="something_that_always_takes_zero_nanos",} 2106.810172127
It looks like some actual samples were invented/misinterpreted (e.g. there are 4 samples between the 2.505397588 and 2.863311529 buckets) - and, as I said, we have verified from a few angles (logs, an additional counter metric, an independent other histogram) that we only ever call timer.record(0, TimeUnit.NANOSECONDS).
Do you see this same behavior when not using histograms? I wonder if the HDR histogram logic is the problem. Unfortunately I don't have much experience with the histogram logic/recordings.
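For instance, a timer registered without percentile histograms would look something like this (a sketch, not code from the application; the metric name is inferred from the exposition output above):

```java
// Sketch: the same timer but without .publishPercentileHistogram(),
// so only _count, _sum and _max are published.
// If the spurious non-zero values disappear with this variant,
// the histogram recording path would be the main suspect.
PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

Timer plainTimer = Timer.builder("operations.timer")
        .tag("operation", "something_that_always_takes_zero_nanos")
        .register(registry);

plainTimer.record(0, TimeUnit.NANOSECONDS);
```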
@jakrawcz Could you please create a minimal Java app that reproduces the issue? I tried to reproduce it with this registry:
PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
and this timer:
Timer timer = Timer.builder("test.timer")
.tag("operation", "something_that_always_takes_zero_nanos")
.publishPercentileHistogram()
.register(registry);
doing this 100 million times:
timer.record(0, TimeUnit.NANOSECONDS);
and I got the expected result:
test_timer_seconds_max{operation="something_that_always_takes_zero_nanos",} 0.0
test_timer_seconds_count{operation="something_that_always_takes_zero_nanos",} 1.0E8
test_timer_seconds_sum{operation="something_that_always_takes_zero_nanos",} 0.0
and I got the expected result
Yes, unfortunately, in my local unit tests I could not reproduce it at all. I tried with multiple threads, and I tried it while concurrently doing read-outs of the metric (roughly along the lines sketched below) - and I still could not see the bug.
Even worse - we have not seen the problem on our test environment deployment (which is supposed to look exactly like the production deployment, but of course is not hammered with the same heavy QPS of incoming requests).
So: this makes the issue very specific to our application AND to our production usage patterns :(
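For reference, my local repro attempts looked roughly like this (a sketch, not the exact test code; the thread count, iteration count and scrape interval are illustrative):

```java
import io.micrometer.core.instrument.Timer;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ZeroNanosRepro {
    public static void main(String[] args) throws Exception {
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

        Timer timer = Timer.builder("operations.timer")
                .tag("operation", "something_that_always_takes_zero_nanos")
                .publishPercentileHistogram()
                .register(registry);

        // Several writer threads hammering the timer with 0 ns samples.
        ExecutorService writers = Executors.newFixedThreadPool(8);
        for (int t = 0; t < 8; t++) {
            writers.submit(() -> {
                for (int i = 0; i < 10_000_000; i++) {
                    timer.record(0, TimeUnit.NANOSECONDS);
                }
            });
        }

        // One reader thread concurrently scraping, mimicking Prometheus' periodic scrape
        // (much shorter interval here so the test finishes quickly).
        Thread reader = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                registry.scrape();
                try {
                    Thread.sleep(100);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        reader.start();

        writers.shutdown();
        writers.awaitTermination(10, TimeUnit.MINUTES);
        reader.interrupt();
        reader.join();

        // Inspect the buckets: every le="..." bucket should report the same count.
        System.out.println(registry.scrape());
    }
}
```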
@checketts
Do you see this same behavior when not using histograms?
We have already switched to a different Timer impl (after seeing that it fixes our problem). Since the only environment in which we can reproduce this is production, we cannot really play with switching back to Micrometer for experiments (i.e. not without risk).
@jonatan-ivanov
I really regret that we were not able to deliver a minimal app reproducing the problem. I plan to try again, since I now see one more potential cause: an interaction with some other registered Timer, which is getting non-0-duration samples that maybe sometimes get wrongfully routed to the something_that_always_takes_zero_nanos one. Gonna test that.
This issue was created with the hope that "maybe some other user has seen something similar already", or "maybe it rings a bell for some developer", or "maybe there is a common and known misconfiguration that causes this". I acknowledge that without a reliable way to repro, no one can proceed with an actual investigation of a newly discovered bug.
Anyways, if I get any progress on investigating from our side, I will definitely report back.
@jakrawcz did you have any luck reproducing the issue outside of your production environment?
Could you share more details about how the timers are being registered/retrieved in your application?
You aren't perhaps reusing a Timer.Builder instance somewhere?
@shakuzen I've just re-checked: all our Timer.Builders are used exactly once (in a classic method-chain "builder" fashion).
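To illustrate the distinction being asked about (a sketch with illustrative names, not our actual code):

```java
// Assuming the PrometheusMeterRegistry shown earlier in the thread:
PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);

// Used exactly once, in a classic method-chain fashion - one builder per timer:
Timer okTimer = Timer.builder("operations.timer")
        .tag("operation", "something_that_always_takes_zero_nanos")
        .publishPercentileHistogram()
        .register(registry);

// The pattern being asked about: a Timer.Builder kept around and reused.
// The builder is mutable, so tags accumulate across registrations:
Timer.Builder shared = Timer.builder("other.timer").publishPercentileHistogram();
Timer a = shared.tag("operation", "op_a").register(registry);
Timer b = shared.tag("region", "eu").register(registry); // ends up with operation=op_a AND region=eu
```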
I still wasn't able to repro this in a controlled way. I also don't have any new datapoints from our production env (because we already swapped to a different timer impl). It looks like the issue was very specific to our system, and we have "fixed" it, and no one else really has this problem - in which case, please feel free to just close this issue :+1:
Thank you for getting back to us, even though it's no longer a problem for you. I see in your original description that you were on version 1.0.6. That version is from before we disabled the PauseDetector by default due to https://github.com/micrometer-metrics/micrometer/issues/844. The PauseDetector likely explains the behavior you were seeing. If you did want to use Micrometer Timers again, the issue shouldn't happen in the latest, supported versions unless you explicitly configure a PauseDetector. Sorry I didn't notice the old version sooner and think of this, but hopefully that explains the cause.
From https://github.com/micrometer-metrics/micrometer/issues/844#issuecomment-801033469:
Letting the pause detector (or anything else) add additional samples causes confusion to users looking at the counts and rates.
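For reference, explicitly configuring a pause detector looks something like this (a sketch; the detector intervals are illustrative):

```java
// Sketch: explicitly configuring a pause detector on the registry.
// ClockDriftPauseDetector lives in io.micrometer.core.instrument.distribution.pause;
// Duration is java.time.Duration.
PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
registry.config().pauseDetector(
        new ClockDriftPauseDetector(Duration.ofMillis(100), Duration.ofMillis(100)));

// Without such explicit configuration, recent versions default to a no-op detector,
// so no compensation samples are added to timers.
```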
Wooooow, @shakuzen, that's it! The pause detector really could be the thing affecting the behavior in production! (where significant pauses can occur, unlike in local "minimal repro" scripts)
So, our primary problem was "we forgot to upgrade the micrometer dep version for 2 years" :) I am very glad that what we saw has a believable explanation :+1:
Describe the bug
Histograms (and max/sum) reported by Micrometer to Prometheus indicate that some non-0 measurements have happened, even though all recorded measurements (made with void record(long amount, TimeUnit unit)) were exactly 0 nanoseconds.

Environment
Micrometer version 1.0.6.

To Reproduce
How to reproduce the bug: repeatedly call timer.record(0, TimeUnit.NANOSECONDS).

Expected behavior
The chart of the max recorded value should be a constant 0.

Actual behavior
The chart has random bumps significantly above 0 (e.g. 2 seconds). Similarly to _max, the _sum also should be 0, but in reality it just monotonically grows.

Additional context
... 0, as expected.