Open shakuzen opened 2 years ago
The code was originally added with https://github.com/micrometer-metrics/micrometer/commit/9fa83e3d3638edf725aa5d803ae0b52b9017456f. I am not sure but I think the intention is to make the histogram buckets monotonic across Prometheus scrapes so that when a task completes, buckets do not decrease in value. @jkschneider can you confirm?
Unfortunately, this doesn't work because multiple Prometheus instances can scrape in the same interval, which causes bucket counts to double each scrape when the number of active tasks is the same. The difference will not be consistent between scrapes for the same Prometheus instance, either. Here are two tests. The first demonstrating the behavior we want, which passes. The second fails, which shows the problematic behavior with multiple scrapes (calls to takeSnapshot
).
class CumulativeHistogramLongTaskTimerTest {
MockClock clock = new MockClock();
MeterRegistry registry = new SimpleMeterRegistry(SimpleConfig.DEFAULT, clock) {
@Override
protected LongTaskTimer newLongTaskTimer(Meter.Id id, DistributionStatisticConfig distributionStatisticConfig) {
return new CumulativeHistogramLongTaskTimer(id, clock, getBaseTimeUnit(), distributionStatisticConfig);
}
};
@Test
void bucketCountsAreMonotonic_whenTasksComplete() {
LongTaskTimer ltt = LongTaskTimer.builder("my.ltt").publishPercentileHistogram().register(registry);
LongTaskTimer.Sample sample = ltt.start();
clock.add(1, TimeUnit.MINUTES);
CountAtBucket[] countAtBuckets = ltt.takeSnapshot().histogramCounts();
assertThat(countAtBuckets[0].bucket(TimeUnit.MINUTES)).isEqualTo(2);
assertThat(countAtBuckets[0].count()).isOne();
sample.stop();
// bucket counts after task completed should not decrease
countAtBuckets = ltt.takeSnapshot().histogramCounts();
assertThat(countAtBuckets[0].count()).isOne();
}
@Test
@Issue("#2744")
void bucketCountsStableWhenMultipleSnapshotsTaken() {
LongTaskTimer ltt = LongTaskTimer.builder("my.ltt").publishPercentileHistogram().register(registry);
// one task at 20 minutes, one task at 5 minutes
ltt.start();
clock.add(15, TimeUnit.MINUTES);
ltt.start();
clock.add(5, TimeUnit.MINUTES);
// the snapshot should be the same since the tasks and clock do not change
for (int i = 0; i < 3; i++) {
CountAtBucket[] countAtBuckets = ltt.takeSnapshot().histogramCounts();
int index = 0;
while (countAtBuckets[index].bucket(TimeUnit.NANOSECONDS) < Duration.ofMinutes(5).toNanos()) {
assertThat(countAtBuckets[index++].count()).isZero();
}
while (countAtBuckets[index].bucket(TimeUnit.NANOSECONDS) < Duration.ofMinutes(20).toNanos()) {
assertThat(countAtBuckets[index++].count()).isOne();
}
while (index < countAtBuckets.length) {
assertThat(countAtBuckets[index++].count()).isEqualTo(2);
}
}
}
}
It seems like we would need some time windowing to not add the previous interval until one full interval has passed, but this still is prone to inconsistency. Different scrapes may happen at different times in the interval with a different number of active tasks, but only one scrape's snapshot will be used as the basis to add in the next interval.
Alternatively, GaugeHistogram is exactly what we want and doesn't require the logic of adding the previous snapshot to maintain monotonicity. Unfortunately, GaugeHistogram
is not available in Prometheus' text format; it's only defined in OpenMetrics.
I'm noticing behavior that doesn't look right related to this. If I scrape the prometheus endpoint while there is an active task tracked with a LongTaskTimer and histogram enabled, the bucket counts double with each scrape. All other stats related to the LongTaskTimer (
+Inf
bucket, active count, duration sum, max) do not change over scrapes, and they go to zero when there are no active tasks. The histogram bucket counts (except for+Inf
) do not go to zero.https://github.com/micrometer-metrics/micrometer/blob/f04bb3218c713b555d2bd3737d940c88fd12808a/micrometer-core/src/main/java/io/micrometer/core/instrument/internal/CumulativeHistogramLongTaskTimer.java#L59
Originally posted by @shakuzen in https://github.com/micrometer-metrics/micrometer/pull/2075#r687514745