micrometer-metrics / micrometer

An application observability facade for the most popular observability tools. Think SLF4J, but for observability.
https://micrometer.io
Apache License 2.0
4.39k stars 965 forks source link

Timer max expiry inconsistent with percentiles expiry #3298

Open gdabski opened 1 year ago

gdabski commented 1 year ago

Describe the bug A single recorded time (sample) affects a Timer's max() for a longer time than percentiles produced by the Timer. With default DistributionStatisticConfig the effective expiry is one minute for the percentiles and three minutes for max. The issue is related to how TimeWindowMax and AbstractTimeWindowHistogram interpret the value of DistributionStatisticConfig.expiry. Both use ring buffers of same size to implement the decay, but while the former only moves by one buffer position in intervals equal to expiry, the latter is implemented to do a full rotation of the buffer in the same time.

Environment

To Reproduce

    @Test
    void demonstratesDifferentExpiryForMaxAndPercentiles() throws InterruptedException {
        MeterRegistry meterRegistry = new SimpleMeterRegistry();
        Timer timer = Timer.builder("test").publishPercentiles(new double[]{0.5}).register(meterRegistry);
        Duration checkInterval = Duration.ofSeconds(30);

        timer.record(Duration.ofMillis(200));

        Duration testDuration = Duration.ZERO;
        var max = timer.max(MILLISECONDS);
        var snapshot = timer.takeSnapshot();

        while (max != 0) {
            System.out.printf("%ss: %s max %.0f%n", testDuration.toSeconds(), snapshot.percentileValues()[0], max);

            Thread.sleep(checkInterval.toMillis());

            testDuration = testDuration.plus(checkInterval);
            max = timer.max(MILLISECONDS);
            snapshot = timer.takeSnapshot();
        }

        System.out.printf("%ss: %s max %.0f%n", testDuration.toSeconds(), snapshot.percentileValues()[0], max);
    }

Prints:

0s: (1.92937984E8 at 50.0%) max 200
30s: (1.92937984E8 at 50.0%) max 200
60s: (0.0 at 50.0%) max 200
90s: (0.0 at 50.0%) max 200
120s: (0.0 at 50.0%) max 200
150s: (0.0 at 50.0%) max 200
180s: (0.0 at 50.0%) max 0

Expected behaviour A single sample ceases to affect percentiles and timer max at the same point in time.

Related issues In #2751 there is a complaint about max not expiring in expected time, but response was that the TimeWindowMax uses expiry (and bufferLength) from DistributionStatisticConfig right.

rafalh commented 5 days ago

I just stepped onto this issue. I think I found the root cause - it is the way how durationBetweenRotatesMillis is initialized. For some reasons expiry time from configuration is divided there by bufferLength (ageBuckets variable) unlike in TimeWindowMax where expiry time is taken directly. This means that effective expiry time (when metric is reset to 0 after single request) for max metric is expiry * bufferLength but for percentiles it is only expiry. AFAIK documentation does not reflect this difference and for me it is confusing so I'm assuming it is a bug. For the default configuration where expiry/step is set to 1 minute and bufferLength is set to 3, assuming metrics are scrapped every minute, in unfortunate case when scrapping occurs immediately after metrics are rotated, requests from 20 seconds window out of 1 minute (33% of data) are not taken into account in percentile metrics at all... I'm planning to create a PR changing durationBetweenRotatesMillis initialization.