micrometer-metrics / micrometer

An application observability facade for the most popular observability tools. Think SLF4J, but for observability.
https://micrometer.io
Apache License 2.0

Fix histogram consistency in PrometheusMeterRegistry #5193

Closed rhatlapa closed 3 months ago

rhatlapa commented 3 months ago

As part of the upgrade to Spring Boot 3.3 with Micrometer 1.13.0, we are seeing an issue similar to https://github.com/micrometer-metrics/micrometer/issues/4988; see the stacktrace below. I wasn't able to create a separate reproducer, but I can confirm that the issue is in Micrometer: when using Spring Boot 3.3 with Micrometer downgraded to 1.12.6, the issue is no longer visible.

Stacktrace:

j.l.IllegalArgumentException: Counts in ClassicHistogramBuckets cannot be negative.
    at i.p.m.m.s.ClassicHistogramBuckets.validate(ClassicHistogramBuckets.java:147)
    at i.p.m.m.s.ClassicHistogramBuckets.sortAndValidate(ClassicHistogramBuckets.java:108)
    at i.p.m.m.s.ClassicHistogramBuckets.of(ClassicHistogramBuckets.java:53)
    at i.m.p.PrometheusMeterRegistry.lambda$addDistributionStatisticSamples$29(PrometheusMeterRegistry.java:504)
    at i.m.p.MicrometerCollector.collect(MicrometerCollector.java:77)
    at i.p.m.m.r.PrometheusRegistry.scrape(PrometheusRegistry.java:72)
    at i.p.m.m.r.PrometheusRegistry.scrape(PrometheusRegistry.java:57)
    at o.s.b.a.m.e.p.PrometheusScrapeEndpoint.scrape(PrometheusScrapeEndpoint.java:59)
    at j.i.r.GeneratedMethodAccessor128.invoke(Unknown Source)
    at j.i.r.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at j.l.reflect.Method.invoke(Method.java:568)
    at o.s.u.ReflectionUtils.invokeMethod(ReflectionUtils.java:281)
    at o.s.b.a.e.i.r.ReflectiveOperationInvoker.invoke(ReflectiveOperationInvoker.java:74)
    at o.s.b.a.e.a.AbstractDiscoveredOperation.invoke(AbstractDiscoveredOperation.java:60)
    at o.s.b.a.e.w.s.AbstractWebMvcEndpointHandlerMapping$ServletWebOperationAdapter.handle(AbstractWebMvcEndpointHandlerMapping.java:327)
    at o.s.b.a.e.w.s.AbstractWebMvcEndpointHandlerMapping$OperationHandler.handle(AbstractWebMvcEndpointHandlerMapping.java:434)
    at j.i.r.GeneratedMethodAccessor110.invoke(Unknown Source)
    at j.i.r.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at j.l.reflect.Method.invoke(Method.java:568)
    at o.s.w.m.s.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:255)
    at o.s.w.m.s.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:188)
    at o.s.w.s.m.m.a.ServletInvocableHandlerMethod.invokeAndHandle(ServletInvocableHandlerMethod.java:118)
    at o.s.w.s.m.m.a.RequestMappingHandlerAdapter.invokeHandlerMethod(RequestMappingHandlerAdapter.java:926)
    at o.s.w.s.m.m.a.RequestMappingHandlerAdapter.handleInternal(RequestMappingHandlerAdapter.java:831)
    at o.s.w.s.m.m.AbstractHandlerMethodAdapter.handle(AbstractHandlerMethodAdapter.java:87)
    at o.s.b.a.a.w.s.CompositeHandlerAdapter.handle(CompositeHandlerAdapter.java:58)
    at o.s.w.s.DispatcherServlet.doDispatch(DispatcherServlet.java:1089)
    at o.s.w.s.DispatcherServlet.doService(DispatcherServlet.java:979)
    at o.s.w.s.FrameworkServlet.processRequest(FrameworkServlet.java:1014)
    at o.s.w.s.FrameworkServlet.doGet(FrameworkServlet.java:903)
    ... 29 frames truncated

Note: our application is a standard REST application using Mongo & Feign clients to make requests against external endpoints. It also creates custom metrics with percentile histograms like this:

        meterRegistry.counter(counter_metric_name, tags.stream().collect(Collectors.toList())).increment();
        Timer.builder(timer_metric_name)
                .tags(tags.stream().collect(Collectors.toList()))
                .publishPercentileHistogram()
                .register(meterRegistry).record(duration);

Note: we use Micrometer with Prometheus and run on Java 17.0.11.

I wasn't able to reproduce the issue outside of the production environment. Any suggestions for a fix or what could have caused it?

knoobie commented 3 months ago

I just had the exact same problem. Interesting side note: the exception only occurred once while scraping for us, meaning the next scrape 15 seconds later no longer had the problem. Unlike my other reported issue, which breaks scraping completely.

nadina-f commented 3 months ago

Hello, after upgrading my project from Spring Boot 3.2.5 to 3.3.0 and Micrometer from 1.12.5 to 1.13.0, I have the same issue. In theory this should have been fixed in 1.13.0 as described here: https://github.com/micrometer-metrics/micrometer/commit/d7b9d247df9b10a2834479ee0936cafa7588e5b8, but it's not.

Do you have any plans for a release soon that also covers this bug? Thank you!

jonatan-ivanov commented 3 months ago

Workaround

It seems this is only an issue with the Prometheus 1.x client; the 0.x client does not seem to have it. As a workaround, you can downgrade to the 0.x client by using micrometer-registry-prometheus-simpleclient instead of micrometer-registry-prometheus. Please see the release notes of Micrometer 1.13.0 as well as the Migration Guide. In a nutshell: we upgraded micrometer-registry-prometheus from Prometheus Client 0.x to 1.x but also kept the old client as-is under a different artifact: micrometer-registry-prometheus-simpleclient.
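
For anyone wiring up the registry manually (outside Spring Boot auto-configuration), here is a minimal sketch of what the workaround looks like in code, assuming, per the migration guide, that the simpleclient artifact keeps the old io.micrometer.prometheus package while micrometer-registry-prometheus 1.13+ uses io.micrometer.prometheusmetrics:

import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

// Sketch only: with micrometer-registry-prometheus-simpleclient on the classpath,
// the registry classes come from the old io.micrometer.prometheus package,
// backed by the Prometheus 0.x client, which is not affected by this scrape failure.
public class SimpleclientWorkaroundDemo {

    public static void main(String[] args) {
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
        registry.counter("demo.counter").increment();
        System.out.println(registry.scrape()); // Prometheus text format produced by the 0.x client
    }
}

For Spring Boot applications, swapping the dependency should be enough; Boot 3.3 still ships auto-configuration for the simpleclient-based registry.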

Investigation Notes

Based on the additional data in https://github.com/micrometer-metrics/micrometer/issues/5223 (thanks to @VladimirZaitsev21), I think the issue is similar to https://github.com/micrometer-metrics/micrometer/issues/4988 but definitely seems to be a different one: in https://github.com/micrometer-metrics/micrometer/issues/4988 the issue was limited to LongTaskTimer, while this one is happening for Timer (and I guess also for DistributionSummary).

Based on the data from https://github.com/micrometer-metrics/micrometer/issues/5223, it seems the issue is caused by an inconsistency between count and the last bucket:

HistogramSnapshot{
   count=575660,
   ...
   histogramCounts=[
       ...
       (575661.0 at 5.0E9)
   ]
}

As you can see here, count is 575660 while the last bucket is 575661.0. This should not happen: count should never be less than the last bucket. I'm not sure whether this situation can also occur with the Prometheus 0.x client (I think it can), but there it does not cause a problem since we don't need to calculate the +Inf bucket; with the Prometheus 1.x client we do, see: https://github.com/micrometer-metrics/micrometer/blob/f4be539c2e2d5759bd1774ffb103569c025f3aad/implementations/micrometer-registry-prometheus/src/main/java/io/micrometer/prometheusmetrics/PrometheusMeterRegistry.java#L494-L498
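
To make that concrete, here is a minimal sketch of the arithmetic using the numbers from the snapshot above, assuming the +Inf count is derived as the total count minus the last finite cumulative bucket (the class and variable names are mine, not the code at the linked lines):

// Illustrative only: why an inconsistent snapshot trips the 1.x client's validation.
public class InfBucketArithmetic {

    public static void main(String[] args) {
        long count = 575_660;                // HistogramSnapshot count
        double lastFiniteBucket = 575_661.0; // cumulative count of the largest finite bucket (5.0E9)

        // The 1.x client needs an explicit +Inf bucket covering everything the finite
        // buckets do not. If the bucket was incremented before the count, this goes
        // negative and ClassicHistogramBuckets rejects it as a negative count.
        double infBucket = count - lastFiniteBucket;
        System.out.println("+Inf bucket = " + infBucket); // prints -1.0
    }
}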

If I had to guess, this is a timing/concurrency issue with the histogram implementation we have for Prometheus: count and the last bucket are sometimes inconsistent (e.g. a scrape happens during a recording, and the bucket has already been incremented while the count has not).

jonatan-ivanov commented 3 months ago

I'm still investigating, but it seems the issue is what I described above (an inconsistency between count and the buckets that was not surfaced with the 0.x client). Here's a reproducer if you want to play with it:

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import io.micrometer.core.instrument.Timer;
import io.micrometer.prometheusmetrics.PrometheusConfig;
import io.micrometer.prometheusmetrics.PrometheusMeterRegistry;

public class HistogramDemo {
    static PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
    static Timer timer = Timer.builder("test").publishPercentileHistogram().register(registry);

    static ExecutorService scrapeExecutor = Executors.newFixedThreadPool(16);
    static ExecutorService recordExecutor = Executors.newFixedThreadPool(16);
    static CountDownLatch latch = new CountDownLatch(1);

    public static void main(String[] args) {
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < 32; i++) {
            futures.add(scrapeExecutor.submit(HistogramDemo::scrape));
            futures.add(recordExecutor.submit(HistogramDemo::record));
        }
        System.out.println("Tasks submitted, releasing the Kraken...");
        latch.countDown();
        waitForFutures(futures);
        scrapeExecutor.shutdown();
        recordExecutor.shutdown();
        System.out.println(registry.scrape());
    }

    static void record() {
        waitForLatch();
        timer.record(Duration.ofMillis(100));
    }

    static void scrape() {
        waitForLatch();
        registry.scrape();
    }

    static void waitForLatch() {
        try {
            latch.await();
        }
        catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    static void waitForFutures(List<Future<?>> futures) {
        for (Future<?> future : futures) {
            try {
                future.get();
            }
            catch (Exception e) {
                future.cancel(true);
                System.out.println(e.getMessage());
//                e.printStackTrace();
            }
        }
    }
}

If I also print out the count and the buckets, I get negative values for the +Inf bucket.
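
To observe that directly, here is a hypothetical helper (my addition, using Micrometer's HistogramSnapshot and CountAtBucket API, and assuming the +Inf count is derived as count minus the last finite cumulative bucket) that can be dropped into the HistogramDemo class above and called from scrape():

// Hypothetical helper, not part of the original reproducer: flags snapshots where the
// cumulative count of the last finite bucket has run ahead of the total count.
static void printDerivedInfBucket() {
    io.micrometer.core.instrument.distribution.HistogramSnapshot snapshot = timer.takeSnapshot();
    io.micrometer.core.instrument.distribution.CountAtBucket[] buckets = snapshot.histogramCounts();
    double lastBucket = buckets.length > 0 ? buckets[buckets.length - 1].count() : 0.0;
    double derivedInf = snapshot.count() - lastBucket;
    if (derivedInf < 0) {
        System.out.println("Inconsistent snapshot: count=" + snapshot.count()
                + ", last bucket=" + lastBucket + ", derived +Inf=" + derivedInf);
    }
}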

jonatan-ivanov commented 3 months ago

I think I fixed this in https://github.com/micrometer-metrics/micrometer/commit/796c1e5f39b1d44da94aec1ee0760b73214f28b2 (please see the commit message for details). I also added some jcstress tests.

1.13.2-SNAPSHOT is published in our snapshot repo (see the README); could you please try it and let us know whether you still see the issue?

anetamoravcikova commented 3 months ago

Hello, I see that the issue is fixed. Can you please tell us when the fix will be released for production use?

knoobie commented 3 months ago

If you click on the milestone (top right) you can find an estimated release date.