Prometheus ClassCastException while scraping (Micrometer 1.13 / SB 3.3)

micrometer-metrics / micrometer

An application observability facade for the most popular observability tools. Think SLF4J, but for observability.

https://micrometer.io

Apache License 2.0

4.45k stars 981 forks source link

Prometheus ClassCastException while scraping (Micrometer 1.13 / SB 3.3) #5150

Closed knoobie closed 3 months ago

knoobie commented 4 months ago

Describe the bug Multiple Methods annotated with @Timed (or creating a Timer by hand) with the same value to group them together throws a ClassCastException once histogram is enabled on some - but not on all.

java.lang.ClassCastException: class io.prometheus.metrics.model.snapshots.SummarySnapshot$SummaryDataPointSnapshot cannot be cast to class io.prometheus.metrics.model.snapshots.HistogramSnapshot$HistogramDataPointSnapshot (io.prometheus.metrics.model.snapshots.SummarySnapshot$SummaryDataPointSnapshot and io.prometheus.metrics.model.snapshots.HistogramSnapshot$HistogramDataPointSnapshot are in unnamed module of loader 'app')

    at io.prometheus.metrics.expositionformats.OpenMetricsTextFormatWriter.writeClassicHistogramBuckets(OpenMetricsTextFormatWriter.java:125)
    at io.prometheus.metrics.expositionformats.OpenMetricsTextFormatWriter.writeHistogram(OpenMetricsTextFormatWriter.java:120)
    at io.prometheus.metrics.expositionformats.OpenMetricsTextFormatWriter.write(OpenMetricsTextFormatWriter.java:72)
    at io.micrometer.prometheusmetrics.PrometheusMeterRegistry.scrape(PrometheusMeterRegistry.java:167)
    at io.micrometer.prometheusmetrics.PrometheusMeterRegistry.scrape(PrometheusMeterRegistry.java:163)
    at io.micrometer.prometheusmetrics.PrometheusMeterRegistry.scrape(PrometheusMeterRegistry.java:136)

Environment

Micrometer version: 1.13 (bundled in Spring Boot 3.3)
Micrometer registry: Prometheus

To Reproduce

Smallest example by Andy Wilkinson (wilkinsona) (don't wanna spam tag him)


package com.example;

import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.Timer.Sample;
import io.micrometer.prometheusmetrics.PrometheusConfig;
import io.micrometer.prometheusmetrics.PrometheusMeterRegistry;

public class Reproducer {

    public static void main(String[] args) throws InterruptedException {
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(new PrometheusConfig() {

            @Override
            public String get(String key) {
                return null;
            }

        });

        timed(registry, false, "one");
        timed(registry, true, "two");

        System.out.println(registry.scrape("application/openmetrics-text; version=1.0.0; charset=utf-8"));
    }

    static void timed(PrometheusMeterRegistry registry, boolean histogram, String methodName) throws InterruptedException {
        Sample sample = Timer.start(registry);
        Thread.sleep(500);
        sample.stop(Timer.builder("example").publishPercentileHistogram(histogram).tag("method-name",  methodName).register(registry));
    }

}

Example with Spring involved: https://github.com/spring-projects/spring-boot/issues/40907#issuecomment-2132726649

Expected behavior No exception; using Timer with and without histogram should work.

Additional context Originally reported here: https://github.com/spring-projects/spring-boot/issues/40907

jonatan-ivanov commented 3 months ago

Thanks for the issue and thanks Andy for the reproducer!

Workaround

It seems this is only an issue with the Prometheus 1.x Client, using the 0.x Client does not seem to have this issue. As a workaround you can downgrade to the 0.x client by using micrometer-registry-prometheus-simpleclient instead of micrometer-registry-prometheus. Please see the release notes of Micrometer 1.13.0 as well as the Migration Guide. In a nutshell: we upgraded micrometer-registry-prometheus from Prometheus Client 0.x to 1.x but also kept the old client as-is under a different artifact: micrometer-registry-prometheus-simpleclient. (See a code example below.)

Investigation Notes

Here's another reproducer also proving that the 0.x Client does not have this issue (commented out):

io.micrometer.prometheusmetrics.PrometheusMeterRegistry registry = new io.micrometer.prometheusmetrics.PrometheusMeterRegistry(io.micrometer.prometheusmetrics.PrometheusConfig.DEFAULT);
// io.micrometer.prometheus.PrometheusMeterRegistry registry =  new io.micrometer.prometheus.PrometheusMeterRegistry(io.micrometer.prometheus.PrometheusConfig.DEFAULT);

Timer.builder("test").tag("index", "1").publishPercentileHistogram(false).register(registry).record(Duration.ofMillis(100));
Timer.builder("test").tag("index", "2").publishPercentileHistogram( true).register(registry).record(Duration.ofMillis(100));
System.out.println(registry.scrape());

Doing such thing seems quite weird to me, it produces an output like this:

# HELP test_seconds  
# TYPE test_seconds summary
test_seconds_count{index="1",} 1.0
test_seconds_sum{index="1",} 0.1
test_seconds_bucket{index="2",le="0.001",} 0.0
[...]
test_seconds_bucket{index="2",le="+Inf",} 1.0
test_seconds_count{index="2",} 1.0
test_seconds_sum{index="2",} 0.2

This means that test_seconds is neither a Prometheus Summary (it has buckets) nor a Prometheus Histogram (it does not have buckets). The TYPE says summary which is not true. I think this was possible since with the 0.x Client Micrometer had bigger freedom to do things (even if those things were invalid) and it was able to produce "invalid" results. The Prometheus 1.x Client is much restrictive so some things are not possible anymore (in general I think this is a good thing).

Let me reach out to the Prometheus Client maintainer but right now I feel the output should not be considered as valid and this should be fix in your code by

either turning histograms on for every Timer/LongTaskTimer/DistributionSummary with the same name.
or use a different name for meters where histograms are on/off

knoobie commented 3 months ago

Thanks for investigation; I've applied your mentioned workaround with adding histogram to all those cases.

Example / Context / Reasoning behind this: We use @Timed for all external methods of, for example, Controller to get information about througput - to group them all, all use e.g. controller_1 as name and add additional tags like "method=get" and so on to identify each method within the service. To reduce "costs" of storing operations where we know that they are fast / cheap we skip the histogram flag to reduce the amount of scraped data points.

jonatan-ivanov commented 3 months ago

If you are using a framework like Spring Boot, Quarkus, Micronaut, etc, those controllers are already instrumented so you don't need to re-instrument them with @Timed.

Btw, you can apply @Counted where you don't care about the duration and @Timed where you do (@Counted is also cheaper). In Prometheus, metrics under the same name should have the same type, the previous behavior was because Micrometer was able to generate basically any output which is not the case anymore.

jonatan-ivanov commented 3 months ago

After talking to the maintainer of the Prometheus Client, there is one more workaround you can do right now: in Prometheus the histogram is valid if it has a single bucket: +Inf. You can "approximate" this behavior with the current Micrometer API without any modifications by having two buckets using .serviceLevelObjectives(...):

PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
Timer.builder("test").tag("index", "1")
    .serviceLevelObjectives(Duration.ofNanos(1)) // or you can use any meaningful value, like 100 ms
    .register(registry)
    .record(Duration.ofMillis(100));
Timer.builder("test").tag("index", "2")
    .publishPercentileHistogram( )
    .register(registry)
    .record(Duration.ofMillis(100));

System.out.println(registry.scrape());

Doing this means that sometimes you will have all the buckets and sometimes you will have only two buckets (not sure if it is good enough in your case). Unfortunately @Timed can't do SLOs right now but we can consider adding it, we already have a PR: https://github.com/micrometer-metrics/micrometer/pull/5145 please let us know if this is something you would use (by commenting on it/adding a 👍🏼).

I think I still not recommend doing the above since the buckets are different and because of that they are not aggregatable which kind of defeats the purpose of the histogram and I would still go with having summary and histogram with different names but since you are already doing this, you probably already aware of the consequences.

With this, let me close this issue, please let us know if these don't help and we can reopen.

opendevrunner commented 1 month ago

Hello,

the issue exists if the 'publishHistogram()' can not be used. The same ClassCastException occur if a histogram with defined ranges is used:

Timer.builder("test").tag("index", "1")
    .serviceLevelObjectives(Duration.ofNanos(1)) // or you can use any meaningful value, like 100 ms
    .register(registry)
    .record(Duration.ofMillis(100));
Timer.builder("test").tag("index", "2")
    .publishPercentiles(0.5, 0.75, 0.9, 0.95, 0.99,0.999 )
    .register(registry)
    .record(Duration.ofMillis(500));

From my point of view this is a breaking change as Meter with different tags are a normal case to differentiate results in Grafana. Such workarounds does not help. A solution without ClassCastExceptions and at least error reporting of problematic meters is required. The current implementation does not report ANY metric back to Prometheus.

Please recheck the decision to close the issue.

jonatan-ivanov commented 1 month ago

Thanks for raising this issue. This is a limitation of the Prometheus and the OpenMetrics formats, I wrote this down in the Migration Guide.

Micrometer using the old Prometheus Java Client (0.x) (micrometer-registry-prometheus-simpleclient) produced an invalid output that is not possible anymore since the new Prometheus Java Client (1.x) (micrometer-registry-prometheus) does not let it (hence the exception). This is in fact a ~breaking change in the behavior, please read the Migration Guide that explains what is happening and gives you much more details.

Since the new Prometheus Java Client (1.x) makes it impossible to bring back the old behavior which I'm not sure we should (it was invalid), I'm not sure what should we do on Micrometer side. Please let us know if you have any idea.

Also, please notice that if any changes mentioned in the Migration Guide impacts you negatively and you don't want to migrate right now, you can still use the new Prometheus Java Client (0.x) with Micrometer, you "just" need to change the dependency from io.micrometer:micrometer-registry-prometheus to io.micrometer:micrometer-registry-prometheus-simpleclient, something like this:

- runtimeOnly 'io.micrometer:micrometer-registry-prometheus:1.12.+'
+ runtimeOnly 'io.micrometer:micrometer-registry-prometheus-simpleclient:1.13.+'

Please see the Migration Guide for more details.

rschuetz commented 1 month ago

Thanks for your quick response @jonatan-ivanov, I'm a colleague of @opendevrunner.

The problem is not really the case where the client produces invalid output, more the case where types are mixed.

One use case we have is basically a generic performance measurement library that's using the same name (but different tags) for the different measured operations. In addition, the different operations can be configured independently with for additional SLO or percentiles or no additional details at all, whatever is needed or makes sense for the specific operation. There is no need to merge histograms or so for different operations, they're completely independent and just share the same name.

Operations are grouped and there is a generic board in Grafana showing the data for all operations with grouping and filtering. There are different panels for the optional SLOs etc. - some of them stay empty if not reported - but that's ok. Using different names for different types would break / complicate this approach.

jonatan-ivanov commented 1 month ago

The problem is not really the case where the client produces invalid output, more the case where types are mixed.

That's what I was referring as "invalid output": Micrometer mixes the Summary and the Histogram types with the old client. Doing this is invalid and the new client does not allow it.

In addition, the different operations can be configured independently with for additional SLO or percentiles or no additional details at all, whatever is needed or makes sense for the specific operation.

According to the Prometheus and the OpenMetrics specifications, this is invalid.

There is no need to merge histograms or so for different operations, they're completely independent and just share the same name.

If they share the same name they are not independent, they belong to the same family so they must have the same type.

If I understand correctly, this is what you want:

# TYPE test_seconds histogram
# HELP test_seconds
test_seconds_bucket{index="1",le="0.1"} 1
test_seconds_bucket{index="1",le="+Inf"} 1
test_seconds_count{index="1"} 1
test_seconds_sum{index="1"} 0.01
test_seconds{index="2",quantile="0.99"} 0.009961472
test_seconds_count{index="2"} 1
test_seconds_sum{index="2"} 0.01

If so, please notice that:

test_seconds_count, test_seconds_sum, test_seconds_bucket, and test_seconds belong to the same family, it has a name test_seconds and the type is histogram.
# TYPE test_seconds histogram is invalid, this is not a histogram since it contains quantiles.
summary would be an equally invalid type since it contains buckets.
Wether this happens under one combination of tags or not is irrelevant, since they share the same name they belong to the same family and they share the same type.
The output above is invalid according to the specs and the new Prometheus Client does not allow it.

If not, please let us know what you would like the Prometheus output to look like. Also, please let us know if you have any idea to fix it (please keep in mind that we would like to produce a valid output according to the specs and the client).

I think quantiles are not allowed on histograms since if you have a histogram, you can calculate the quantiles on the backend. Except in your case, since you use the histogram for SLOs, so you only have a few buckets and this calculation does not make sense since it would be inaccurate (you need more buckets for this: .publishPercentileHistogram()). In this use-case, I think it would make sense to add quantiles to the histogram but this needs to be done in the Prometheus and the OpenMetrics specs first. Please let us know if you want to go this route and I can ask around and help opening a thread with them.