quarkusio / quarkus

Quarkus: Supersonic Subatomic Java.
https://quarkus.io
Apache License 2.0

Quarkus micrometer - Issue with distribution type metrics (stackdriver) #13393

Closed samodadela closed 3 years ago

samodadela commented 3 years ago

It seems that distribution type metrics don't work with Stackdriver: values are sent, but they are all zeros.

I'd expect to see the actual values of those metrics.

All distribution type metrics send zeros to Stackdriver. The metrics in question are:

To reproduce, create some charts in GCP Monitoring Dashboards, start a Quarkus app, and observe whether any values are being sent.

This happens with Quarkus 1.9.1, both with quarkus:dev and with the app deployed to GKE. The build uses OpenJDK 11 and Maven, and the app runs in JVM mode (NOT native).

Additional context: there is an existing discussion on Zulip (see messages from Nov 11 onwards): https://quarkusio.zulipchat.com/#narrow/stream/187030-users/topic/quarkus.20micrometer.20Distribution.20type.20metrics

ghost commented 3 years ago

/cc @ebullient, @jmartisk

samodadela commented 3 years ago

Additional info

Prometheus reports non-zero values:

jvm_gc_pause_seconds_count{action="end of minor GC",cause="Metadata GC Threshold",} 2.0
jvm_gc_pause_seconds_sum{action="end of minor GC",cause="Metadata GC Threshold",} 0.033
jvm_gc_pause_seconds_count{action="end of minor GC",cause="G1 Evacuation Pause",} 1.0
jvm_gc_pause_seconds_sum{action="end of minor GC",cause="G1 Evacuation Pause",} 0.021

Some info from a debug session:

metric {
  labels {
    key: "action"
    value: "end of minor GC"
  }
  labels {
    key: "cause"
    value: "G1 Evacuation Pause"
  }
  type: "custom.googleapis.com/jvm/gc/pause"
}
resource {
  type: "global"
  labels {
    key: "project_id"
    value: "PID"
  }
}
metric_kind: GAUGE
value_type: DISTRIBUTION
points {
  interval {
    end_time {
      seconds: 1605104454
      nanos: 845000000
    }
  }
  value {
    distribution_value {
      bucket_options {
        explicit_buckets {
          bounds: 0.0
        }
      }
      bucket_counts: 0
    }
  }
}

The breakpoint was in StackdriverMeterRegistry.java, line 169:

                client.createTimeSeries(request);
                logger.debug("successfully sent {} TimeSeries to Stackdriver", partition.size());

The call actually lands in:

Stream<TimeSeries> createTimeSeries(HistogramSupport histogramSupport, boolean timeDomain)

histogramSupport.id is for jvm.gc.pause, but the values are always 0.

The zero values come from:

    @Override
    public HistogramSnapshot takeSnapshot(long count, double total, double max) {
        // returns a snapshot with no percentile or bucket data
        return HistogramSnapshot.empty(count, total, max);
    }

That's Micrometer's NoopHistogram.

samodadela commented 3 years ago

A screenshot from GCP Metrics explorer showing all metrics are 0:

[screenshot attached]

ebullient commented 3 years ago

Ok. I was finally able to hook things up to send metrics myself, and.... I can't reproduce the issue. I will try a few more things. ;)

ebullient commented 3 years ago

I enabled Stackdriver in my monster-combat application and configured it to push metrics to Stackdriver.

The changes are here: https://github.com/ebullient/monster-combat/tree/stackdriver

With or without debugging, the Stackdriver registry was always using a valid (step distribution) histogram snapshot instance.

So I need to figure out what is different between your configuration and mine...

samodadela commented 3 years ago

Anything I can help with?

AFAIK, apart from adding custom tags there is nothing special.

The only setting I use is: quarkus.micrometer.export.stackdriver.project-id=myprojectid

I have both the Stackdriver and the Prometheus registry as dependencies.

ebullient commented 3 years ago

I turned on logging. Can you double-check your changes against mine and turn some logging on to see whether there are any other indications of something not loading properly?

https://github.com/ebullient/monster-combat/compare/stackdriver#diff-bb78680d9f1b89671ae22ad905101ec96caeb581925f8cbe770355eb31571bbdL9

samodadela commented 3 years ago

@ebullient

I have:

import io.micrometer.core.instrument.Tag;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.stackdriver.StackdriverMeterRegistry;
import io.quarkus.micrometer.runtime.MeterFilterConstraint;
import org.project.events.config.AppInfo;

import javax.enterprise.inject.Produces;
import javax.inject.Inject;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;

// @Singleton
// @Priority(Interceptor.Priority.APPLICATION - 100)
public class MonitoringConfig {

    @Inject
    AppInfo appInfo;

    @Produces
    @MeterFilterConstraint(applyTo = StackdriverMeterRegistry.class)
    public MeterFilter configureStackdriverRegistries() {
        return commonTagsMeterFilter();
    }

    private MeterFilter commonTagsMeterFilter() {
        String environment = getEnvironment(appInfo);
        String userName = getUsername();
        String hostName = getHostName();
        String moduleName = getModuleName();
        return MeterFilter.commonTags(List.of(
                Tag.of("environment", environment),
                Tag.of("user", userName),
                Tag.of("host", hostName),
                Tag.of("module", moduleName)));
    }

    // getEnvironment(), getUsername(), getHostName(), getModuleName() omitted
}
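For completeness, a rough, hypothetical sketch of what those omitted helpers might look like; getHostName() presumably uses InetAddress (matching the imports above), while the environment/module lookups are app-specific and shown here only as illustrative stand-ins:

import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical stand-ins for the omitted helpers; in the real app
// getEnvironment(appInfo) presumably reads from AppInfo, and getModuleName() is app-specific.
final class MonitoringTagDefaults {

    static String getHostName() {
        try {
            return InetAddress.getLocalHost().getHostName();
        } catch (UnknownHostException e) {
            return "unknown-host";
        }
    }

    static String getUsername() {
        return System.getProperty("user.name", "unknown-user");
    }

    static String getEnvironment(String fallback) {
        return System.getenv().getOrDefault("APP_ENVIRONMENT", fallback);
    }

    static String getModuleName(String fallback) {
        return System.getenv().getOrDefault("APP_MODULE", fallback);
    }
}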

On another note: the only thing I changed here is commenting out the interceptor annotations. Some time ago, on another issue, you told me to add them for the registration of tags... it seems they are not needed; the tags are added even without them. I also moved MonitoringConfig from a 'common' jar directly into the app, just in case.

Then:

%dev.quarkus.micrometer.export.stackdriver.enabled=true
%dev.quarkus.micrometer.export.stackdriver.project-id=project
%dev.quarkus.micrometer.export.stackdriver.publish=true
%dev.quarkus.log.category."io.micrometer".level=TRACE

I DO NOT have this:

    @Produces
    public CombatMetrics createCombatMetrics(MeterRegistry registry) {
        return new CombatMetrics(registry);
    }

Can't find CombatMetrics in your sources... what does it do?

Then, regarding debugging... some weird stuff is going on. If I run:

mvn quarkus:dev

from within the IntelliJ window, I sometimes get some DISTRIBUTION metrics on stdout, including jvm/gc/pause... and the first time they are sent there are some non-zero values, though the next time they are sent the values are 0...

Then I thought I'd collect this in a logfile, using:

mvn quarkus:dev > /tmp/pause.log 2>&1

... but then I never get any DISTRIBUTION metric, even if I wait a very long time... I do see other metrics being sent, though. I had to use the TRACE level.

$ cat /tmp/pause.log  | egrep "successfully sent|DIST" 
2020-11-25 22:54:12,840 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 22:55:11,480 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 22:56:11,445 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 22:57:11,453 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 22:58:11,415 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 22:59:11,450 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 23:00:11,404 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 23:01:11,448 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 23:02:11,481 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 23:03:11,460 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 23:04:11,472 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 23:05:11,441 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 23:06:11,456 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 23:07:11,451 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver

samodadela commented 3 years ago

Output when running mvn quarkus:dev from IntelliJ without redirecting output to a file:

run1:

    type: "custom.googleapis.com/jvm/gc/pause"
  }
  resource {
    type: "global"
    labels {
      key: "project_id"
      value: "project"
    }
  }
  metric_kind: GAUGE
  value_type: DISTRIBUTION
  points {
    interval {
      end_time {
        seconds: 1606341161
        nanos: 35000000
      }
    }
    value {
      distribution_value {
        bucket_options {
          explicit_buckets {
            bounds: 0.0
          }
        }
        bucket_counts: 0
      }
    }
  }
}

run2: err... can't get it again ;) I swear I saw some values once or twice!

It behaves very much like quantum mechanics: if you observe it, you change the result ;) Could it be some concurrency problem? I'm debugging in a crappy VM (i.e. a slooow system), though the problem also happens while running on GKE.

samodadela commented 3 years ago

Almost forgot, my pom.xml:

        <dependency>
            <groupId>io.quarkus</groupId>
            <artifactId>quarkus-micrometer</artifactId>
        </dependency>
        <dependency>
            <groupId>io.micrometer</groupId>
            <artifactId>micrometer-registry-stackdriver</artifactId>
            <exclusions>
                <exclusion>
                    <groupId>ch.qos.logback</groupId>
                    <artifactId>logback-classic</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>io.micrometer</groupId>
            <artifactId>micrometer-registry-prometheus</artifactId>
        </dependency>

ebullient commented 3 years ago

The default interval for Stackdriver is 1 minute between batches, which will definitely feel like a long time when you're drumming your fingers waiting for something to happen. ;)

CombatMetrics wraps the registration of some custom metrics in my application; it isn't relevant to what we're doing here.

I think you're right that the interceptor isn't needed anymore. Quarkus actively looks for @Produces methods that return a MeterFilter. Part of me wants your monitoring config to be @Singleton (I prefer an explicit scope, but that's just me).
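For reference, a minimal sketch of that shape, with an explicit scope; the class name, method name, and tag values are illustrative:

import io.micrometer.core.instrument.Tag;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.stackdriver.StackdriverMeterRegistry;
import io.quarkus.micrometer.runtime.MeterFilterConstraint;

import javax.enterprise.inject.Produces;
import javax.inject.Singleton;
import java.util.List;

@Singleton
public class MonitoringConfig {

    // Quarkus picks up @Produces methods returning MeterFilter;
    // the constraint limits this filter to the Stackdriver registry.
    @Produces
    @MeterFilterConstraint(applyTo = StackdriverMeterRegistry.class)
    public MeterFilter stackdriverCommonTags() {
        return MeterFilter.commonTags(List.of(
                Tag.of("environment", "dev"),
                Tag.of("module", "my-module")));
    }
}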

Can we try this just for fun?

quarkus.log.category."io.quarkus.micrometer".level=DEBUG

Maybe we'll see that something isn't being initialized when we think it is. Given inheritance, the StackdriverMeterRegistry is a StepMeterRegistry... there isn't any way that I know of for it to get a no-op histogram flavor (that flavor can certainly happen on the Prometheus path): https://github.com/micrometer-metrics/micrometer/blob/133d359406054fb9afe08d0c6ff0c4288118caf5/micrometer-core/src/main/java/io/micrometer/core/instrument/step/StepMeterRegistry.java#L94

samodadela commented 3 years ago

I know about the 1 minute interval, and from the log excerpt above (/tmp/pause.log) you can see that I did wear my fingers out tapping ;) 11 minutes of excruciating pain. The log above was taken with:

quarkus.log.category."io.quarkus.micrometer".level=TRACE

With DEBUG I only see the "successfully sent" lines...

I'll play around with:

quarkus.log.category."io.quarkus.micrometer".level=DEBUG

I'll also turn off Prometheus just in case... but I'm pretty sure the Noop histogram was called from within StackdriverMeterRegistry.

Could there be a race condition in the initialization process? From what I saw the behavior was not consistent: sometimes the DISTRIBUTION type metrics were reported, other times not.

BTW: Can the JDK (OpenJDK 11) cause problems? What are you using?

samodadela commented 3 years ago

BTW: Which metric were you watching while debugging? I was observing jvm/gc/pause... Also... I'm debugging the issue with Quarkus 1.9.1.

samodadela commented 3 years ago

@ebullient I tested again and found that it does indeed work... I tried with Prometheus disabled, just in case. This time DISTRIBUTION metrics were reported. I found that to get non-zero values you actually need to perform some operations. For jvm/gc/pause I used VisualVM and triggered a garbage collection. Then I got:

  metric_kind: GAUGE
  value_type: DISTRIBUTION
  points {
    interval {
      end_time {
        seconds: 1607349277
        nanos: 897000000
      }
    }
    value {
      distribution_value {
        count: 1
        mean: 29.0
        bucket_options {
          explicit_buckets {
            bounds: 0.0
          }
        }
        bucket_counts: 1
      }
    }
  }

For the HTTP server distribution metrics I made some HTTP GET requests. Note that in contrast with the GC metric (where you get zeros from the start), the HTTP server distribution metrics don't get reported at all until you perform the first request. Here's an example for custom.googleapis.com/http/server/requests:

  metric_kind: GAUGE
  value_type: DISTRIBUTION
  points {
    interval {
      end_time {
        seconds: 1607361097
        nanos: 896000000
      }
    }
    value {
      distribution_value {
        count: 7
        mean: 48.79254771428572
        bucket_options {
          explicit_buckets {
            bounds: 0.0
          }
        }
        bucket_counts: 7
      }
    }
  }
}

I don't quite understand what the bucket_counts number means... It says 7 buckets, but there are no values for the buckets? If it's supposed to be like that, then I guess you can close the issue.

samodadela commented 3 years ago

BTW: those metrics cannot be visualized as heatmaps; for that we would need buckets with data. I can use 'mean' and see its value, though.

ebullient commented 3 years ago

I think this sounds about right... without turning on additional data collection (using a MeterFilter), there won't be explicit buckets.
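For anyone who wants heatmap-style buckets, a minimal sketch of that kind of MeterFilter; the class name, producer method, and the metric-name check are illustrative, and whether the buckets then show up in Stackdriver would need to be verified:

import io.micrometer.core.instrument.Meter;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.core.instrument.distribution.DistributionStatisticConfig;

import javax.enterprise.inject.Produces;
import javax.inject.Singleton;

@Singleton
public class HistogramBucketConfig {

    // Turns on percentile-histogram buckets for the HTTP server timer, so the
    // distribution published to Stackdriver should carry real bucket boundaries
    // instead of the single 0.0 bound seen above.
    @Produces
    public MeterFilter enableHttpServerHistogramBuckets() {
        return new MeterFilter() {
            @Override
            public DistributionStatisticConfig configure(Meter.Id id, DistributionStatisticConfig config) {
                if (id.getName().startsWith("http.server.requests")) {
                    return DistributionStatisticConfig.builder()
                            .percentilesHistogram(true)
                            .build()
                            .merge(config);
                }
                return config;
            }
        };
    }
}

It would be registered the same way as the common-tags filter earlier in the thread.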