samodadela closed this issue 3 years ago
/cc @ebullient, @jmartisk
Additional info
Prometheus reports non-zero values:
jvm_gc_pause_seconds_count{action="end of minor GC",cause="Metadata GC Threshold",} 2.0
jvm_gc_pause_seconds_sum{action="end of minor GC",cause="Metadata GC Threshold",} 0.033
jvm_gc_pause_seconds_count{action="end of minor GC",cause="G1 Evacuation Pause",} 1.0
jvm_gc_pause_seconds_sum{action="end of minor GC",cause="G1 Evacuation Pause",} 0.021
Some info from a debug session:
metric {
  labels {
    key: "action"
    value: "end of minor GC"
  }
  labels {
    key: "cause"
    value: "G1 Evacuation Pause"
  }
  type: "custom.googleapis.com/jvm/gc/pause"
}
resource {
  type: "global"
  labels {
    key: "project_id"
    value: "PID"
  }
}
metric_kind: GAUGE
value_type: DISTRIBUTION
points {
  interval {
    end_time {
      seconds: 1605104454
      nanos: 845000000
    }
  }
  value {
    distribution_value {
      bucket_options {
        explicit_buckets {
          bounds: 0.0
        }
      }
      bucket_counts: 0
    }
  }
}
The breakpoint was at StackdriverMeterRegistry.java line 169...
client.createTimeSeries(request);
logger.debug("successfully sent {} TimeSeries to Stackdriver", partition.size());
I actually end up in:
Stream<TimeSeries> createTimeSeries(HistogramSupport histogramSupport, boolean timeDomain)
histogramSupport.id is jvm.gc.pause... but the values are always 0.
The zero values come from:
@Override
public HistogramSnapshot takeSnapshot(long count, double total, double max) {
    return HistogramSnapshot.empty(count, total, max);
}
That's Micrometer's NoopHistogram.
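For context, `HistogramSnapshot.empty(count, total, max)` keeps the count/total/max values but carries no percentile or bucket data at all. A self-contained sketch of the effect (the `Snapshot` record below is illustrative, not Micrometer's actual type):

```java
public class EmptySnapshotSketch {
    // Illustrative stand-in for Micrometer's HistogramSnapshot: an "empty"
    // snapshot keeps count/total/max but has zero buckets to publish.
    record Snapshot(long count, double total, double max, long[] bucketCounts) {
        static Snapshot empty(long count, double total, double max) {
            return new Snapshot(count, total, max, new long[0]); // no buckets at all
        }
    }

    public static void main(String[] args) {
        // Values borrowed from the Prometheus output above (count=2, sum=0.033)
        Snapshot s = Snapshot.empty(2, 0.033, 0.021);
        // With no buckets, whatever is derived downstream publishes bucket_counts: 0
        System.out.println(s.bucketCounts().length);
    }
}
```

That would explain zero bucket counts end to end, assuming the no-op path really is the one being hit.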
A screenshot from GCP Metrics Explorer showed all metrics at 0.
Ok. I was finally able to hook things up to send metrics myself, and... I can't reproduce the issue. I will try a few more things. ;)
I enabled Stackdriver in my monster-combat application and configured it to push metrics there.
changes are here: https://github.com/ebullient/monster-combat/tree/stackdriver
With or without debugging, the stackdriver registry was always using a valid (step distribution) histogram snapshot instance.
So I need to figure out what is different between your configuration and mine...
Anything I can help with?
AFAIK, apart from adding custom tags, there is nothing special.
The only setting I use is: quarkus.micrometer.export.stackdriver.project-id=myprojectid
I have both the Stackdriver and Prometheus registries as dependencies.
I turned on logging... can you double-check your changes against mine, and turn on some logging to see if there are any other indications of something not loading properly?
@ebullient
I have:
import io.micrometer.core.instrument.Tag;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.stackdriver.StackdriverMeterRegistry;
import io.quarkus.micrometer.runtime.MeterFilterConstraint;
import org.project.events.config.AppInfo;
import javax.enterprise.inject.Produces;
import javax.inject.Inject;
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.List;
// @Singleton
// @Priority(Interceptor.Priority.APPLICATION - 100)
public class MonitoringConfig {

    @Inject
    AppInfo appInfo;

    @Produces
    @MeterFilterConstraint(applyTo = StackdriverMeterRegistry.class)
    public MeterFilter configureStackdriverRegistries() {
        return commonTagsMeterFilter();
    }

    private MeterFilter commonTagsMeterFilter() {
        String environment = getEnvironment(appInfo);
        String userName = getUsername();
        String hostName = getHostName();
        String moduleName = getModuleName();
        return MeterFilter.commonTags(List.of(
                Tag.of("environment", environment),
                Tag.of("user", userName),
                Tag.of("host", hostName),
                Tag.of("module", moduleName)));
    }

    // helper methods getEnvironment(), getUsername(), getHostName(),
    // getModuleName() omitted
}
On another note: the only thing I did here was comment out the interceptor annotations. Some time ago, on another issue, you told me to add them for the registration of tags... it seems they're not needed; the tags are added even without them. I also moved MonitoringConfig from a 'common' jar directly into the app, just in case.
Then:
%dev.quarkus.micrometer.export.stackdriver.enabled=true
%dev.quarkus.micrometer.export.stackdriver.project-id=project
%dev.quarkus.micrometer.export.stackdriver.publish=true
%dev.quarkus.log.category."io.micrometer".level=TRACE
I DO NOT have this:
@Produces
public CombatMetrics createCombatMetrics(MeterRegistry registry) {
    return new CombatMetrics(registry);
}
Can't find CombatMetrics in your sources... what does it do?
Then, regarding debugging... some weird stuff is going on. If I run:
mvn quarkus:dev
from within the IntelliJ window, I sometimes get some DISTRIBUTION metrics on stdout, including jvm/gc/pause... and the first time they're sent there are some non-zero values... though the next time they're sent the values are 0...
Then I thought I'd collect this in a logfile, using:
mvn quarkus:dev > /tmp/pause.log 2>&1
.. but... then I never get any DISTRIBUTION metric, even if I wait a very long time... I see other metrics being sent, though. I had to use the TRACE level.
$ cat /tmp/pause.log | egrep "successfully sent|DIST"
2020-11-25 22:54:12,840 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 22:55:11,480 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 22:56:11,445 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 22:57:11,453 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 22:58:11,415 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 22:59:11,450 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 23:00:11,404 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 23:01:11,448 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 23:02:11,481 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 23:03:11,460 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 23:04:11,472 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 23:05:11,441 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 23:06:11,456 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
2020-11-25 23:07:11,451 DEBUG [io.mic.sta.StackdriverMeterRegistry] (stackdriver-metrics-publisher) successfully sent 53 TimeSeries to Stackdriver
Output when running mvn quarkus:dev from IntelliJ without redirection to a file:
run1:
  type: "custom.googleapis.com/jvm/gc/pause"
}
resource {
  type: "global"
  labels {
    key: "project_id"
    value: "project"
  }
}
metric_kind: GAUGE
value_type: DISTRIBUTION
points {
  interval {
    end_time {
      seconds: 1606341161
      nanos: 35000000
    }
  }
  value {
    distribution_value {
      bucket_options {
        explicit_buckets {
          bounds: 0.0
        }
      }
      bucket_counts: 0
    }
  }
}
}
run2: err... I can't get it again ;) I swear I saw some values once or twice!
It behaves very much like quantum mechanics: if you observe it, you change the result ;) Could it be some concurrency problem? I'm debugging in a crappy VM (i.e. a sloooow system). The problem also happens while running on GKE, though.
Almost forgot, my pom.xml:
<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-micrometer</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-stackdriver</artifactId>
    <exclusions>
        <exclusion>
            <groupId>ch.qos.logback</groupId>
            <artifactId>logback-classic</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
The default interval for Stackdriver is 1 minute between batches, which will definitely feel like a long time when you're drumming your fingers waiting for something to happen. ;)
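If the one-minute wait is too painful for local testing, Micrometer's step registries expose a step (publish interval) setting. A hedged sketch, assuming the Quarkus extension maps the standard `step` property for Stackdriver the way it does for other registries (verify against your Quarkus version before relying on it):

```properties
# Shorten the Stackdriver publish interval (default is 1m) -- local testing only
quarkus.micrometer.export.stackdriver.step=10s
```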
CombatMetrics wraps registration of some custom metrics in my application, it isn't relevant to what we're doing.
I think you're right that the interceptor isn't needed anymore. Quarkus actively looks for @Produces methods returning a MeterFilter. Part of me wants your monitoring config to be @Singleton (I prefer an explicit scope, but that's just me).
Can we try this just for fun?
quarkus.log.category."io.quarkus.micrometer".level=DEBUG
Maybe we'll see that something isn't being initialized when we think it is.. Given inheritance, the StackdriverMeterRegistry is a StepMeterRegistry ... there isn't any way that I know of for it to get a no-op histogram flavor (that flavor can certainly happen on the prometheus path..) https://github.com/micrometer-metrics/micrometer/blob/133d359406054fb9afe08d0c6ff0c4288118caf5/micrometer-core/src/main/java/io/micrometer/core/instrument/step/StepMeterRegistry.java#L94
I know about the 1-minute interval, and from the log excerpt above (/tmp/pause.log) you can see that I did wear my fingers out tapping ;) 11 minutes of excruciating pain. The log above was captured with:
quarkus.log.category."io.quarkus.micrometer".level=TRACE
Using DEBUG I only see the "successfully sent" lines...
I'll play around with:
quarkus.log.category."io.quarkus.micrometer".level=DEBUG
I'll also turn off Prometheus, just in case... but I'm pretty sure that the no-op histogram was called from within StackdriverMeterRegistry.
Could there be a race condition in the initialization process? From what I saw, the behavior was not consistent: sometimes the DISTRIBUTION type metrics were reported, other times not.
BTW: Can the JDK (OpenJDK 11) cause problems? What are you using?
BTW: Which metric were you watching while debugging? I was observing jvm/gc/pause... Also... I'm debugging the issue with Quarkus 1.9.1.
@ebullient I tested again and found that it does indeed work... I tried with Prometheus disabled, just in case. This time DISTRIBUTION metrics were reported. I found that to get non-zero values you actually need to perform some operations. For jvm/gc/pause I used VisualVM and triggered a garbage collection. Then I got:
metric_kind: GAUGE
value_type: DISTRIBUTION
points {
  interval {
    end_time {
      seconds: 1607349277
      nanos: 897000000
    }
  }
  value {
    distribution_value {
      count: 1
      mean: 29.0
      bucket_options {
        explicit_buckets {
          bounds: 0.0
        }
      }
      bucket_counts: 1
    }
  }
}
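As an aside, a scratch program can stand in for VisualVM when you just want to force a collection. A minimal sketch; System.gc() is only a hint to the JVM, but it usually produces a jvm.gc.pause sample:

```java
public class ForceGc {
    // Allocate some short-lived garbage, then hint the JVM to collect it,
    // so jvm.gc.pause gets at least one real observation.
    static String requestGc() {
        for (int i = 0; i < 1_000; i++) {
            byte[] garbage = new byte[1024 * 1024]; // ~1 MiB of garbage per iteration
        }
        System.gc(); // only a hint, not a guarantee
        return "GC requested";
    }

    public static void main(String[] args) {
        System.out.println(requestGc());
    }
}
```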
For the HTTP server distribution metrics I made some HTTP GET requests. Note that, in contrast with the GC metric (where you get zeros from the start), the HTTP server distribution metrics aren't reported at all until you perform the first request. Here's an example for custom.googleapis.com/http/server/requests:
metric_kind: GAUGE
value_type: DISTRIBUTION
points {
  interval {
    end_time {
      seconds: 1607361097
      nanos: 896000000
    }
  }
  value {
    distribution_value {
      count: 7
      mean: 48.79254771428572
      bucket_options {
        explicit_buckets {
          bounds: 0.0
        }
      }
      bucket_counts: 7
    }
  }
}
}
I don't quite understand what the bucket_counts number means... It says 7, but there are no values for the buckets? If it's supposed to be like that, then I guess you can close the issue.
BTW: these metrics cannot be visualized as heatmaps; for that we would need buckets with data. I can use 'mean' and see its value, though.
I think this sounds about right.. without turning on additional data collection (using a MeterFilter), there won't be explicit buckets.
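For anyone landing here later: that additional data collection can be enabled with a MeterFilter that turns on percentile histograms for the meters you care about. A sketch, assuming you want explicit buckets for http.server.requests; register it the same way as the commonTags filter above, e.g. via @Produces:

```java
import io.micrometer.core.instrument.Meter;
import io.micrometer.core.instrument.config.MeterFilter;
import io.micrometer.core.instrument.distribution.DistributionStatisticConfig;

public class HistogramBucketsConfig {

    // Enabling percentilesHistogram makes Micrometer publish per-bucket counts
    // instead of the single catch-all bucket seen in the dumps above.
    public static MeterFilter enableHistogramBuckets() {
        return new MeterFilter() {
            @Override
            public DistributionStatisticConfig configure(Meter.Id id,
                                                         DistributionStatisticConfig config) {
                if (id.getName().startsWith("http.server.requests")) {
                    return DistributionStatisticConfig.builder()
                            .percentilesHistogram(true)
                            .build()
                            .merge(config);
                }
                return config;
            }
        };
    }
}
```

Whether the heatmap in Metrics Explorer then renders usefully depends on the bucket boundaries Micrometer generates.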
It seems that distribution type metrics don't work with Stackdriver: values are sent, but they are all zeros.
I'd expect to see the actual values of those metrics.
All distribution type metrics send zeros to Stackdriver. The metrics in question are:
To reproduce: create some charts in GCP Monitoring Dashboards, start a Quarkus app, and observe whether any values are being sent.
This happens with Quarkus 1.9.1, either with quarkus:dev or with the deployed app (GKE), using OpenJDK 11 and a Maven build. The app runs in JVM mode (not native).
Additional context: there is an ongoing discussion on Zulip (for the latest info see Nov 11 onwards): https://quarkusio.zulipchat.com/#narrow/stream/187030-users/topic/quarkus.20micrometer.20Distribution.20type.20metrics