opentracing-contrib / java-metrics

Apache License 2.0
31 stars 9 forks source link

Add micrometer metrics reporter #27

Closed jpkrohling closed 6 years ago

jpkrohling commented 6 years ago

Signed-off-by: Juraci Paixão Kröhling juraci@kroehling.de

jpkrohling commented 6 years ago

@objectiser I'm submitting this for an initial review, even though this is not ready yet, as there's a test failing.

The test PrometheusMetricsReporterConfigurationTest#testMetricsReporter works when executed individually (mvn test -Dtest=io.opentracing.contrib.metrics.prometheus.spring.autoconfigure.PrometheusMetricsReporterConfigurationTest -DfailIfNoTests=false), indicating that there's some dirt left behind by some other test.

I added some debugging sysout to the test, as you'll surely note during the review. Here's a sample output on my local machine when running a simple mvn test:

Producing registry io.prometheus.client.CollectorRegistry@75361cf6
Finished printing out samples
...
Using registry on test: io.prometheus.client.CollectorRegistry@75361cf6
Name: testmetrics_seconds Type: SUMMARY Help:   Samples: [Name: testmetrics_seconds_count LabelNames: [error, operation, span_kind] labelValues: [false, testOp, client] Value: 0.0, Name: testmetrics_seconds_sum LabelNames: [error, operation, span_kind] labelValues: [false, testOp, client] Value: 0.0]
Name: testmetrics_seconds_max Type: GAUGE Help:   Samples: [Name: testmetrics_seconds_max LabelNames: [error, operation, span_kind] labelValues: [false, testOp, client] Value: 0.0]
Finished printing out samples

Those samples seem to come from the test PrometheusMetricsReporterConfigurationWithNameTest#testMetricsReporterName, but I couldn't figure out what could be leaking from it...

objectiser commented 6 years ago

@jpkrohling Did a quick test - and added a comment for one issue found.

Noticed that the metrics names are span_seconds_max/count/sum - where did the 'seconds' part come from?

We should probably have a discussion with the prometheus guys about the pros and cons of histogram representation for the type of data we are recording. As mentioned in the prometheus docs: "A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values." - so it automatically creates the count and sum info as well as buckets for the cumulative counters. However maybe this is then not as portable across other micrometer providers.

jpkrohling commented 6 years ago

Noticed that the metrics names are span_seconds_max/count/sum - where did the 'seconds' part come from?

This comes from Micrometer itself. Not sure there's a way to disable, but I can look into it if it's undesirable.

We should probably have a discussion with the prometheus guys about the pros and cons of histogram representation for the type of data we are recording.

+1, cc @simonpasquier

objectiser commented 6 years ago

This comes from Micrometer itself. Not sure there's a way to disable, but I can look into it if it's undesirable.

It might be good to do a quick investigation, to see if possible, as then atleast the metric names would be consistent with the previous version - but if not possible then I guess we can live with it.

simonpasquier commented 6 years ago

This comes from Micrometer itself. Not sure there's a way to disable, but I can look into it if it's undesirable.

This is defined here.

We should probably have a discussion with the prometheus guys about the pros and cons of histogram representation for the type of data we are recording. As mentioned in the prometheus docs: "A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values." - so it automatically creates the count and sum info as well as buckets for the cumulative counters.

From a Prometheus POV, it is recommended to go with histograms rather than summaries since the former can be aggregated server-side (eg give the 90th percentile for the metric foo across all the instances of my service). The trade-off is that you need to have a rough idea about the bucket sizes to compute accurate values.

However maybe this is then not as portable across other micrometer providers.

There's a section about it in the Micrometer documentation. IIUC Micrometer will make the proper translations depending on the backend (for instance it won't publish histograms if the backend doesn't support them).

objectiser commented 6 years ago

@simonpasquier Thanks for the reference - so looks like we need to use publishPercentileHistogram on the Timer, for prometheus anyway. @jpkrohling Looks like also may need to make sla,minimumExpectedValue and maximumExpectedValue optionally configurable.

jpkrohling commented 6 years ago

@objectiser, could you please review again?

jpkrohling commented 6 years ago

I just applied the requested changes. The most important one is around the percentile property. Would you mind reviewing this again? If it looks good, I'll squash this PR into a single commit.

The auto-config for the new properties (sla, maximum/minimum value, percentile, enable/disable percentile histogram) will be done in a follow-up PR.