Closed jpkrohling closed 6 years ago
@objectiser I'm submitting this for an initial review, even though this is not ready yet, as there's a test failing.
The test PrometheusMetricsReporterConfigurationTest#testMetricsReporter
works when executed individually (mvn test -Dtest=io.opentracing.contrib.metrics.prometheus.spring.autoconfigure.PrometheusMetricsReporterConfigurationTest -DfailIfNoTests=false
), indicating that there's some dirt left behind by some other test.
I added some debugging sysout to the test, as you'll surely note during the review. Here's a sample output on my local machine when running a simple mvn test
:
Producing registry io.prometheus.client.CollectorRegistry@75361cf6
Finished printing out samples
...
Using registry on test: io.prometheus.client.CollectorRegistry@75361cf6
Name: testmetrics_seconds Type: SUMMARY Help: Samples: [Name: testmetrics_seconds_count LabelNames: [error, operation, span_kind] labelValues: [false, testOp, client] Value: 0.0, Name: testmetrics_seconds_sum LabelNames: [error, operation, span_kind] labelValues: [false, testOp, client] Value: 0.0]
Name: testmetrics_seconds_max Type: GAUGE Help: Samples: [Name: testmetrics_seconds_max LabelNames: [error, operation, span_kind] labelValues: [false, testOp, client] Value: 0.0]
Finished printing out samples
Those samples seem to come from the test PrometheusMetricsReporterConfigurationWithNameTest#testMetricsReporterName
, but I couldn't figure out what could be leaking from it...
@jpkrohling Did a quick test - and added a comment for one issue found.
Noticed that the metrics names are span_seconds_max/count/sum
- where did the 'seconds' part come from?
We should probably have a discussion with the prometheus guys about the pros and cons of histogram representation for the type of data we are recording. As mentioned in the prometheus docs: "A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values." - so it automatically creates the count
and sum
info as well as buckets for the cumulative counters. However maybe this is then not as portable across other micrometer providers.
Noticed that the metrics names are span_seconds_max/count/sum - where did the 'seconds' part come from?
This comes from Micrometer itself. Not sure there's a way to disable, but I can look into it if it's undesirable.
We should probably have a discussion with the prometheus guys about the pros and cons of histogram representation for the type of data we are recording.
+1, cc @simonpasquier
This comes from Micrometer itself. Not sure there's a way to disable, but I can look into it if it's undesirable.
It might be good to do a quick investigation, to see if possible, as then atleast the metric names would be consistent with the previous version - but if not possible then I guess we can live with it.
This comes from Micrometer itself. Not sure there's a way to disable, but I can look into it if it's undesirable.
This is defined here.
We should probably have a discussion with the prometheus guys about the pros and cons of histogram representation for the type of data we are recording. As mentioned in the prometheus docs: "A histogram samples observations (usually things like request durations or response sizes) and counts them in configurable buckets. It also provides a sum of all observed values." - so it automatically creates the count and sum info as well as buckets for the cumulative counters.
From a Prometheus POV, it is recommended to go with histograms rather than summaries since the former can be aggregated server-side (eg give the 90th percentile for the metric foo
across all the instances of my service). The trade-off is that you need to have a rough idea about the bucket sizes to compute accurate values.
However maybe this is then not as portable across other micrometer providers.
There's a section about it in the Micrometer documentation. IIUC Micrometer will make the proper translations depending on the backend (for instance it won't publish histograms if the backend doesn't support them).
@simonpasquier Thanks for the reference - so looks like we need to use publishPercentileHistogram
on the Timer
, for prometheus anyway. @jpkrohling Looks like also may need to make sla
,minimumExpectedValue
and maximumExpectedValue
optionally configurable.
@objectiser, could you please review again?
I just applied the requested changes. The most important one is around the percentile
property. Would you mind reviewing this again? If it looks good, I'll squash this PR into a single commit.
The auto-config for the new properties (sla, maximum/minimum value, percentile, enable/disable percentile histogram) will be done in a follow-up PR.
Signed-off-by: Juraci Paixão Kröhling juraci@kroehling.de