micrometer-metrics / micrometer

An application observability facade for the most popular observability tools. Think SLF4J, but for observability.
https://micrometer.io
Apache License 2.0
4.45k stars 981 forks source link

OTLP Histogram missing bucket count for positive infinity #3934

Closed lenin-jaganathan closed 1 year ago

lenin-jaganathan commented 1 year ago

Describe the bug https://github.com/open-telemetry/opentelemetry-proto/blob/4967b51b5cb29f725978362b9d413bae1dbb641c/opentelemetry/proto/metrics/v1/metrics.proto#L426-L427

The above spec from OpenTelemetry specifies that the count of bucket_counts should be one greater than the count of explicit_bounds with the last bucket_count representing the count of values between last bucket and Positive Infinity. The current implementation in Micormeter doesn't track the values between the last bucket and POSITIVE_INFINITY by default. Even if PSOITIVE_INFINITY is added as an SLO value, the micrometer will add an explicit_bound with the value POSITIVE_INFINITY since the micrometer does a strict 1-1 mapping of values-count.

Environment

To Reproduce How to reproduce the bug: Configure an OTLPMeterRegistry and add a Timer with histograms enabled.

Expected behavior bucket_count should be one greater than explicit_bounds and it should represent values between last bucket and positive infinity.

shakuzen commented 1 year ago

Thanks for bringing this to our attention. Maybe it's just me, but comments in a proto definition don't seem like the best place to specify these things. There's no mention of this in the published documentation for histograms. Could you open an issue with OpenTelemetry to get the relevant info reflected in the documentation? I'm also wondering how OTLP producers are supposed to catch these kinds of issues. It would have been nice if we got some validation failure or at least warning somewhere. Neither the proto nor the OTel collector complain about bucket_count having the wrong number of buckets.

lenin-jaganathan commented 1 year ago

Maybe the word "must" in the proto is the cue for producers. I will open up an issue on the open telemetry docs to address this.

What I see is some of the exporters we are using currently, just ignore all the buckets because the count is not one greater than buckets. I did spend a lot of time debugging but using a LoggingExporter which just prints things as is and I don't see any logs for that. I have to read the proto and try other SDKs (I used NodeJs) to see the difference is the number of counts.

lenin-jaganathan commented 1 year ago

https://github.com/micrometer-metrics/micrometer/pull/3935 - PR for fixing this. Somehow default linking of issues to PR didn't work. Because, I didn't target the main branch.

I have targeted 1.11.x branch, since 1.10.x and 1.11.x will cause conflicts due to a lot of changes between them (mainly due to delta aggregation support).