Explicit bounds incompatible with Prometheus

bogdandrutu commented 3 years ago

From @richih: "Prometheus uses le, OTel uses ge bounded buckets. The two are mathematically incompatible and impossible to transform from one into the other."

I think this can be fixed in-place or via a deprecation:

In-place: even though this may look like a backwards-incompatible change, OTel histograms are not currently emitted by any stable library release, so I think it is OK to change this even for data already emitted by an OTLP source. Buckets are used to calculate percentiles, and the edge cases where the two conventions produce different results are very few (for example, a source that records only values equal to one of the boundaries). PS: in the case of doubles, I would also check whether we compare them against the boundaries correctly.
Via deprecation: we can make this change as part of PR #255 and, when receiving the deprecated fields, simply ignore the le/ge difference, which is essentially what the in-place proposal does.
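
To make the edge case concrete, here is a minimal Python sketch, not taken from any OTel or Prometheus library. The function name `bucket_counts` and the sample data are hypothetical. It counts values per bucket under an upper-inclusive ("le") and a lower-inclusive ("ge") reading of the same explicit bounds. The counts should differ only when a recorded value is exactly equal to a boundary.

```python
from bisect import bisect_left, bisect_right

def bucket_counts(values, bounds, upper_inclusive):
    """Count values per bucket for sorted explicit bounds (hypothetical helper).

    upper_inclusive=True  -> buckets (b[i-1], b[i]]  ("le" reading)
    upper_inclusive=False -> buckets [b[i-1], b[i])  ("ge" reading)
    The last bucket is the overflow bucket above the final bound.
    """
    find = bisect_left if upper_inclusive else bisect_right
    counts = [0] * (len(bounds) + 1)
    for v in values:
        counts[find(bounds, v)] += 1
    return counts

bounds = [1.0, 2.0, 5.0]
off_boundary = [0.5, 1.5, 3.0, 7.0]   # no value equals a bound
on_boundary = [1.0, 2.0, 5.0]         # every value equals a bound

# Values strictly between bounds land in the same bucket either way.
same_le = bucket_counts(off_boundary, bounds, upper_inclusive=True)
same_ge = bucket_counts(off_boundary, bounds, upper_inclusive=False)

# Values on a bound shift by one bucket between the two readings.
edge_le = bucket_counts(on_boundary, bounds, upper_inclusive=True)
edge_ge = bucket_counts(on_boundary, bounds, upper_inclusive=False)

print(same_le, same_ge)
print(edge_le, edge_ge)
```

Bucket counts alone do not say how many observations sat exactly on a boundary, which is why one representation cannot be converted exactly into the other after the fact.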

I would like the OTel data model for Histogram explicit boundaries to be 100% compatible with Prometheus (one of the biggest sources of this type of Histogram), so I think it is important to solve this.

/cc @jmacd @open-telemetry/specs-metrics-approvers

alolita commented 3 years ago

@bogdandrutu notes that changing this to be compatible with the Prometheus implementation may make it incompatible with Stackdriver (which uses the OpenCensus implementation). This would be a breaking change.

@jsuereth will confirm and comment on this issue.

jdmontana commented 3 years ago

I work on Google's monitoring backend. I am of the opinion that while it is true the buckets represented by GEQ or LEQ bounds are mathematically different, in practice the differences are not that important and will affect only a tiny number of users, which is the same opinion I presented in a metrics SIG meeting a couple months ago.

The users I've come across who I would expect to be affected by which bound is inclusive would almost all be those who use histogram buckets as if they were separate counters for an exact count at the lower bound - or presumably upper bound, in the case of Prometheus - but this is rare, and I imagine is easily solvable for almost all these users by either rejiggering their buckets or slightly updating the queries they use. Other than that, I have never come across anyone depending on our backend with a level of precision that requires distinguishing whether an edge of a bucket is inclusive or exclusive (e.g., insofar as any user might be invested in whether they receive alerts for a 95th percentile latency of 2.000000 ms vs 2.000001 ms, they are already greatly exceeding that level of uncertainty from a lot of different sources). Functionally this means they are mostly interchangeable and I am amenable to whichever is likely to lead to wider adoption/acceptance.
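
As a rough illustration of the precision argument, the sketch below buckets the same synthetic latencies under both conventions and estimates the 95th percentile by linear interpolation within the bucket. Everything here is an assumption made for illustration: the helper names, the bounds, the data, and the interpolation rule, which is a simplified stand-in rather than any backend's actual algorithm.

```python
from bisect import bisect_left, bisect_right

def bucket_counts(values, bounds, upper_inclusive):
    # bisect_left gives (lo, hi] buckets; bisect_right gives [lo, hi) buckets.
    find = bisect_left if upper_inclusive else bisect_right
    counts = [0] * (len(bounds) + 1)
    for v in values:
        counts[find(bounds, v)] += 1
    return counts

def estimate_quantile(q, counts, bounds):
    """Interpolate linearly inside the bucket that contains the target rank."""
    rank = q * sum(counts)
    cumulative = 0
    for i, c in enumerate(counts):
        if c > 0 and cumulative + c >= rank:
            if i == len(bounds):          # overflow bucket has no upper bound
                return bounds[-1]
            lower = bounds[i - 1] if i > 0 else 0.0  # assumes non-negative values
            return lower + (bounds[i] - lower) * (rank - cumulative) / c
        cumulative += c
    return bounds[-1]

# Synthetic data: 1..100 ms, so five samples sit exactly on a boundary.
bounds = [10.0, 25.0, 50.0, 75.0, 100.0]
latencies_ms = [float(v) for v in range(1, 101)]

p95_le = estimate_quantile(0.95, bucket_counts(latencies_ms, bounds, True), bounds)
p95_ge = estimate_quantile(0.95, bucket_counts(latencies_ms, bounds, False), bounds)
print(p95_le, p95_ge)
```

On this data the two estimates land in the same bucket and differ by far less than the 25 ms bucket width, which is the kind of uncertainty the interpolation already carries regardless of which edge is inclusive.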

open-telemetry / opentelemetry-proto

Explicit bounds incompatible with Prometheus #258