open-telemetry / opentelemetry-proto

OpenTelemetry protocol (OTLP) specification and Protobuf definitions
https://opentelemetry.io/docs/specs/otlp/
Apache License 2.0

Explicit bounds incompatible with Prometheus #258

Closed bogdandrutu closed 3 years ago

bogdandrutu commented 3 years ago

From @richih: “Prometheus uses le, OTel uses ge bounded buckets. The two are mathematically incompatible and impossible to transform from one into the other.”
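To make the distinction in the quote concrete, here is a minimal sketch of the two bucketing conventions, assuming that "le" means each bucket includes its upper bound and "ge" means each bucket includes its lower bound. The function names (`bucket_index_le`, `bucket_index_ge`, `bucket_counts`) and the sample bounds and values are hypothetical, chosen only for illustration; this is not code from Prometheus or OTel.

```python
from bisect import bisect_left, bisect_right


def bucket_index_le(bounds, value):
    """Upper-inclusive ("le"): bucket i covers (bounds[i-1], bounds[i]]."""
    return bisect_left(bounds, value)


def bucket_index_ge(bounds, value):
    """Lower-inclusive ("ge"): bucket i covers [bounds[i-1], bounds[i])."""
    return bisect_right(bounds, value)


def bucket_counts(bounds, values, index_fn):
    """Count observations per bucket; the last bucket is the overflow bucket."""
    counts = [0] * (len(bounds) + 1)
    for v in values:
        counts[index_fn(bounds, v)] += 1
    return counts


# Hypothetical explicit bounds and observations (illustration only).
bounds = [1.0, 2.0, 5.0]
values = [0.5, 1.0, 1.5, 2.0, 2.0, 7.0]

le_counts = bucket_counts(bounds, values, bucket_index_le)
ge_counts = bucket_counts(bounds, values, bucket_index_ge)

print("le:", le_counts)
print("ge:", ge_counts)
```

Only the observations that fall exactly on a boundary (1.0 and 2.0 here) land in different buckets; everything else is counted identically. Once the counts are aggregated, the on-boundary observations can no longer be told apart from the rest, which is why one set of counts cannot be converted exactly into the other.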

I think this can be fixed in-place or via a deprecation:

I would like the OTel data model for Histogram explicit boundaries to be 100% compatible with Prometheus (one of the biggest sources of this type of Histogram), so I think it is important to solve this.

/cc @jmacd @open-telemetry/specs-metrics-approvers

alolita commented 3 years ago

@bogdandrutu notes that changing the bounds to be compatible with the Prometheus implementation may make them incompatible with Stackdriver (which uses the OpenCensus implementation). This would be a breaking change.

@jsuereth will confirm and comment on the issue.

jdmontana commented 3 years ago

I work on Google's monitoring backend. While it is true that the buckets represented by GEQ and LEQ bounds are mathematically different, in my opinion the differences are not that important in practice and will affect only a tiny number of users. I presented the same opinion in a metrics SIG meeting a couple of months ago.

The users I would expect to be affected by which bound is inclusive are almost all those who use histogram buckets as if they were separate counters for an exact count at the lower bound (or presumably the upper bound, in the case of Prometheus). This is rare, and I imagine almost all of these users could easily solve it by either rejiggering their buckets or slightly updating the queries they use. Other than that, I have never come across anyone depending on our backend with a level of precision that requires distinguishing whether an edge of a bucket is inclusive or exclusive. For example, insofar as any user might be invested in whether they receive alerts for a 95th percentile latency of 2.000000 ms vs 2.000001 ms, they already face far greater uncertainty from many other sources. Functionally, this means the two conventions are mostly interchangeable, and I am amenable to whichever is likely to lead to wider adoption/acceptance.
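One possible reading of "rejiggering their buckets" can be sketched as follows. This is my illustration, not something the commenter proposed: it assumes observations are IEEE-754 doubles and shifts each bound down to the next representable value, so that an upper-inclusive histogram assigns every double to the same bucket a lower-inclusive histogram would. The helper name `shift_bounds_down` and the sample data are hypothetical, and `math.nextafter` requires Python 3.9 or later.

```python
import math
from bisect import bisect_left, bisect_right


def shift_bounds_down(bounds):
    """Move each bound to the next representable double below it."""
    return [math.nextafter(b, -math.inf) for b in bounds]


# Hypothetical explicit bounds and observations (illustration only).
bounds = [1.0, 2.0, 5.0]
values = [0.5, 1.0, 1.5, 2.0, 7.0]

# Lower-inclusive assignment with the original bounds: [b[i-1], b[i]).
lower_inclusive = [bisect_right(bounds, v) for v in values]

# Upper-inclusive assignment with the shifted bounds: (b'[i-1], b'[i]].
shifted = shift_bounds_down(bounds)
upper_inclusive_shifted = [bisect_left(shifted, v) for v in values]

print(lower_inclusive)
print(upper_inclusive_shifted)
```

The two assignments agree for every double, including values exactly on an original boundary. The cost is that the reported bounds are no longer round numbers, so this trades exactness at the edges for slightly awkward bucket labels.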