open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0
3.69k stars 884 forks source link

HistogramExemplarReservoir and performance impact #3953

Open cijothomas opened 5 months ago

cijothomas commented 5 months ago

https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#alignedhistogrambucketexemplarreservoir has the wording "This implementation MUST keep the last seen measurement that falls within a histogram bucket". Consider the example where Delta temporarily is used. What is the reasoning behind preferring the "last-seen" measurement within a histogram bucket? In other words, would "first-seen" be as good? (I am only talking about Delta temporality).

From a performance standpoint, "last-seen" causes additional work, without any additional value. Specifically, "last-seen" means the Reservoir must create a new Exemplar and overwrite it in the existing Exemplar slot. Overwriting itself might be okay, but creating a new Exemplar requires memory allocation for value,traceid,spanid,timestamp/dropped-attributes. It is entirely possible that value,traceid,spanid already exists in memory, so less of a concern, but timestamp has to be generated (system call + memory allocation most likely), and possibly dropped-attributes. Also, "first-seen", allows the possibility of further perf optimizations in sdk implementations. For example, if an Exemplar is already recorded in an interval, the SDK can short-circuit a lot, knowing the slot is already filled in, and no longer have to worry about Exemplar until next collection.

Proposal: Could we change the algorithm to use "first-seen" for Delta Aggregation, and keep current (last-seen) for Cumulative.

Alternate: If timestamp is the only perf concern, could we relax the wording on the time. It currently says "The time the API call was made to record a Measurement", but could we relax it, so that implementations can pick the endtime instead, and avoid the cost of obtaining time in hot path?

Related discussion (Cumulative focused): https://github.com/open-telemetry/opentelemetry-specification/issues/3891

Because of https://github.com/open-telemetry/opentelemetry-specification/pull/3943, I don't think this should block https://github.com/open-telemetry/opentelemetry-specification/pull/3870, but I'd love to see if we can change defaults before declaring stable.!

jsuereth commented 5 months ago

Regarding "last-seen" this was for compatibility with existing prometheus histogram behavior (as is the reservoir-matches-bucketing). We did a lot of experimentation around exemplars and I actually think we MAY be able to alter the default to use the FixedSizeReservoir, as we get decent performance/coverage out of that. We expect to spend some time on new algorithms.

In this investigation, though, we saw pretty good performance from FixedSizedReservoir. It has less guarantees about capturing "edges" but given its nature, may get better coverage of split-means, e.g.

Regarding performance - in Java we pre-allocate cells for the reservoir, so no allocations occur in the hot path. Is this ugly in a GC language, yes. Does it matter for performance? yes. See: https://github.com/open-telemetry/opentelemetry-java/blob/main/sdk/metrics/src/main/java/io/opentelemetry/sdk/metrics/internal/exemplar/ReservoirCell.java.

Alternate: If timestamp is the only perf concern, could we relax the wording on the time. It currently says "The time the API call was made to record a Measurement", but could we relax it, so that implementations can pick the endtime instead, and avoid the cost of obtaining time in hot path?

FYI - in Java we're using a special clock designed for trace that only hits one of the system clocks to diff duration from start-of-span. It helps dramatically here.

Unfortunately, I don't think we should relax this requirement, it's one of the key values of an exemplar vs. a metric in that it can help you time-align other signals with an incident.

jsuereth commented 5 months ago

Regarding the suggestions -

cijothomas commented 5 months ago

I think a "First seen" exemplar reservoir would be a welcome addition as a built-in thing. Whether or not it's the default for DELTA histograms, I'll have to think about, and would welcome user feedback. This would NOT violate the "look like prometheus" behavior, since they do not have delta histograms.

got it! Are you looking for feedback from end-users or from the people who implement Exemplars in OTel sdks? Most of end-users would not know (or care) if the SDK defaults pick first-seen vs last-seen in delta right? If that is the case, the only feedback we need is from java/go/dotnet/other exemplar implementors?

jack-berg commented 5 months ago

@cijothomas This issue seems to be asking for a new exemplar reservoir specific to histograms, and to consider using that new reservoir as the default for histograms. This shouldn't be a blocker to exemplar stabilization because:

  1. It can be added as a new spec'd reservoir later.
  2. We can change the default reservoir as a non-breaking change.

Can you either change the title of this request / framing of this issue, or open a new issue? Thanks.

svrnm commented 1 month ago

@cijothomas please take a look at @jack-berg's comment, can you provide the required details? Thank you!