open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0
3.75k stars 889 forks source link

Configuring histograms in the API (hints?) #2229

Closed mateuszrzeszutek closed 1 year ago

mateuszrzeszutek commented 2 years ago

What are you trying to achieve?

I'm working on the micrometer->OTel bridge instrumentation in the javaagent. Some of the micrometer instruments translate to OTel concepts in a rather straightforward way, but some of them are pretty different. I had the most problems with histograms (used in e.g. micrometer Timer). Histograms in micrometer are fully configurable via the micrometer API. Micrometer supports calculating percentiles on the client side, setting histogram buckets, rotating/expiring histogram buckets, and some other options that I don't fully comprehend yet. Meanwhile, you can't configure histograms in OTel metrics API at all - you can use Views, but that's in the SDK; and pretty much the only thing you can configure in a view are explicit buckets. For now, in my instrumentation I can just fall back to the micrometer way of collecting histogram data (which basically is two sets of gauges with different tags for percentiles/count buckets) - but perhaps we should introduce a way to configure these things in the metrics API (hints)?

Additional context.

Micrometer bridge PR: https://github.com/open-telemetry/opentelemetry-java-instrumentation/pull/4919

CC @jsuereth

bogdandrutu commented 2 years ago

See https://github.com/open-telemetry/opentelemetry-specification/issues/2232#issuecomment-1004401666

reyang commented 2 years ago

Related to #1753 (Metrics Hint API).

weyert commented 2 years ago

The inability to define histogram bucket sizes in the API is quite a inconvenience. As the consumer of a library (e.g. express metrics, host metrics) need to be aware:

Quite a few steps when you just want to 'throw' a Express middleware which creates Express metrics like http request/response size and http duration or the hos metrics.

I made this mistake and now I am having a mismatch of bucket sizes for the same metrics in Prometheus not sure how to solve that at the moment

jsuereth commented 1 year ago

Another user request for this: https://cloud-native.slack.com/archives/C014L2KCTE3/p1669739859064699?thread_ts=1669652480.479969&cid=C014L2KCTE3

jmacd commented 1 year ago

To get Lightstep onto using the OTel APIs I had to add several off-spec metrics features features (mainly the synchronous gauge aggregation), which would ideally be expressed via hints. The powerful thing about an API-level hint is that it allows configuring things when the instrument is created as opposed to when the SDK is started. For the Lightstep metrics SDK we support configuring the exponential histogram size, e.g., https://github.com/lightstep/otel-launcher-go/tree/main/lightstep/sdk/metric#metric-instrument-hints-api

asafm commented 1 year ago

I left, and comment in Slack, and I'm also leaving it here.

When you create a bucket-based histogram, you use the API, in which you don’t specify it’s a bucket-based histogram. You have to do that via a view, which in the Java SDK seems to be done in the initialization of your app. So it seems that defining a bucket-based histogram, with the buckets you’d like, is split into two separate locations - the actual file where the logic using the histogram is and the app in a file in which you define the view and specify the name of the histogram. I’m contrasting this with how I’m used to working with metrics in DropWizard / Prometheus Simple Client and other metrics libraries in which all is done in place.

I understand the power of overriding buckets/aggregation type for a given instrument if you do not own that instrument - i.e., in a library you use in your app. Yet, as a user - be it the library writer or just you, the app developer: you just create an instrument, like a histogram, and you would like to define its aggregation in the exact location when it's created - it's the natural place for this. My only guess is that 80% of the time, you'd be doing that, and only 20% of the time you'll override the aggregation type. Moreover, as an application developer, you need to start creating constants to share the metric name, so if the metric name is changed, its aggregation type associated with it will also be matched (in the view declaration). Seem like there is incidental complexity here.

Therefore I think enabling the user to specify the aggregation type and configure it seems pretty important to exist in the API itself.

jack-berg commented 1 year ago

As I'm reading through the feedback I'm noticing that virtually all of it is related to the lack of ability to define histogram configuration options at the point of instrumentation.

What if we narrow the scope (as the PR title suggests) and instead of designing a hint API just fix the histogram instrument?

When you think about it, API users are actually already making "hints" to the SDK by their choice of instrument. By using a counter you're hinting that a sum is probably sufficient even though the SDK could keep a full histogram to track the distribution of measurements. By recording a set of attributes you're hinting at the set of dimensions that are likely to be useful, even though the SDK may drop some to reduce cardinality.

Today, histogram creation accepts name, description (optional), unit (optional), and value_type as arguments. We could add an additional optional argument called something like histogram_aggregation_options.

histogram_aggregation_options could consist of the type of histogram to use (exponential bucket or explicit bucket) as well as options for the selected type (max size for exponential, bucket boundaries for explicit).

The main question would be to determine how these options interact with the SDK's logic for determining aggregation, which involves MetricReader's default output aggregation and Views. I would suggest solving this by adjusting the definition of the Default Aggregation to incorporate the options specified by the API caller when creating the histogram instrument.

The result would be:

What do folks think? To me this seems like a pretty tractable way to provide a lot of the value, without boiling the ocean.

tsloughter commented 1 year ago

When you think about it, API users are actually already making "hints" to the SDK by their choice of instrument.

I just wanted to second this important point. There is no guarantee your histogram will even be aggregated as a histogram, adding additional option to instrument creation that is passed to the default instrument type aggregation seems clean and clear, and what people are asking for.

asafm commented 1 year ago

The result would be:

If the instrument matches one or more view's selection criteria, use the aggregation specified by the view. Else if the metric reader (or associated metric exporter) specifies a default output aggregation for the instrument type, use it. Else use the default aggregation, which incorporates the options specified by the API caller when creating the instrument.

The way I see it: Specifying it on the instrumentation level presents the default - meaning, if no other override mechanism was specified, this should be used. I guess my problem is with the next in the chain, as you mentioned: the reader. When you say, "This is the default for a certain instrumentation type", for me, it means: if no aggregation were defined, we'd default to what I'm configuring here in the reader. So I wouldn't see it as an override to what the user says but as a default.

Next is the view mechanism, which, by definition, is an override (if you use the same name, that is).

tsloughter commented 1 year ago

The reader basically defines what an instrument's aggregation is, so the defaults are what the reader considers defaults. The view is the only thing that can define a different aggregation.

asafm commented 1 year ago

One question popped into my mind: Will Hints be typed? If so, will the aggregations allowed be constrained only to the ones specified here, i.e., Explicit Bucket and Exponential Bucket? I'm asking because I was thinking of creating an SDK for my purposes which will require using the Summary type of aggregation, and I wanted to know if I can rely on the API for creating instruments for my needs.

jack-berg commented 1 year ago

I'm asking because I was thinking of creating an SDK for my purposes which will require using the Summary type of aggregation, and I wanted to know if I can rely on the API for creating instruments for my needs.

If you're referring to an aggregation that produces summary type metrics, that wouldn't be possible because there currently is no summary aggregation. However, you can produce a histogram that looks very similar to a summary by configuring an empty list of bucket boundaries, putting all measurements in a single bucket and providing: min, max, sum, count.

asafm commented 1 year ago

If you're referring to an aggregation that produces summary type metrics, that wouldn't be possible because there currently is no summary aggregation.

My thinking was creating my own SDK (my flavor, so it doesn't match the SDK spec). The protocol supports a Summary, as can also be seen in the Java SDK io.opentelemetry.sdk.metrics.data.SummaryPointData. What you suggest only support min/max and lack quantiles :)

asafm commented 1 year ago

@jack-berg How can we move your idea forward of amending the Histogram interface to have a way to configure aggregation and its options/configuration?

jack-berg commented 1 year ago

@jack-berg How can we move your idea forward of amending the Histogram interface to have a way to configure aggregation and its options/configuration?

Someone needs to open a PR to the speec with the proposed changes. Often helps to have a prototype implementation to drive discussion. There will likely be a debate on whether the histogram API can be changed in isolation or should be considered within a wider (and still ambiguous) "hint API". Assuming the PR to update the spec is merged, language SDKs can go and implement the changes. I'm interested in this topic but juggling quite a few things so can't commit on driving it on a particular timeline.

asafm commented 1 year ago

🥳 Brilliant work @jack-berg