open-telemetry / oteps

OpenTelemetry Enhancement Proposals

A proposal for SDK configurations for metric aggregation (Basic Views) #126

Closed: jkwatson closed this 4 years ago

jkwatson commented 4 years ago

This proposal is less than a full-blown Views API. The goal is just to give some flexibility to operators and exporter authors around how metrics should be aggregated by the SDK.
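
For illustration only, here is a minimal Python sketch of the kind of SDK-side knob being proposed (all names are hypothetical, not the actual OTEP or SDK API): the operator picks one aggregation per instrument, and the SDK consults that mapping when it aggregates.

```python
from dataclasses import dataclass, field
from typing import Dict


class SumAggregator:
    """Keeps a running sum of recorded values."""
    def __init__(self):
        self.value = 0

    def update(self, measurement):
        self.value += measurement


class LastValueAggregator:
    """Keeps only the most recently recorded value."""
    def __init__(self):
        self.value = None

    def update(self, measurement):
        self.value = measurement


@dataclass
class BasicViewConfig:
    # instrument name -> aggregator class the SDK should use for it
    aggregations: Dict[str, type] = field(default_factory=dict)

    def set_aggregation(self, instrument_name: str, aggregator_cls: type):
        """Operator-facing override of the default aggregation for one instrument."""
        self.aggregations[instrument_name] = aggregator_cls


# The operator configures this once, when wiring up the SDK.
config = BasicViewConfig()
config.set_aggregation("http.server.request_size", LastValueAggregator)
```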

jkwatson commented 4 years ago

@open-telemetry/specs-metrics-approvers Please take a look at this. If there are no objections to this proposal (in principle, not in the exact details), please approve, so we can merge and get to work on a more concrete specification for it.

james-bebbington commented 4 years ago

Overall, this proposal looks really good. Nice and to the point 😄

I do have some concerns, however, around not providing the ability to have multiple aggregations per instrument for GA:

It seems like it would be fairly straightforward to extend this proposal to include the ability to have multiple aggregations per instrument, and I can't see any blockers for adding that in the future. Is the main reason for not doing that now to avoid implementation complexity?

TLDR: I think I'd feel a little more comfortable if this proposal included a few more details about how this would likely be extended to support multi-aggregation use cases in the future, as I think it's something we would want to make sure gets included shortly after GA.
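
As a purely hypothetical sketch of the extension being asked about here (not something this OTEP defines), the single-aggregation mapping above could later be widened so each instrument maps to a list of aggregations:

```python
from collections import defaultdict


class MultiViewConfig:
    """Sketch of a possible future extension: several aggregations per instrument."""

    def __init__(self):
        # instrument name -> list of aggregation names to apply
        self.aggregations = defaultdict(list)

    def add_aggregation(self, instrument_name: str, aggregation: str):
        self.aggregations[instrument_name].append(aggregation)


config = MultiViewConfig()
# e.g. export request latency both as a simple count and as a distribution
config.add_aggregation("rpc.server.duration", "count")
config.add_aggregation("rpc.server.duration", "histogram")
```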


Additional Thoughts / Considerations for this OTEP:

  1. What happens if you configure two default aggregations for the same instrument? Does this raise an error, or does the second just silently take precedence? What are the trade-offs there?
  2. Do we want to provide any way for an instrumentation library, e.g. gRPC, to provide recommended default aggregations?
tsloughter commented 4 years ago

While I also still have OC metrics at work that utilize multi-aggregation on the same measure, what really struck me about the last comment was the "provide recommended default aggregations" line.

I hadn't yet thought about how I'd convert the metrics in my grpc library to OpenTelemetry.

Right now it exports default views (aggregations), and the user can choose to subscribe to specific ones in the list, to all of them, or to none: https://github.com/tsloughter/grpcbox/blob/master/src/grpcbox_oc_stats.erl#L76
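
The Erlang code linked above isn't reproduced here, but the pattern it follows can be sketched generically in Python (hypothetical names): the library ships a list of ready-made view definitions, and the user opts into some, all, or none of them.

```python
# Hypothetical default views a gRPC instrumentation library might ship.
DEFAULT_VIEWS = [
    {"name": "grpc.server.latency",       "aggregation": "distribution"},
    {"name": "grpc.server.request_count", "aggregation": "count"},
    {"name": "grpc.server.sent_bytes",    "aggregation": "distribution"},
]


def subscribe(selected=None):
    """Return the views the user opted into: a named subset, all, or none."""
    if selected is None:          # no argument -> subscribe to everything
        return list(DEFAULT_VIEWS)
    return [v for v in DEFAULT_VIEWS if v["name"] in selected]


subscribe(["grpc.server.latency"])   # just latency
subscribe()                          # all default views
subscribe([])                        # none
```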

Maybe multi-aggregation has to come in a separate PR, because it also needs to allow specifying the labels to keep per aggregation.

jkwatson commented 4 years ago

I don't know why an instrumentation library would ever want to recommend default aggregations. How can the grpc instrumentation writer know anything about backends or how the data needs to be delivered to one?

jkwatson commented 4 years ago

@james-bebbington Can you elaborate on this? I don't understand what this means, or how it applies to a single instrument in a single process.

> We have use cases at Google where being able to export multiple aggregations per instrument is critical, e.g. A data service is used by various client services. Each of these client services exports stats broken down by specific labels that they inject upstream via Correlation Context.

tsloughter commented 4 years ago

> I don't know why an instrumentation library would ever want to recommend default aggregations. How can the grpc instrumentation writer know anything about backends or how the data needs to be delivered to one?

The library author definitely knows what metrics and aggregations may be of interest to the user. The link I gave shows client and server aggregations for sizes, latencies, and number of requests.

OC has specs for default views for common protocols like HTTP, gRPC, SQL, etc.: https://github.com/census-instrumentation/opencensus-specs/blob/master/stats/gRPC.md

jkwatson commented 4 years ago

> I don't know why an instrumentation library would ever want to recommend default aggregations. How can the grpc instrumentation writer know anything about backends or how the data needs to be delivered to one?

> The library author definitely knows what metrics and aggregations may be of interest to the user. The link I gave shows client and server aggregations for sizes, latencies, and number of requests.

> OC has specs for default views for common protocols like HTTP, gRPC, SQL, etc.: https://github.com/census-instrumentation/opencensus-specs/blob/master/stats/gRPC.md

Available Instruments, sure, but aggregations are going to depend on the backend, aren't they?

tsloughter commented 4 years ago

What do you mean by backend? Aside from a user wanting to define their own distribution buckets based on their particular service's latencies, I'm not sure what you mean.

Do you mean that if a view is a histogram, the user will want/need to specify the particular histogram implementation to use?

jkwatson commented 4 years ago

The aggregation generates metrics, which get sent somewhere (the telemetry backend). Those backends support different shapes of data being sent to them. Some may support histograms, some may not. Some may support MMSC summaries, some may not. In general, the instrumentation author can't possibly know where the data will end up, and hence can't tell people what aggregations to use.

In my mental model, instrumentation uses Instruments from the API and gives them meaningful names (generally based on semantic conventions). Those instruments have default aggregations assigned to them. If the operator's telemetry backend doesn't want the defaults, then the operator should use this proposed API to change those aggregations.
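
A rough sketch of that mental model (hypothetical names; the instrument kinds and aggregation labels shown are illustrative assumptions, not spec text): each kind carries a default aggregation, and the operator only overrides specific instruments when the backend needs a different shape.

```python
# Default aggregation per instrument kind (illustrative defaults).
DEFAULT_AGGREGATION = {
    "Counter": "sum",
    "ValueRecorder": "histogram",
    "ValueObserver": "last_value",
}

# Operator-supplied overrides for specific instruments (the hook this OTEP proposes).
OVERRIDES = {
    # Backend cannot ingest histograms, so fall back to a min/max/sum/count summary.
    "http.server.duration": "mmsc_summary",
}


def aggregation_for(instrument_name: str, instrument_kind: str) -> str:
    """Pick the operator override if present, otherwise the kind's default."""
    return OVERRIDES.get(instrument_name, DEFAULT_AGGREGATION[instrument_kind])


assert aggregation_for("http.server.duration", "ValueRecorder") == "mmsc_summary"
assert aggregation_for("http.client.duration", "ValueRecorder") == "histogram"
```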

tsloughter commented 4 years ago

Oh, I took it as up to the exporter or collector to deal with metrics in shapes that backends can't handle. That might not be perfect and may still require a user to override a default to use X instead of Y, but my understanding was that the intention was to keep "what aggregations does my backend support?" from being a regular thing the user has to think about when defining an aggregation.

jkwatson commented 4 years ago

> Oh, I took it as up to the exporter or collector to deal with metrics in shapes that backends can't handle. That might not be perfect and may still require a user to override a default to use X instead of Y, but my understanding was that the intention was to keep "what aggregations does my backend support?" from being a regular thing the user has to think about when defining an aggregation.

But an exporter can't turn an MMSC summary into a histogram, or a sum into a set of quantiles, right? In addition to capabilities, the operator might say "it's really important to get lots of details on metric xyz, and I know my backend supports quantiles, so I'll configure that aggregation, with these settings". Again, this isn't something that an instrumentation author can have any knowledge of.

tsloughter commented 4 years ago

> or a sum into a set of quantiles, right?

Right, though I don't like having that in the first place :)

"I know my backend supports quantiles, so I'll configure that aggregation, with these settings". Again, this isn't something that an instrumentation author can have any knowledge of.

True, but the instrumentation author does know what aggregations are likely useful, such as a distribution of latencies per method/route, and can make this simple for the user to enable if OTel provides the right tools.

tsloughter commented 4 years ago

I don't know if I should propose a separate OTEP or what, but I'm realizing now that I think instruments and aggregations are too tightly tied together.

Thinking about how a view would just be a name, description, aggregation, label list, and instrument name, I realized that the view also needs to be where the checkpoint is defined, not the instrument.

My proposal would make this change as well as remove any aggregation from the instrument definition. Instead, the instrument kind informs the SDK of what default aggregation to use for creating a default view per instrument.

Recording a measurement to an instrument updates each view that is bound to it (preferably only the views that are also subscribed to by an exporter, but that can be an optimization for later). The accumulator checkpoints each view and the integrator does any necessary merging -- though I'm not sure the integrator and merging are actually necessary if recording to a view does the dropping of labels itself.

So if you create a counter and have the default SDK installed, it will create a view with the same name that keeps all the labels in each recording and uses a sum aggregation.
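
A compact sketch of that decoupled model (hypothetical code, assuming the view owns the aggregation state and label filtering, and the instrument only fans measurements out to its bound views):

```python
class View:
    """Owns the aggregation state and decides which labels to keep."""

    def __init__(self, name, keep_labels=None):
        self.name = name
        self.keep_labels = keep_labels   # None means keep every label
        self.state = {}                  # label set -> running sum

    def record(self, value, labels):
        if self.keep_labels is not None:
            labels = {k: v for k, v in labels.items() if k in self.keep_labels}
        key = tuple(sorted(labels.items()))
        self.state[key] = self.state.get(key, 0) + value   # sum aggregation


class Counter:
    """The instrument just fans each measurement out to its bound views."""

    def __init__(self, name):
        self.name = name
        # Default SDK behaviour in this model: one view with the same name,
        # keeping all labels and using the kind's default (sum) aggregation.
        self.views = [View(name)]

    def add(self, value, labels):
        for view in self.views:
            view.record(value, labels)


requests = Counter("rpc.server.requests")
# An extra, operator- or library-defined view that drops everything but the method.
requests.views.append(View("rpc.server.requests.by_method", keep_labels={"method"}))
requests.add(1, {"method": "GetFeature", "peer": "10.0.0.7"})
```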


Hm, thought of another reason I find changing the aggregation on a counter funny. An UpDownCounter instrument is monotonic, but what does this mean outside a sum aggregation? If it was an array aggregation and defined as monotonic, a user would expect each value to be greater than the last in the array, but that isn't necessarily the case, since the UpDownCounter is guaranteed monotonic only in the case of summing. It just seems odd, and it actually makes things more complicated if we expect to have to define aggregations that do and don't make sense depending on the chosen instrument kind.

jkwatson commented 4 years ago

@tsloughter So...it sounds like you are not willing to approve this proposal, then?

I personally don't think we should try to re-do our instrument model before GA.

tsloughter commented 4 years ago

> So...it sounds like you are not willing to approve this proposal, then?

I'm leaning that way, yea. But, my review also doesn't count towards the 4 approvals anyway :)

And yea, I don't know if anyone else would get behind my tweaks to the model. It would keep the same API in place, I think; just the SDK would change to describe updating the view's value and checkpointing the view, instead of the active instrument.

james-bebbington commented 4 years ago

> @james-bebbington Can you elaborate on this? I don't understand what this means, or how it applies to a single instrument in a single process.

I was being intentionally vague, as it was an overly complex example I was providing (and it doesn't apply to the single-process case, since it relates to using Correlation Context to pass labels around). I can give more details if you like but don't want to derail the discussion here.

I think there are probably better examples that can be found from OpenCensus, e.g. the request latency instrument is used to record both a count and a distribution (maybe not a big deal as I guess you could just have two instruments for this).

I don't think it really matters anyway as long as we agree this is a use case we want to support eventually and have a path to get there - although I am not actually an Approver :)

> I personally don't think we should try to re-do our instrument model before GA.

I do generally support this statement btw, but just want to make sure we at least have a rough plan around how this can be extended to support multiple aggregations per instrument in the future.

james-bebbington commented 4 years ago

Re the question around instrumentation plugins needing to provide default aggregations, here's an example from OC to hopefully explain my thoughts. A typical plugin records its measurements with a fairly rich set of labels, but only registers default views that aggregate over a subset of those labels.

That allows you to capture more data than most people need but not expose it by default. If a user, for example, wants to use your HTTP instrumentation library and include "path" info, they just define their own view.

Even though we don't yet have automatic label injection in OT (presumably this is coming?), I think the same still applies. You may want to write a plugin that can capture a lot of dimensions, but only export some by default - can we support that case?
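
For concreteness, a small hypothetical sketch of that case (none of these names come from the OTEP or from OpenCensus): the library records every measurement with a rich label set, ships default views over a subset, and a user adds their own view to also export the path.

```python
# Labels the hypothetical HTTP instrumentation attaches to every measurement.
RECORDED_LABELS = ["http.method", "http.status_code", "http.path", "http.host"]

# Default views the library recommends: only a subset of those labels is exported.
RECOMMENDED_VIEWS = [
    {
        "instrument": "http.server.duration",
        "aggregation": "histogram",
        "labels": ["http.method", "http.status_code"],
    },
]

# A user who also wants per-path data registers one more view themselves;
# the extra label was recorded all along, it just was not exported by default.
user_views = RECOMMENDED_VIEWS + [
    {
        "instrument": "http.server.duration",
        "aggregation": "histogram",
        "labels": ["http.method", "http.status_code", "http.path"],
    },
]
```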

Btw, I wasn't a contributor on OpenCensus so would be great if someone with more details could jump in 😄

jmacd commented 4 years ago

I believe selecting and configuring labels to inject from the distributed context should be an eventual goal of the Views API. This is something OpenCensus did support, but I was not directly involved.

jmacd commented 4 years ago

@open-telemetry/specs-metrics-approvers Please review if you have not already approved. @jkwatson and reviewers, please close issues that are resolved.

bogdandrutu commented 4 years ago

Would like to see this merged with #128

james-bebbington commented 4 years ago

Spoke with @bogdandrutu offline. I think his comments above alleviate most of the concerns I had around supporting multiple views in the future.

The only question not yet addressed was around supporting instrumentation libraries that want to provide recommended default views (i.e. default aggregations / label sets) along with the instrumented metrics. There are a couple of possible options for this:

  1. Put the functions described in this OTEP in the OT API so they can be used in libraries (adds a lot of complexity)
  2. Libraries supply a YAML file to specify recommended views that can be read by the official OT SDK
  3. Put functions to define recommended views (not to actually install them) in a separate API extension that can be consumed by OT SDKs
jkwatson commented 4 years ago

> Spoke with @bogdandrutu offline. I think his comments above alleviate most of the concerns I had around supporting multiple views in the future.
>
> The only question not yet addressed was around supporting instrumentation libraries that want to provide recommended default views (i.e. default aggregations / label sets) along with the instrumented metrics. There are a couple of possible options for this:
>
>   1. Put the functions described in this OTEP in the OT API so they can be used in libraries (adds a lot of complexity)
>   2. Libraries supply a YAML file to specify recommended views that can be read by the official OT SDK
>   3. Put functions to define recommended views (not to actually install them) in a separate API extension that can be consumed by OT SDKs

I don't want this OTEP to address the issue of API-level views. That is a separate concern, and I don't want to muddy the waters at this point.

bogdandrutu commented 4 years ago

@jkwatson how can we make sure this OTEP and #128 will get us to one place?

jkwatson commented 4 years ago

> @jkwatson how can we make sure this OTEP and #128 will get us to one place?

Fantastic question. Two options that I see:

1) Have this one merge, and have @tsloughter update #128 to assume this one exists.
2) Have @tsloughter tweak #128 to include the content here as a "phase 1" approach, and have #128 be the PR that gets merged (closing this one).

I'm totally happy either way.

tsloughter commented 4 years ago

I'm fine with either of those as well. Which one is likely unimportant, since it is the actual spec that developers looking to implement this will be relying on.

bogdandrutu commented 4 years ago

Decision during the metrics SIG was to merge this OTEP as a starting point and #128 will be changed to build on top of this.