open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0

Metrics default SDK configuration CFP #382

Open jmacd opened 4 years ago

jmacd commented 4 years ago

This issue is meant to tie together a number of loose threads about configuring a metrics SDK for some of the real-world demands that we know exist. Those are:

  1. Configuring specific "views" of a metric instrument, which implies the ability to select which metric instruments are aggregated, which set(s) of label keys they are pre-aggregated by, and which aggregation(s) are applied. It should be possible to disable metric instruments and to configure the same instrument for multiple aggregations (or multiple sets of label keys). For push-based exporters (specifically OTLP), there is a desire to configure the collection interval on a per-instrument or per-view basis.
  2. Configuring automatic extraction of correlation context for inclusion in metrics aggregation. As far as configuration goes, distributed correlations are much the same as ordinary labels, but we know that they are more expensive to implement, so it would be desirable to configure distributed correlation aggregations explicitly. We've discussed using a boolean to mark an aggregation key as non-local, meaning that it should be retrieved from the context, not from the LabelSet.
  3. We've discussed a desire to configure trace "exemplars", with two natural forms: (a) configure exporting of span context values for a given aggregation, (b) export a sample of additional labels (i.e., those in the LabelSet or the distributed correlation context that are not part of the aggregation) as exemplars. An example of part (b) would be to configure the top-K most frequent values of a label that was not used for aggregation; ideally the exemplar format would permit including both values and estimated frequencies. For example, when aggregating a sum of "request bytes" by service name, export the approximate top-10 "host" label values that contributed to each service name's sum.
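To make form (b) of item 3 concrete, here is a minimal sketch of summing a measurement by one label while tracking the most frequent values of another label as exemplars. Everything in it is hypothetical: the record shape, the `aggregate_with_exemplars` helper, and the label names are illustrations, not part of any OpenTelemetry API. It also counts exactly, where a real SDK would more likely use an approximate top-K structure.

```python
from collections import Counter, defaultdict

# Hypothetical measurement records: (labels, value).
measurements = [
    ({"service": "checkout", "host": "a"}, 100),
    ({"service": "checkout", "host": "b"}, 300),
    ({"service": "checkout", "host": "a"}, 50),
    ({"service": "search", "host": "c"}, 20),
]


def aggregate_with_exemplars(records, key, exemplar_label, k):
    """Sum values by `key`; keep the k most frequent `exemplar_label` values per group."""
    sums = defaultdict(int)
    counts = defaultdict(Counter)
    for labels, value in records:
        group = labels[key]
        sums[group] += value
        # Exact frequency count of the label that is NOT an aggregation key.
        counts[group][labels[exemplar_label]] += 1
    return {g: {"sum": sums[g], "top": counts[g].most_common(k)} for g in sums}


# Aggregate "request bytes" by service, exporting the top-10 hosts per service.
result = aggregate_with_exemplars(measurements, "service", "host", k=10)
```

Each exported group then carries both the aggregate and a list of (value, frequency) pairs, which is the shape the exemplar format would need to permit.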

Configuration should be specified in protobuf format, allowing us to reason about SDK configuration via plain code, via a configuration file, or via a network response. The configuration specification should think about whether these configurations can be changed dynamically or whether they are set once at startup.
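As a rough illustration of the "plain code, configuration file, or network response" idea, the sketch below expresses the views from item 1 as plain data. It uses Python dataclasses with JSON standing in for protobuf, which is a substitution made only to keep the example self-contained. The `ViewConfig` type and all of its field names are assumptions, not a proposed schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ViewConfig:
    """Hypothetical description of one view of a metric instrument."""
    instrument: str
    aggregation: str                          # e.g. "sum" or "histogram"
    label_keys: List[str]
    enabled: bool = True                      # False disables the instrument
    interval_seconds: Optional[float] = None  # per-view collection interval


def load_views(text: str) -> List[ViewConfig]:
    """Parse views from serialized text, as if read from a file or a response."""
    return [ViewConfig(**item) for item in json.loads(text)]


# The same instrument configured for two aggregations, plus a disabled one.
in_code = [
    ViewConfig("request_bytes", "sum", ["service"]),
    ViewConfig("request_bytes", "histogram", ["service", "host"], interval_seconds=60.0),
    ViewConfig("debug_counter", "sum", [], enabled=False),
]

# Round-trip through the serialized form to show both paths yield the same config.
serialized = json.dumps([asdict(v) for v in in_code])
from_file = load_views(serialized)
```

Whether such a configuration can be replaced at runtime or only read at startup is left open here, as it is in the text above.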

There is a separate set of concerns related to configuring metrics export within a stream of trace data. While this is also tracing SDK configuration, it touches on metrics so I'm including it in this issue.

  1. Can we use this specification to configure per-span export of metrics?
  2. Can we configure per-span export of specific metrics? For example, I'd like to include the current (average) CPU load of my process with every span. The instrumentation should not have to be modified, simply use a stateful aggregation of the CPU load and export the last value within each span.
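The CPU load case in item 2 could look roughly like the sketch below: a stateful last-value aggregator is updated by the existing instrumentation, and the span export path reads its current value. The `LastValue` class, the `end_span` function, and the `process.cpu.load` name are all hypothetical and exist only for illustration.

```python
class LastValue:
    """Stateful aggregator that remembers only the most recent observation."""

    def __init__(self):
        self.value = None

    def update(self, value):
        self.value = value


def end_span(name, aggregators):
    """Build a hypothetical exported span record carrying current metric values."""
    metrics = {
        metric: agg.value
        for metric, agg in aggregators.items()
        if agg.value is not None  # skip metrics that were never observed
    }
    return {"name": name, "metrics": metrics}


cpu_load = LastValue()
configured = {"process.cpu.load": cpu_load}

cpu_load.update(0.42)  # instrumentation reports CPU load as usual
first = end_span("GET /users", configured)

cpu_load.update(0.57)
second = end_span("GET /orders", configured)
```

The instrumentation only ever calls `update`; which metrics ride along with each span is decided entirely by the configured mapping.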

Some related topics were discussed in #259. See related issue #381. See related discussion of exemplars in the OTLP proto https://github.com/open-telemetry/opentelemetry-proto/issues/81.

jkwatson commented 4 years ago

Fantastic writeup, @jmacd. You beat me to it!

jmacd commented 4 years ago

The two issues linked above call for implementation support of the basic mechanisms described here, namely the ability to support multiple aggregations and aggregation by distributed correlations. It will be helpful to prototype the configuration mechanism to understand its feasibility.

tylerbenson commented 4 years ago

I don't want to pass judgement too early, but I have concerns about having all config defined across languages via protobuf. If we can agree that the protobuf defines a core set of configs, but each language will likely have additional config, then I think that would reduce my concern somewhat.

jkwatson commented 4 years ago

I think we should pull this into a single issue (or re-title this one), for general default SDK configuration considerations. Or, would you rather keep this one as-is, and have a separate issue that tracks specifying the general considerations?

That is, we can have a new issue that describes how the default SDK should be configurable, and separate issues for the details of what is configurable for various pieces of the SDK (metrics, traces, exporters, etc).

jmacd commented 4 years ago

I think I agree. There are already some Tracer-related items in this issue, to your point. OTOH, I've seen a desire to keep the metrics and tracing SDKs relatively separate, so they can be mixed and matched.

jkwatson commented 4 years ago

Having a common set of mechanisms for configuration across all the SDK pieces and parts seems very important to me for reducing developer (operator, etc.) cognitive load and surprise.

So perhaps one issue to track the default SDK configuration mechanisms, and then separate issues for the details of what is configurable for each of the pieces, would be a good separation.

If you're ok with that, I'll go ahead and create the "general configuration mechanisms" issue separate from this one.

jmacd commented 4 years ago

Sounds good.

jkwatson commented 4 years ago

https://github.com/open-telemetry/opentelemetry-specification/issues/390 for configuration mechanism

cijothomas commented 4 years ago

As asked in the Metrics SIG today, here is one concrete example of a scenario in which a metric instrument needs to be aggregated by multiple aggregators.

Microsoft Azure Monitor has a feature called "Live Metrics", which shows metrics such as Requests/Sec and Request Duration in near real time, with 1-second aggregation and a limited set of labels/dimensions (success, servername). The same metrics are also stored with 1-minute aggregation and more labels/dimensions (response code, url, etc.) for other metric UI experiences.

To continue providing the same feature with OpenTelemetry, we need the ability to associate metric instruments with multiple aggregators.

(In Azure Monitor, all metric update calls go through a chain of processors, where every processor receives each metric update call. One of the processors in the chain does 1-second aggregation with minimal dimensions, and the next processor does 1-minute aggregation with more dimensions.)
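The processor chain described above might be sketched as follows. This is not Azure Monitor's implementation: the `WindowedSum` class, the `record` function, and the label names are assumptions chosen to show one update feeding two aggregations with different windows and dimensions.

```python
from collections import defaultdict


class WindowedSum:
    """Hypothetical processor: sums values per time window and label subset."""

    def __init__(self, window_seconds, label_keys):
        self.window_seconds = window_seconds
        self.label_keys = label_keys
        self.totals = defaultdict(float)

    def process(self, timestamp, labels, value):
        # Bucket the update by window index, keeping only the configured labels.
        window = int(timestamp // self.window_seconds)
        key = (window,) + tuple(labels.get(k) for k in self.label_keys)
        self.totals[key] += value


# 1-second aggregation with minimal dimensions, 1-minute with more dimensions.
live = WindowedSum(1, ["success"])
stored = WindowedSum(60, ["success", "response_code", "url"])
chain = [live, stored]


def record(timestamp, labels, value):
    """Every processor in the chain sees every metric update."""
    for processor in chain:
        processor.process(timestamp, labels, value)


record(0.2, {"success": True, "response_code": 200, "url": "/a"}, 1)
record(0.7, {"success": True, "response_code": 200, "url": "/b"}, 1)
record(1.5, {"success": True, "response_code": 200, "url": "/a"}, 1)
```

After these three updates, `live` holds two 1-second buckets keyed only by `success`, while `stored` holds one 1-minute bucket per distinct (success, response_code, url) combination.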

jmacd commented 4 years ago

197 requests a way to configure metric reporting intervals independently.

jmacd commented 4 years ago

See https://github.com/open-telemetry/opentelemetry-proto/pull/155