open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0
3.75k stars 889 forks source link

ability to capture all metric raw data without aggregation #617

Open mxiamxia opened 4 years ago

mxiamxia commented 4 years ago

In OT Metric API Instruments Spec, we defined that each type of metric instrument should be binding with a default aggregation behavior / aggregator. In AWS CloudWatch, we support to capture all raw metric measurement events and assemble it as Embedded Metric Format(EMF) in a structured log and so CloudWatch backend can process all raw metric data and provide all metrics visualization capabilities.

We'd like to implement such Exporters for SDK and Collector to support EMF metric format but it requires that we can capture all the raw metric events without aggregating. I see we have Views API WIP to allow customers customizing aggregator. We do want to start the work on this ASAP and looking for the suggestions here if there is a way we can capture all the raw metrics event data(especially not SummaryDataPoint) and pass it to OT collector so we can bring in AWS exporters to support it? Thanks.

cijothomas commented 4 years ago

I believe this would be part of SDK spec. I am also interested in getting a hook in the Metric SDK to get raw metric events and also aggregated metric values. For Traces, the SpanProcessors are the way to get access to Spans. For Metrics, we may need MetricProcessor (which gets metrics after aggregation), and Aggregator (which gets raw metrics)

My use case for this is different: I have a need where same metric need to be aggregated more than once, with different labelsets, and with different aggregation interval too. View may help with former, but not the latter.

Having hooks to get raw events( similar to aggregator) and aggregated events (MetricProcessor) will help me achieve my scenario as well.

jmacd commented 4 years ago

The Aggregator interface that is currently described in the draft SDK specification and in the OTel-Go library does receive raw events. An aggregator that handles raw events and does not compute checkpoints could be referred to as a "pass-through" aggregator. This has been requested, but no prototypes exist.

The metrics SDK describes an "Integrator" which is like the span processor. The integrator receives checkpoints output by the aggregators (one per distinct label set). I believe it is possible to compute multiple aggregations as described without a pass-through aggregator. To achieve multiple aggregations for different label sets, one checkpoint (for one label set) will be merged into multiple output aggregations according to the labels. To achieve multiple intervals, combine shorter aggregations into longer aggregations.

Example: counter x.y.z has labels "a" and "b". Over one collection interval, multiple events are aggregated corresponding to the distinct label sets, leading to intermediate checkpoints:

x.y.z{a=1,b=1} = 250 x.y.z{a=1,b=0} = 250 x.y.z{a=0,b=1} = 250 x.y.z{a=0,b=0} = 250

The collection interval has four aggregators, each with a count of 250 and a distinct label set.

To expose an aggregation over "a", dropping "b", combine two pairs: x.y.z{a=1} = 500 x.y.z{a=0} = 500

Likewise, the aggregation over "b", dropping "a", combine two pairs: x.y.z{b=1} = 500 x.y.z{b=0} = 500

Now suppose this covers a 1 second interval. You'd like to expose a 1 minute interval. Now take the output from the previous step, and combine 60 consecutive intervals together.

As demonstrated (I hope), you shouldn't need a pass-through aggregator to perform multiple aggregations--your goals can be accomplished by merging aggregator checkpoints I think.

The request for a pass-through aggregator would make more sense in an aggregator that computes something derived from time and is not window-oriented, like computing an exponential moving average over the last 5 minutes (regardless of interval).

jmacd commented 4 years ago

@cijothomas Mostly my response above was for your statement. @mxiamxia Given your requirements, it would be perhaps a little simpler to implement your own SDK, since it will do work that you apparently do not want done. I am imagining a simple SDK that just emits a structured log record for every metric event. The default SDK has a lot of machinery for coalescing events that you will not need, so it would be additional expense to use the default SDK.

mxiamxia commented 4 years ago

@cijothomas Mostly my response above was for your statement. @mxiamxia Given your requirements, it would be perhaps a little simpler to implement your own SDK, since it will do work that you apparently do not want done. I am imagining a simple SDK that just emits a structured log record for every metric event. The default SDK has a lot of machinery for coalescing events that you will not need, so it would be additional expense to use the default SDK.

Thanks @jmacd. If we want to have it work with OTel Collector, I assume there is no proper way that we can send the metrics with structured logs in the current metrics data model until Logs data model spec is ready for Collector.

I looked Metrics SDK Draft Spec, if I get it right will the new aggregator Exact be able to export all the raw event data with quantile in a collection interval?

Exact: This aggregator MUST store an array of all recorded measurements. This aggregator MUST support exact quantile computations and it MUST support exporting raw values in the order they were recorded, however it is not required to support both of these modes simultaneously (since computing quantiles requires sorting the measurements).

mxiamxia commented 4 years ago

// TODO: Decide if support for RawMeasurements (measurements recorded using // the synchronous instruments) is necessary. It can be used to delegate the // aggregation from the application to the agent/collector. See // https://github.com/open-telemetry/opentelemetry-specification/issues/617

It will be very nice If we can get the synchronous instruments for raw measurements support in GA. So it will help us to fully support EMF metrics with logs format in the collector and it is also possible the solution for us to support AWS Lambda customers with OT integration

bogdandrutu commented 4 years ago

@jmacd are you ok to support this for GA?

jmacd commented 4 years ago

I am conflicted about a few things here. We have @cnnradams working on an exemplars prototype, and I encouraged him to go a little further than simply capturing trace/span IDs in histogram buckets. As far as exemplars go, we can assert that there are sensible reasons to include exemplars of any measurement when there has been dimensionality reduction, and that we can include statistical weight to make such exemplars more useful. That said, not many consumers of this statistical sample data are asking for this, and there is not a widespread belief that we need this. What's the minimum viable product here? If we just fall back to trace_id and span_id, would we include these in the data points directly? Should we be limited to only one trace_id/span_id per data point? Should we try to include the removed labels when exporting exemplars? Do we care about statistical exemplars?

All of these questions come up when I think about how to represent raw values, since exemplars are practically the same where the statistical count of each exemplar is 1. There has been some debate about how such values should appear in OTLP: there is a concern that raw values will tend to repeat their label sets and that the current shape of the Metric data point does not offer a way to repeatedly use label sets. Does it matter if every raw data point copies the effective label set, or is it important to combine raw values into a group and list their label set only once? If we don't care about this performance/cost concern, then it's stragithforward to represent raw values and exemplars as a repeated RawValue in the Metric, alongside the scalar, Histogram, and Summary repeated values. In this case, if the data point is actually raw data, then the value Type field should reflect it (e.g., as RAW_INT64 or RAW_DOUBLE), whereas if the data point is actually another value type and the raw values are merely exemplars, then the value type should indicate a different type.

@bogdandrutu and @open-telemetry/specs-metrics-approvers please consider. I don't want this issue to block OTLP, but I want to give @cnnradams a target he can hit at the same time.

MrAlias commented 4 years ago

What's the minimum viable product here?

I think this is the important question here.

Does it matter if every raw data point copies the effective label set, or is it important to combine raw values into a group and list their label set only once?

For an MVP, I don't think so. It seems like from what @mxiamxia is asking for it would be satisfied with what you described:

If we don't care about this performance/cost concern, then it's stragithforward to represent raw values and exemplars as a repeated RawValue

It seems to me that if you are asking for "all the data" from the SDK you likely already assume it to be sent in a non-optimized format. My guess is this will be seen as a benefit as the users is looking for full control in all statistical sampling decisions here.

As for trace info:

If we just fall back to trace_id and span_id, would we include these in the data points directly? Should we be limited to only one trace_id/span_id per data point?

Could this be added afterwards in a backwards compatible way? I think it would be useful, but if it blocks the MVP I don't see it being that useful.

bogdandrutu commented 4 years ago

@MrAlias in the protocol we can add any new data type in a backwards compatible way (there is a need to deploy the collector first, but that is true for any distributed system). In terms of the SDK, this is another aggregation type "AllRawValues" so I don't see why cannot be added later.

MrAlias commented 4 years ago

@MrAlias in the protocol we can add any new data type in a backwards compatible way (there is a need to deploy the collector first, but that is true for any distributed system). In terms of the SDK, this is another aggregation type "AllRawValues" so I don't see why cannot be added later.

These are all good points.

IIRC AWS engineers were asking for this. I feel like they should probably be a part of the conversation.

cc @alolita (not sure who else to ping here).

jsuereth commented 3 years ago

From metrics data model sig:

reyang commented 3 years ago

The SDK spec allows processors to have access to the raw data with this PR #1673. Removing the area:sdk tag and leave it for the data model / OTLP.

ramonguiu commented 2 years ago

There is an interesting discussion in CNCF community slack. Some folks have shared some potential workarounds for now. Linking it here in case if may be useful to readers of this issue.

jamielennox commented 1 year ago

The majority of discussion around this happened long before the stabilization of the metrics APIs and the OTLP protocol for metrics. Is there an appetite now for instating something like the Exact (the old name, i don't mind what it's called) aggregator and equivalent OTLP format now and how would we go about it?

At it's core our metrics system doesn't do bucketted histograms and so we're blocked on exporting OTLP from the SDK today which is very limiting. Exporting the complete measurement array is the best way to get the full data into the collector which allows us to write the export in only one place.

reyang commented 1 year ago

I feel this is interesting, I do understand the need, but I don't see a high demand/priority.

@jmacd @jsuereth I know you're busy with other workstreams, just curious if you think this can be achieved with either one of the following approaches?

  1. Hacky way - folks can always write a custom ExemplarFilter and grab all the raw measurements, then do whatever they want.
  2. A bit more systematic - maybe we should introduce the "MeasurementProcessor" concept, such processor will see all the measurements reported by sync/async APIs (metrics via producer API is not flowing through this path). All the existing aggregations and exemplars can be modeled as special types of MeasurementProcessor.
MatisseHack commented 1 year ago

I would really love to see this feature come to fruition! My team maintains a service with a very bursty usage pattern and we loose a lot of fidelity in our metrics because there is no way to turn off aggregation.

Similar to AWS CloudWatch, Azure Application Insights has powerful tools to query and visualize raw metric data. I just wish I could get all my metric data there more easily.