open-telemetry / opentelemetry-go

OpenTelemetry Go API and SDK
https://opentelemetry.io/docs/languages/go
Apache License 2.0
5.29k stars 1.08k forks source link

Clarify how AWS CloudWatch integrates with opentelemetry-go #83

Closed cep21 closed 4 years ago

cep21 commented 5 years ago

Hi,

I tried writing a cloudwatch exporter for open census and the API there is not rich enough to do CloudWatch metrics well. I document some of the reasons on the README.md at https://github.com/cep21/cwopencensusexporter

I see open telemetry is based upon open census. Will it have the same problems, or will it work well with CloudWatch?

I've looked over a few metrics APIs and it seems like segment.io's handler is expressive enough for cloudwatch metrics because it lets me handle measures higher up the API stack at https://github.com/segmentio/stats/blob/master/handler.go#L15. Will open telemetry allow the same?

Thanks!

rghetia commented 5 years ago

@cep21 thanks for opening this issue. There are two distinct requirements here.

  1. Fixed (but configurable) Reporting Interval.
  2. Report min/max over Reporting Interval instead of life of the metric. This is dependent on the first requirement.

First requirement is strictly how SDK implements it. While the second would require changes to metrics proto to accommodate min/max.

cep21 commented 5 years ago
  1. Fixed (but configurable) Reporting Interval.

It's a bit more complicated than that. A fixed reporting interval is something that will kind of work but not strictly work. To strictly support CloudWatch with as much accuracy as possible, I would need to ensure a datapoint received at 59.999 goes to window 0 and a datapoint received at 60.0001 goes to window 1. This is possible with APIs similar to segment's, but use of open telemetry would sacrifice strict correctness, which is a difficult ask for people that are looking for a metrics abstraction: comparing "good enough" metrics with "strictly correct" metrics.

We need to aggregate by when the metrics come in, not when the code happens to report them.

two distinct requirements

There's another part about preventing pre aggregation with types like metricdata.Summary. I cannot provide a good user experience for this. With an API like segment's I can ignore metricdata.Summary by aggregating on ingestion.

rghetia commented 5 years ago

Thanks for clarifying. It is 'Fixed Aggregation Interval' rather than 'Fixed Reporting Interval'.

Regarding Summary - can you clarify what do you mean by 'ignore metricdata.Summary by aggregating on ingestion.' ?

cep21 commented 5 years ago

can you clarify what do you mean by 'ignore metricdata.Summary by aggregating on ingestion

metricdata.Summary is a lossy aggregation for CloudWatch. A CloudWatch user would never want to aggregate like that. They would prefer aggregating into a distribution so real percentiles can be calculated on the fly. One way to do this is to simply not allow data to ever get into a metricdata.Summary type and to aggregate it on ingestion into a distribution.

jmacd commented 5 years ago

I wrote in Gitter:

The point you raise in issue 83 is very relevant to the SDK-side of the OTel project, which like OC is tailored to the Prometheus model. Other metrics exporters, such as statsd and as you point out, CloudWatch, cannot easily or faithfully use the currently specified default exporter for metrics. We discussed this in the meeting, how some vendors would also like to interpret metrics API calls as events, to be recorded as raw data. We believe the API we're specifying won't limit these implementations, but they'll require different SDK behavior, i.e., cannot be done as exporters. An open question is whether the first release of open telemetry libraries should aim to allow exporting metrics as events (like statsd), along strict time boundaries (like CloudWatch), in addition to the current exporter specification (along variable time boundaries, like Prometheus).

cep21 commented 5 years ago

open question is whether the first release of open telemetry libraries should aim to allow

I feel strongly that the first version should be similar to segment's API, but using terms from open telemetry https://github.com/segmentio/stats/blob/v4.4.1/handler.go#L7. I'll paste it below

type Handler interface{
HandleMeasures(time time.Time, measures ...Measure)
}

People then implement exporters that implement that API. That solves all implementations.

Is it possible to get a conversation around why that is a bad proposal?

rghetia commented 4 years ago

stale issue.