open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Datadog exporter panics and OTel Collector crashes for empty ExponentialHistogram #26103

Closed lenin-jaganathan closed 1 year ago

lenin-jaganathan commented 1 year ago

Component(s)

exporter/datadog

What happened?

Description

When an ExponentialHistogram datapoint with empty buckets is sent, the Datadog exporter panics and crashes the OTel Collector (using the latest collector-contrib, v0.83.0).

Steps to Reproduce

Export a metric datapoint with the following information:

Metric #0
Descriptor:
     -> Name: test_exponential_histogram
     -> Description: Test histogram with empty values
     -> Unit: byte
     -> DataType: ExponentialHistogram
     -> AggregationTemporality: Delta
ExponentialHistogramDataPoints #0
StartTimestamp: 2023-08-25 10:41:00 +0000 UTC
Timestamp: 2023-08-25 10:41:10 +0000 UTC
Count: 0
Sum: 0.000000
Max: 0.000000
Min: 0.000000
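
For reference, such a datapoint can be constructed with the collector's pdata API. A minimal sketch, mirroring the dump above (the helper name buildEmptyExponentialHistogram is illustrative, not taken from the report):

```go
package repro

import (
	"time"

	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/pmetric"
)

// buildEmptyExponentialHistogram builds a metrics payload containing a single
// delta-temporality ExponentialHistogram datapoint with zero count and no
// buckets, mirroring the dump above.
func buildEmptyExponentialHistogram() pmetric.Metrics {
	md := pmetric.NewMetrics()
	m := md.ResourceMetrics().AppendEmpty().ScopeMetrics().AppendEmpty().Metrics().AppendEmpty()
	m.SetName("test_exponential_histogram")
	m.SetDescription("Test histogram with empty values")
	m.SetUnit("byte")

	eh := m.SetEmptyExponentialHistogram()
	eh.SetAggregationTemporality(pmetric.AggregationTemporalityDelta)

	dp := eh.DataPoints().AppendEmpty()
	dp.SetStartTimestamp(pcommon.NewTimestampFromTime(time.Now().Add(-10 * time.Second)))
	dp.SetTimestamp(pcommon.NewTimestampFromTime(time.Now()))
	dp.SetCount(0) // no observations: positive/negative buckets stay empty
	dp.SetSum(0)
	dp.SetMin(0)
	dp.SetMax(0)
	return md
}
```

Pushing this payload through the Datadog exporter triggers the panic shown under Actual Result.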

Expected Result

The datapoint should either be sent to the backend with value 0 or be dropped gracefully.

Actual Result

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x3746eac]

goroutine 110 [running]:
github.com/DataDog/opentelemetry-mapping-go/pkg/otlp/metrics.(*Translator).mapExponentialHistogramMetrics(0xc0004cbe30, {0x922d858, 0xc000d4b5c0}, {0x9215dc0, 0xc000d74ea0}, 0xc0011f30c8, {0x0?}, 0x1)
        github.com/DataDog/opentelemetry-mapping-go/pkg/otlp/metrics@v0.7.0/exponential_histograms_translator.go:158 +0xe2c
github.com/DataDog/opentelemetry-mapping-go/pkg/otlp/metrics.(*Translator).mapToDDFormat(0xc0004cbe30, {0x922d858, 0xc000d4b5c0}, {0x1016350?}, {0x9215dc0?, 0xc000d74ea0?}, {0xc000d9e440, 0x1, 0x4}, {0xc00069d1a0, ...}, ...)
        github.com/DataDog/opentelemetry-mapping-go/pkg/otlp/metrics@v0.7.0/metrics_translator.go:774 +0xc68
github.com/DataDog/opentelemetry-mapping-go/pkg/otlp/metrics.(*Translator).MapMetrics(0xc0004cbe30, {0x922d858, 0xc000d4b5c0}, {0x0?}, {0x9215dc0, 0xc000d74ea0?})
        github.com/DataDog/opentelemetry-mapping-go/pkg/otlp/metrics@v0.7.0/metrics_translator.go:716 +0x97c
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter.(*metricsExporter).PushMetricsData(0xc000c85d40, {0x922d858, 0xc000d4b5c0}, {0xc000de4e10?})
        github.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter@v0.83.0/metrics_exporter.go:208 +0x20d
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter.(*metricsExporter).PushMetricsDataScrubbed(0xc000c85d40, {0x922d858?, 0xc000d4b5c0?}, {0x0?})
        github.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter@v0.83.0/metrics_exporter.go:181 +0x28
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsRequest).Export(0x0?, {0x922d858?, 0xc000d4b5c0?})
        go.opentelemetry.io/collector/exporter@v0.83.0/exporterhelper/metrics.go:54 +0x34
go.opentelemetry.io/collector/exporter/exporterhelper.(*timeoutSender).send(0xc000ee2018, {0x9256d30, 0xc000e03560})
        go.opentelemetry.io/collector/exporter@v0.83.0/exporterhelper/common.go:197 +0x96
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send(0xc000ede460, {0x9256d30?, 0xc000e03560?})
        go.opentelemetry.io/collector/exporter@v0.83.0/exporterhelper/queued_retry.go:355 +0x190
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send(0xc000c8ade0, {0x9256d30, 0xc000e03560})
        go.opentelemetry.io/collector/exporter@v0.83.0/exporterhelper/metrics.go:125 +0x88
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1({0x9256d30, 0xc000e03560})
        go.opentelemetry.io/collector/exporter@v0.83.0/exporterhelper/queued_retry.go:195 +0x39
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func1()
        go.opentelemetry.io/collector/exporter@v0.83.0/exporterhelper/internal/bounded_memory_queue.go:47 +0xb6
created by go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers
        go.opentelemetry.io/collector/exporter@v0.83.0/exporterhelper/internal/bounded_memory_queue.go:42 +0x45

Collector version

0.83.0

Environment information

Environment

OS: macOS/Ubuntu

OpenTelemetry Collector configuration

extensions:
  health_check:
  zpages:
    endpoint: 0.0.0.0:55679
receivers:
  otlp:
    protocols:
      grpc:
      http:
  zipkin:
processors:
  batch:
  memory_limiter:
    check_interval: 1s
    limit_mib: 8000
    spike_limit_mib: 2000
exporters:
  logging:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 10
  datadog:
    hostname: somehost
    metrics:
      endpoint: https://somehost
    api:
      key: <key>
    tls:
      insecure_skip_verify: true
service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging, datadog]
  extensions: [health_check, zpages]

Log output

No response

Additional context

No response

github-actions[bot] commented 1 year ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

mx-psi commented 1 year ago

Thanks for the report @lenin-jaganathan, assigning to me to fix this by dropping empty points.

lenin-jaganathan commented 1 year ago

I am not sure how this generally works in Datadog (or what OpenTelemetry suggests).

Do we need to drop the zero-valued datapoints or ingest them as 0s?

mx-psi commented 1 year ago

I am not sure how this generally works in Datadog (or what OpenTelemetry suggests).

For Datadog distributions coming from DogStatsD, dropping would be the closest behavior (IIUC the client library just doesn't produce a point in this case). In the case of OpenTelemetry, I don't think there is a prescription as to what to do here: empty datapoints are supported for OTLP Histograms (and, based on the wording of the spec, I would guess also for OTLP Exponential Histograms), but the spec generally does not give guidance on how to map its types other than for well-known systems such as Prometheus.

Do we need to drop the zero-valued datapoints or ingest them as 0s?

I believe at present we have no choice but to drop the point. The Datadog distributions payload our intake accepts has an 'average' field, which would have an indeterminate value ($\frac{0}{0}$) for this kind of empty point. We may be able to change the payload eventually, but I want to provide a shorter-term solution since a crash is pretty bad.
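
For illustration, a minimal sketch of the kind of guard such a fix implies (a hypothetical helper, not the actual opentelemetry-mapping-go code):

```go
import "go.opentelemetry.io/collector/pdata/pmetric"

// skipEmptyExponentialHistogramPoint reports whether an exponential histogram
// datapoint carries no observations. Mapping such a point to a Datadog
// distribution would require an average of sum/count = 0/0, so a fix along
// these lines drops the point before translation instead of dereferencing
// its (empty) buckets.
func skipEmptyExponentialHistogramPoint(dp pmetric.ExponentialHistogramDataPoint) bool {
	return dp.Count() == 0
}
```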

lenin-jaganathan commented 1 year ago

I want to provide a shorter-term solution since a crash is pretty bad.

Full agreement here.

In the case of OpenTelemetry, I don't think there is a prescription as to what to do here

I would hope so, since this is a consumer-side concern. I brought it up because I have seen some back-ends use different aggregation windows, and some have problems with non-continuous data streams. But if Datadog can handle that, there is no real concern here.

mx-psi commented 1 year ago

I brought it up because I have seen some back-ends use different aggregation windows, and some have problems with non-continuous data streams. But if Datadog can handle that, there is no real concern here.

I don't think these would be an issue in Datadog's case, but I am open to feedback if you find any issues when trying out the fixed version.

mx-psi commented 1 year ago

DataDog/opentelemetry-mapping-go/pull/158 fixes this; my intention is for this to be available on ~~v0.84.0~~ v0.85.0.

lenin-jaganathan commented 1 year ago

@mx-psi Did this not make the cut for 0.84?

mx-psi commented 1 year ago

Sorry, v0.85.0 :facepalm:. It was already too late for v0.84.0 when I wrote that message.

mx-psi commented 1 year ago

After merging https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/26355 I can confirm the fix will be available in v0.85.0. Apologies for the confusion in my previous message.