kumahq / kuma

🐻 The multi-zone service mesh for containers, Kubernetes and VMs. Built with Envoy. CNCF Sandbox Project.
https://kuma.io/install
Apache License 2.0

Otel Exporter panics after a few minutes, complaining about invalid metrics #9336

Open slonka opened 6 months ago

slonka commented 6 months ago

What happened?

Not sure if the fault is on our side or in the Datadog exporter / OTel mapping (opentelemetry-mapping-go).

panic: runtime error: index out of range [0] with length 0

goroutine 450 [running]:
github.com/DataDog/opentelemetry-mapping-go/pkg/quantile.(*Agent).InsertInterpolate(0xc001deaf58, 0x414b774000000000, 0x3fe0000000000000, 0x0)
    github.com/DataDog/opentelemetry-mapping-go/pkg/quantile@v0.13.2/agent.go:94 +0x4b4
github.com/DataDog/opentelemetry-mapping-go/pkg/otlp/metrics.(*Translator).getSketchBuckets(0xc002aefb90, {0x911ee78, 0xc002e9d7a0}, {0x7dc81df15470, 0xc001d2e540}, 0xc0020af5c0, {0xc003420c60?, 0xc00206a240?}, {0x0, 0x0, ...}, ...)
    github.com/DataDog/opentelemetry-mapping-go/pkg/otlp/metrics@v0.13.2/metrics_translator.go:351 +0xaf5
github.com/DataDog/opentelemetry-mapping-go/pkg/otlp/metrics.(*Translator).mapHistogramMetrics(0xc002aefb90, {0x911ee78, 0xc002e9d7a0}, {0x90fc310, 0xc001d2e540}, 0x5b3a2273746e696f?, {0xc002149580?, 0xc00206a240?}, 0x0)
    github.com/DataDog/opentelemetry-mapping-go/pkg/otlp/metrics@v0.13.2/metrics_translator.go:515 +0x7c7
github.com/DataDog/opentelemetry-mapping-go/pkg/otlp/metrics.(*Translator).mapToDDFormat(0xc002aefb90, {0x911ee78, 0xc002e9d7a0}, {0xc0024b2640?, 0xc00206a240?}, {0x90fc310?, 0xc001d2e540?}, {0xc001bc6580, 0x1, 0x4}, ...)
    github.com/DataDog/opentelemetry-mapping-go/pkg/otlp/metrics@v0.13.2/metrics_translator.go:847 +0xabe
github.com/DataDog/opentelemetry-mapping-go/pkg/otlp/metrics.(*Translator).MapMetrics(0xc002aefb90, {0x911ee78, 0xc002e9d7a0}, {0xc0031ae000?, 0xc00206a240?}, {0x90fc310?, 0xc001d2e540?})
    github.com/DataDog/opentelemetry-mapping-go/pkg/otlp/metrics@v0.13.2/metrics_translator.go:797 +0xd27
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter.(*metricsExporter).PushMetricsData(0xc002afea20, {0x911ee78, 0xc002e9d7a0}, {0xc0031ae000?, 0xc00206a240?})
    github.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter@v0.94.0/metrics_exporter.go:212 +0x21d
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter.(*metricsExporter).PushMetricsDataScrubbed(0xc002afea20, {0x911ee78?, 0xc002e9d7a0?}, {0xc0031ae000?, 0xc00206a240?})
    github.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter@v0.94.0/metrics_exporter.go:185 +0x2c
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsRequest).Export(0x0?, {0x911ee78?, 0xc002e9d7a0?})
    go.opentelemetry.io/collector/exporter@v0.94.1/exporterhelper/metrics.go:59 +0x31
go.opentelemetry.io/collector/exporter/exporterhelper.(*timeoutSender).send(0xc001bdd980?, {0x911ee78?, 0xc002e9d7a0?}, {0x90d5d50?, 0xc0034429f0?})
    go.opentelemetry.io/collector/exporter@v0.94.1/exporterhelper/timeout_sender.go:43 +0x48
go.opentelemetry.io/collector/exporter/exporterhelper.(*baseRequestSender).send(0xc00280e8c0?, {0x911ee78?, 0xc002e9d7a0?}, {0x90d5d50?, 0xc0034429f0?})
    go.opentelemetry.io/collector/exporter@v0.94.1/exporterhelper/common.go:35 +0x30
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send(0xc002d8c690, {0x911f350?, 0xc002879af0?}, {0x90d5d50?, 0xc0034429f0?})
    go.opentelemetry.io/collector/exporter@v0.94.1/exporterhelper/metrics.go:171 +0x7e
go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1({0x911f350?, 0xc002879af0?}, {0x90d5d50?, 0xc0034429f0?})
    go.opentelemetry.io/collector/exporter@v0.94.1/exporterhelper/queue_sender.go:95 +0x84
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume(0x912a020, 0xc002d8c6f0)
    go.opentelemetry.io/collector/exporter@v0.94.1/internal/queue/bounded_memory_queue.go:57 +0xc7
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1()
    go.opentelemetry.io/collector/exporter@v0.94.1/internal/queue/consumers.go:43 +0x79
created by go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start in goroutine 1
    go.opentelemetry.io/collector/exporter@v0.94.1/internal/queue/consumers.go:39 +0x7d

Repro / setup:

kubectl --context $CTX_CLUSTER3 create namespace observability

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

# otel collector config via helm
cat > otel-config-datadog.yaml <<EOF
mode: deployment
config:
  exporters:
    datadog:
      api:
        site: datadoghq.eu
        key: <key>
  service:
    pipelines:
      logs:
        exporters:
          - datadog
      traces:
        exporters:
          - datadog
      metrics:
        exporters:
          - datadog
EOF

helm upgrade --install \
  --kube-context ${CTX_CLUSTER3} \
  -n observability \
  --set mode=deployment \
  -f otel-config-datadog.yaml \
  opentelemetry-collector open-telemetry/opentelemetry-collector
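
Before wiring the mesh to the collector, it is worth confirming the deployment is up and that the service name matches the endpoint used in the MeshMetric policy below (a quick check, assuming the release name, namespace, and context from the commands above):

kubectl --context ${CTX_CLUSTER3} -n observability get pods
kubectl --context ${CTX_CLUSTER3} -n observability get svc opentelemetry-collector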

# enable Metrics
kumactl apply -f - <<EOF
type: MeshMetric
name: metrics-default
mesh: default
spec:
  targetRef:
    kind: Mesh
  # applications:
  #  - name: "backend"
  default:
    backends:
    - type: OpenTelemetry
      openTelemetry: 
        endpoint: "opentelemetry-collector.observability.svc:4317"
EOF
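
As a sanity check that the policy was accepted, something like the following should list it (a hypothetical verification step; the exact kumactl subcommand and flags may differ by version, and on Kubernetes control planes the policy is also visible as a MeshMetric CRD):

kumactl get meshmetrics --mesh default
# on Kubernetes control planes, equivalently:
kubectl --context $CTX_CLUSTER3 get meshmetrics -A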

slonka commented 6 months ago

Original author @bcollard

Automaat commented 6 months ago

We can add the debug exporter, for example:

metrics:
  exporters:
    - datadog
    - debug

This will log all collected metrics, so we can find the metrics on which the Datadog exporter fails and create an issue in the OpenTelemetry Collector repo. @bcollard could you look at it?

bcollard commented 6 months ago

otel-exporter-2.log otel-exporter-1.log

The OTel collector keeps crashing with the debug exporter enabled for metrics.

Automaat commented 6 months ago

I see that I forgot the rest of the debug exporter config. @bcollard, can you run this again with this config:

mode: deployment
config:
  exporters:
    debug:
      verbosity: detailed
    datadog:
      api:
        site: datadoghq.eu
        key: <key>
  service:
    pipelines:
      logs:
        exporters:
          - datadog
      traces:
        exporters:
          - datadog
      metrics:
        exporters:
          - datadog
          - debug

This should properly log the collected metrics so we can debug further.
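
For completeness, re-applying the chart with that config and capturing the debug output in a file could look like this (a sketch reusing the release name, namespace, and context from the original repro; the otel-config-datadog-debug.yaml file name is just a placeholder, and the deployment is assumed to keep the default opentelemetry-collector name):

# save the values above as otel-config-datadog-debug.yaml, then:
helm upgrade --install \
  --kube-context ${CTX_CLUSTER3} \
  -n observability \
  -f otel-config-datadog-debug.yaml \
  opentelemetry-collector open-telemetry/opentelemetry-collector

# follow the collector logs (detailed debug exporter output included)
# and keep a copy on disk
kubectl --context ${CTX_CLUSTER3} -n observability \
  logs deploy/opentelemetry-collector -f | tee otel-exporter.log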

bcollard commented 6 months ago

Here attached: otel-cluster1.log otel-cluster2.log

"kuma" appears a lot in the otel-cluster1.log file, not in the other.

Automaat commented 5 months ago

The logs look fine, but we could also verify whether this is only a Datadog exporter issue by pushing metrics to some other SaaS product like Grafana and checking if the issue is still there (see the sketch below). There is an example of how to set this up in the demo-scene repo. Could you try this without the Datadog exporter, @bcollard?
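
A minimal sketch of such a non-Datadog metrics pipeline, assuming Grafana Cloud's Prometheus remote-write endpoint, the collector's prometheusremotewrite exporter, and the basicauth extension (the endpoint URL, instance ID, and API key are placeholders, not values from this setup):

mode: deployment
config:
  extensions:
    basicauth/grafana:
      client_auth:
        username: "<grafana-cloud-instance-id>"
        password: "<grafana-cloud-api-key>"
  exporters:
    debug:
      verbosity: detailed
    prometheusremotewrite:
      endpoint: "https://<grafana-prometheus-host>/api/prom/push"
      auth:
        authenticator: basicauth/grafana
  service:
    extensions:
      # keep the chart's default health_check extension alongside basic auth
      - health_check
      - basicauth/grafana
    pipelines:
      metrics:
        exporters:
          - prometheusremotewrite
          - debug

If the panic disappears with this pipeline, that would point at the Datadog mapping rather than the metrics Kuma produces.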

github-actions[bot] commented 5 months ago

Removing closed state labels due to the issue being reopened.

github-actions[bot] commented 2 months ago

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed. If you think this issue is still relevant, please comment on it or attend the next triage meeting.