open-telemetry / opentelemetry-lambda

Create your own Lambda Layer in each OTel language using this starter code. Add the Lambda Layer to your Lambda Function to get tracing with OpenTelemetry.
https://opentelemetry.io
Apache License 2.0

Deadline exceeded in DataDog exporter #1409

Open 3miliano opened 1 month ago

3miliano commented 1 month ago

Describe the bug: I am experiencing a “context deadline exceeded” issue in the DataDog exporter, as evidenced by the logs below. The issue results in failed export attempts and subsequent retries.

Steps to reproduce

  1. Configure the custom Docker image with a custom collector build, based on opentelemetry-lambda, that includes the DataDog exporter (a builder-manifest sketch follows this list).
  2. Initiate data export (traces, logs, metrics).
  3. Observe the logs for errors related to context deadlines being exceeded.
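
For context, one way to assemble a custom collector distribution that bundles these components is the OpenTelemetry Collector Builder (ocb). The manifest below is only a sketch of such a build; the dist name, output path, and the choice of ocb itself (rather than the opentelemetry-lambda build scripts) are assumptions and may not match how the image in this report was actually produced.

builder-config.yaml (hypothetical):

dist:
  name: otelcol-lambda-datadog   # assumed name
  output_path: ./build           # assumed path
  otelcol_version: 0.103.0

receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.103.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/receiver/hostmetricsreceiver v0.103.0

processors:
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.103.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/resourcedetectionprocessor v0.103.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/transformprocessor v0.103.0

exporters:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter v0.103.0

connectors:
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/connector/datadogconnector v0.103.0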

What did you expect to see? I expected the data to be exported successfully to DataDog without any timeout errors.

What did you see instead? The export requests failed with “context deadline exceeded” errors, resulting in retries and eventual dropping of the payloads. Here are some excerpts from the logs:

1719687286935 {"level":"warn","ts":1719687286.9350078,"caller":"batchprocessor@v0.103.0/batch_processor.go:263","msg":"Sender failed","kind":"processor","name":"batch","pipeline":"logs","error":"no more retries left: Post \"https://http-intake.logs.datadoghq.com/api/v2/logs?ddtags=service%3Akognitos.book.yaml%2Cenv%3Amain%2Cregion%3Aus-west-2%2Ccloud_provider%3Aaws%2Cos.type%3Alinux%2Cotel_source%3Adatadog_exporter\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
1719687286936 {"level":"error","ts":1719687286.9363096,"caller":"datadogexporter@v0.103.0/traces_exporter.go:181","msg":"Error posting hostname/tags series","kind":"exporter","data_type":"traces","name":"datadog","error":"max elapsed time expired Post \"https://api.datadoghq.com/api/v2/series\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","stacktrace":"github.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter.(*traceExporter).exportUsageMetrics\n\t/root/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter@v0.103.0/traces_exporter.go:181\ngithub.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter.(*traceExporter).consumeTraces\n\t/root/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter@v0.103.0/traces_exporter.go:139\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*tracesRequest).Export\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.103.0/exporterhelper/traces.go:59\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*timeoutSender).send\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.103.0/exporterhelper/timeout_sender.go:43\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*baseRequestSender).send\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.103.0/exporterhelper/common.go:37\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.103.0/exporterhelper/traces.go:159\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*baseRequestSender).send\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.103.0/exporterhelper/common.go:37\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*baseRequestSender).send\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.103.0/exporterhelper/common.go:37\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*baseExporter).send\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.103.0/exporterhelper/common.go:294\ngo.opentelemetry.io/collector/exporter/exporterhelper.NewTracesRequestExporter.func1\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.103.0/exporterhelper/traces.go:134\ngo.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/consumer@v0.103.0/traces.go:25\ngo.opentelemetry.io/collector/processor/batchprocessor.(*batchTraces).export\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/processor/batchprocessor@v0.103.0/batch_processor.go:414\ngo.opentelemetry.io/collector/processor/batchprocessor.(*shard).sendItems\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/processor/batchprocessor@v0.103.0/batch_processor.go:261\ngo.opentelemetry.io/collector/processor/batchprocessor.(*shard).startLoop\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/processor/batchprocessor@v0.103.0/batch_processor.go:223"}

What version of collector/language SDK did you use? Version: Custom layer-collector/0.8.0 + datadogexporter from v0.103.0

What language layer did you use? Config: None. It is a custom runtime that includes the binary in extensions.

Additional context: Here is my configuration file:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "127.0.0.1:4317"
  hostmetrics:
    collection_interval: 60s
    scrapers:
      paging:
        metrics:
          system.paging.utilization:
            enabled: true
      cpu:
        metrics:
          system.cpu.utilization:
            enabled: true
      disk:
      filesystem:
        metrics:
          system.filesystem.utilization:
            enabled: true
      load:
      memory:
      network:
      processes:

exporters:
  datadog:
    api:
      key: ${secretsmanager:infrastructure/datadog_api_key}
    sending_queue:
      enabled: false
    tls:
      insecure: true
      insecure_skip_verify: true

connectors:
  datadog/connector:

processors:
  resourcedetection:
    detectors: ["lambda", "system"]
    system:
      hostname_sources: ["os"]
  transform:
    log_statements:
      - context: resource
        statements:
          - delete_key(attributes, "service.version")
          - set(attributes["service"], attributes["service.name"])
          - delete_key(attributes, "service.name")
      - context: log
        statements:
          - set(body, attributes["exception.message"]) where attributes["exception.message"] != nil
          - set(attributes["error.stack"], attributes["exception.stacktrace"]) where attributes["exception.stacktrace"] != nil
          - set(attributes["error.message"], attributes["exception.message"]) where attributes["exception.message"] != nil
          - set(attributes["error.kind"], attributes["exception.kind"]) where attributes["exception.kind"] != nil
service:
  telemetry:
    logs:
      level: "debug"
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resourcedetection]
      exporters: [datadog/connector]
    traces/2:
      receivers: [datadog/connector]
      exporters: [datadog]
    metrics:
      receivers: [hostmetrics, otlp]
      processors: [resourcedetection]
      exporters: [datadog]
    logs:
      receivers: [otlp]
      processors: [resourcedetection, transform]
      exporters: [datadog]

Enabling/disabling sending_queue does not seem to do anything to prevent the errors. I did notice that if I hit the service continuously some traces do get sent, but only a few.
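
Since the failures are HTTP client timeouts (“Client.Timeout exceeded while awaiting headers”), the generic timeout and retry_on_failure settings that collector exporters, including datadog, accept may also be worth experimenting with. The values below are illustrative assumptions, not a confirmed fix for this issue:

exporters:
  datadog:
    api:
      key: ${secretsmanager:infrastructure/datadog_api_key}
    timeout: 30s             # illustrative; raises the per-request timeout
    retry_on_failure:
      enabled: true
      initial_interval: 1s   # illustrative
      max_elapsed_time: 60s  # illustrative
    sending_queue:
      enabled: false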

What I have ruled out as potential causes:

  1. Connectivity issues. DataDog's API key validation call succeeds, and if the service is hit constantly some traces do get through.

tylerbenson commented 1 month ago

Any reason you're not using the batch processor? That would probably help.
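
For reference, wiring the batch processor into the configuration above would look roughly like the following; the batch settings shown are illustrative, not values recommended in this thread:

processors:
  batch:
    timeout: 1s             # flush interval; illustrative
    send_batch_size: 8192   # illustrative

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resourcedetection, batch]
      exporters: [datadog/connector]
    # traces/2 unchanged
    metrics:
      receivers: [hostmetrics, otlp]
      processors: [resourcedetection, batch]
      exporters: [datadog]
    logs:
      receivers: [otlp]
      processors: [resourcedetection, transform, batch]
      exporters: [datadog]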