open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

Datadog Connector Memory Issues #29755

Closed by dineshg13 8 months ago

dineshg13 commented 9 months ago

Component(s)

connector/datadog

What happened?

Description

Customers using the Datadog connector at scale have reported Collector memory issues. We are able to replicate the issue with the help of a trace dump. When the Datadog connector is in use, the Collector's memory usage keeps growing and it OOMs within a few minutes of starting.

Steps to Reproduce

Run the Collector with the configuration below and send traces through the pipelines.

Expected Result

Collector shouldn't OOM.

Actual Result

Collector memory and CPU usage spike, and we are unable to use the Datadog connector at scale.

Collector version

v0.91.0

Environment information

Environment

Latest GKE cluster.

OpenTelemetry Collector configuration

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: "otelcol"
          scrape_interval: 10s
          static_configs:
            - targets: ["0.0.0.0:8888"]
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: ".*grpc_io.*"
              action: drop
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
  groupbyattrs:
    keys:
      - service.name
      - environment
  attributes/env:
    actions:
      - action: upsert
        key: deployment.environment
        value: "${env:DD_SERVICE}"
  attributes/drop:
    include:
      match_type: strict
      resources:
        - key: service.name
        - key: environment
    exclude:
      match_type: regexp
      resources:
        key: ".*"
    actions:
      - action: insert
        key: deployment.environment
        from_attribute: environment
  probabilistic_sampler:
    hash_seed: 22
    sampling_percentage: 1
  resourcedetection:
    detectors: [env, gcp]
    timeout: 2s
    override: false
extensions:
  health_check:
connectors:
  datadog/connector:
    trace_buffer: 500
exporters:
  datadog:
    sending_queue:
      queue_size: 10000
    traces:
      trace_buffer: 500
    metrics:
      resource_attributes_as_tags: true
      histograms:
        mode: "counters"
        send_count_sum_metrics: true
    api:
      key: "${env:DD_API_KEY}"
service:
  extensions:
    - health_check
  telemetry:
    logs:
      initial_fields:
        - service: "otel-collector"
  pipelines:
    metrics:
      receivers: [otlp, datadog/connector, prometheus]
      processors: [resourcedetection, attributes/env, batch]
      exporters: [datadog]
    traces/1:
      receivers: [otlp]
      processors: [attributes/env, groupbyattrs, resourcedetection]
      exporters: [datadog/connector]
    traces/2:
      receivers: [otlp]
      processors: [probabilistic_sampler, attributes/env, resourcedetection, batch]
      exporters: [datadog]

Log output

No response

Additional context

No response

dineshg13 commented 8 months ago

This is resolved via a feature gate. See the Datadog connector README.

grzn commented 7 months ago

Hi,

We're still seeing memory issues, even with the feature gate enabled.

mariohdoz commented 7 months ago

Hi @grzn,

Can you please give me an example of how you enable the feature gate? I was looking for an example of how to do that, but I didn't find anything.

arielvalentin commented 7 months ago

Hi,

We're still seeing memory issues, even with the feature gate enabled.

Same for us. We've reported our issue directly to Datadog.

grzn commented 7 months ago

It's a command-line parameter to the binary.
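
For readers looking for a concrete example, the sketch below shows one way to pass such a gate when the Collector runs in Kubernetes (the issue mentions GKE). The Collector binary does accept a `--feature-gates` flag. The gate identifier `connector.datadogconnector.performance` is not stated anywhere in this thread; it is assumed here from the Datadog connector README, so check it against the README for your version. The container name, image tag, and config path are hypothetical.

```yaml
# Hypothetical excerpt from a Kubernetes Deployment pod spec for the Collector.
containers:
  - name: otel-collector                                  # illustrative name
    image: otel/opentelemetry-collector-contrib:0.92.0    # illustrative tag
    args:
      - "--config=/conf/config.yaml"                      # illustrative path
      # Assumed gate name (not given in this thread); verify in the connector README.
      - "--feature-gates=connector.datadogconnector.performance"
```

When the binary is started directly instead, the same `--feature-gates=...` argument would be appended to the command line after `--config`.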

We were on v0.7something and it was all good. Now we're trying v0.92 and it's leaking memory. We're going to try v0.82, which is the last version before the processor refactor.

arielvalentin commented 7 months ago

@grzn We didn't have success with the deprecated processor because it does not support computing stats by peer service and span kind.

Once we enabled it, we lost the ability to see metrics for inferred services.

grzn commented 7 months ago

I'll push this through our Datadog channels as well.

sirianni commented 7 months ago

Cross-referencing to https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/30828