[exporter/datadog] Memory leak in trace stats module

sirianni commented 9 months ago

Component(s)

exporter/datadog

What happened?

Description

We are observing a memory leak in the Datadog Exporter which appears to be in the trace/apm stats component.

The pattern we see is that when the collector restarts, memory grows steadily until GOMEMLIMIT is reached

We consider this a "leak" vs. a large working set because of the slow growth in heap usage.

Looking at a pprof of the collector process we see the majority of heap is used by the datadog-agent/trace/stats module.

$ go tool pprof -top -cum ~/tmp/mothership-pprof-heap.out
File: collector
Build ID: 38f9c6cdffde4b9f1487841c8b484fa5cf2787fb
Type: inuse_space
Time: Jan 29, 2024 at 9:59am (EST)
Showing nodes accounting for 376.01MB, 93.97% of 400.12MB total
Dropped 263 nodes (cum <= 2MB)
      flat  flat%   sum%        cum   cum%
         0     0%     0%   294.57MB 73.62%  github.com/DataDog/datadog-agent/pkg/trace/stats.(*Concentrator).Add
         0     0%     0%   294.57MB 73.62%  github.com/DataDog/datadog-agent/pkg/trace/stats.(*Concentrator).Run.func1
    0.67MB  0.17%  0.17%   294.57MB 73.62%  github.com/DataDog/datadog-agent/pkg/trace/stats.(*Concentrator).addNow
         0     0%  0.17%   293.90MB 73.45%  github.com/DataDog/datadog-agent/pkg/trace/stats.(*RawBucket).HandleSpan
   23.51MB  5.87%  6.04%   293.90MB 73.45%  github.com/DataDog/datadog-agent/pkg/trace/stats.(*RawBucket).add
         0     0%  6.04%   195.39MB 48.83%  github.com/DataDog/sketches-go/ddsketch.(*DDSketch).Add (inline)
         0     0%  6.04%   195.39MB 48.83%  github.com/DataDog/sketches-go/ddsketch.(*DDSketch).AddWithCount
         0     0%  6.04%   195.39MB 48.83%  github.com/DataDog/sketches-go/ddsketch/store.(*CollapsingLowestDenseStore).AddWithCount
  195.39MB 48.83% 54.88%   195.39MB 48.83%  github.com/DataDog/sketches-go/ddsketch/store.(*CollapsingLowestDenseStore).extendRange
         0     0% 54.88%   195.39MB 48.83%  github.com/DataDog/sketches-go/ddsketch/store.(*CollapsingLowestDenseStore).normalize
   10.50MB  2.62% 57.50%    75.01MB 18.75%  github.com/DataDog/datadog-agent/pkg/trace/stats.newGroupedStats
         0     0% 57.50%    64.50MB 16.12%  github.com/DataDog/sketches-go/ddsketch.LogCollapsingLowestDenseDDSketch
      37MB  9.25% 66.75%       37MB  9.25%  github.com/DataDog/sketches-go/ddsketch/store.NewCollapsingLowestDenseStore (inline)

We have already disabled these flags in the Datadog Exporter.

    traces:
      compute_stats_by_span_kind: false
      peer_tags_aggregation: false

Collector version

v0.90.1

Additional context

When we send the same workload of traces via the Datadog Agent, we do not see a similar memory leak. I thought this was worth noting since my understanding is that the two use the same library to calculate the trace stats.

github-actions[bot] commented 9 months ago

Pinging code owners:

exporter/datadog: @mx-psi @dineshg13 @liustanley @songy23 @mackjmr

See Adding Labels via Comments if you do not have permissions to add labels yourself.

dineshg13 commented 9 months ago

Thanks for reporting. Can you please test this with collector after setting [trace buffer] (https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/4918e96af2447d4b5110d71143eb1a8e1071cb78/exporter/datadogexporter/config.go#L297) in datadog exporter to 1000.

Could also please share the full collector config.

sirianni commented 9 months ago

Our full collector config is massive. Here is the config for the Datadog exporter

  datadog:
    metrics:
      resource_attributes_as_tags: true
      instrumentation_scope_metadata_as_tags: true
      summaries:
        mode: noquantiles
    traces:
      compute_stats_by_span_kind: false
      peer_tags_aggregation: false
      # Buffer traces to avoid "Payload in channel full" errors
      trace_buffer: 10
    host_metadata:
      enabled: false
    sending_queue:
      queue_size: 200

songy23 commented 9 months ago

@sirianni 👋 QQ

When we send the same workload of traces via the Datadog Agent, we do not see a similar memory leak. I thought this was worth noting since my understanding is that the two use the same library to calculate the trace stats.

What's the version of Datadog Agent that you use? You're right Datadog Agent and Datadog exporter use the same library to calculate the trace stats. We want to double check the versions

sirianni commented 9 months ago

We are currently running Datadog Agent 6.45.1.

sirianni commented 9 months ago

Thanks @dineshg13 - I will experiment with the change. Can you help me understand how increasing trace_buffer would affect memory usage of the DD APM stats?

sirianni commented 9 months ago

When we send the same workload of traces via the Datadog Agent, we do not see a similar memory leak.

I'm actually not able to confirm this is the case. We run the DD Agent with larger k8s memory limits so it's possible it's exhibiting a similar issue and we just haven't noticed.

sirianni commented 9 months ago

@dineshg13 setting trace_buffer: 1000 did not fix the leak or show any meaningful improvement in memory usage.

dineshg13 commented 9 months ago

@sirianni Are you using Datadog connector ? It would be helpful to have share the full collector config.

sirianni commented 9 months ago

@sirianni Are you using Datadog connector ? It would be helpful to have share the full collector config.

Yes we are. The full config is quite complicated because we have applications that dual write traces to both our OTel Collector and the DD Agent. We are doing this to allow for a graceful transition from DD Agent to OTel Collector.

What we are trying to accomplish for OpenCensus is:

Convert OpenCensus traces to OTel in the collector
Generate DD APM stats from these converted traces
Send only the APM stats for these traces to DD
We don't send the raw traces to DD because these would cause duplicate SpanIDs (since the applications themselves are dual-exporting to both DD Agent and our collector)

Effectively we're doing something like this.

piplelines:
  traces/opencensus:
    receivers: 
    - opencensus
    exporters:
    - datadog/trace_stats_connector
  traces/otel:
    receivers:
    - otlp
    exporters:
    - datadog
  metrics:
    receivers:
    - datadog/trace_stats_connector
    exporters:
    - datadog

I noticed that the datadogconnector does not support any configuration options compute_stats_by_span_kind or peer_tags_aggregation

dineshg13 commented 9 months ago

@sirianni Thanks for the details. We have fixed datadogconnector to support compute_stats_by_span_kind or peer_tags_aggregation. It should be available in the next release.

Are you using the feature gate in the connector ? We should enable connector.datadogconnector.performance if you are using connector for performance reasons. Please checkout the README

Do you expect to compute stats in ur traces/otel pipeline ?

sirianni commented 9 months ago

Do you expect to compute stats in ur traces/otel pipeline ?

Yes, but these are computed by the datadog exporter. We're only using the datadogconnector for OpenCensus traces because we only want APM metrics for those, and not to export the traces themselves.

sirianni commented 9 months ago

Are you using the feature gate in the connector ?

No, but the feature gate looks to make things more efficient by encoding metrics more efficiently as raw bytes. To be clear, the symptoms here look like a memory leak and not just "high usage". So while enabling the feature gate may make it take longer before the heap is exhausted, it would not seem to fix the root cause of the leak itself.

Also, in our case the heap was consumed by datadog-agent/pkg/trace/stats.(*Concentrator) which wouldn't seem to be affected by the featuregate. As I understand it, the featuregate controls the encoding of metrics to downstream pipelines and the Concentrator is the internal data structure used to maintain the APM stats.

kamal-narayan commented 9 months ago

@dineshg13 Are there fixes other than #31026 identified for this issue?

mackjmr commented 9 months ago

@kamal-narayan, for now this is the only fix that has been shipped. We are still load testing this PR, and will report back here once we have more results.

mackjmr commented 9 months ago

Connector Not in Trace to Trace Pipeline

@sirianni the stats code (datadog-agent/pkg/trace/stats.(*Concentrator)) that we see in the profiles you shared will create aggregation keys/buckets based on env, host, version, service...etc. This means that with a higher cardinality being sent to a collector, we expect a higher memory usage.

We've load tested the collector with multiple cardinalities, but were not able to reproduce the memory leak. The two noteworthy points of the load tests were:

The profiles that we collected match the ones that you have sent above, where the stats code (datadog-agent/pkg/trace/stats.(*Concentrator)) / extendRange is the largest consumer of live heap memory.
The memory of the collector never seems to exhibit a leak. A higher cardinality does lead to a higher memory usage by the stats module, but the memory usage always stabilises.

We consider this a "leak" vs. a large working set because of the slow growth in heap usage.

During the slow growth in heap usage from the screenshot, were there any new hosts being spun up sending data to that collector ? I'm trying to understand if there was an increase in cardinality during that time period, which would lead to the stats module using more memory. In the graphs that you shared, the collector memory seems to stabilise at some point. At this point, did the cardinality being sent to this collector reach it's peak ?

I'm actually not able to confirm this is the case. We run the DD Agent with larger k8s memory limits so it's possible it's exhibiting a similar issue and we just haven't noticed.

Are you able to run a test with the collector by giving it more memory ?

I'm interested in whether the memory stabilises. If you experience symptoms of a memory leak, would you be able to generate new profiles and also output traces via the file exporter ? This would allow us to run our load tests using the traces that you shared to try to reproduce the memory leak.

Connector IN Trace to Trace Pipeline

In this case, we were able to identify and reproduce a memory leak when the connector is used in the trace to trace pipeline. This was fixed in the following following PR, which will be part of the next collector release.

@sirianni Based on the config you shared, you are not using connector in trace to trace pipeline so you should not be affected by this memory leak.

diogotorres97 commented 9 months ago

Still happening here too 😢

mackjmr commented 9 months ago

@diogotorres97 please see: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/30908#issuecomment-1946012544

github-actions[bot] commented 6 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

exporter/datadog: @mx-psi @dineshg13 @liustanley @songy23 @mackjmr

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions[bot] commented 4 months ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.

open-telemetry / opentelemetry-collector-contrib