Closed sirianni closed 4 months ago
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Thanks for reporting. Can you please test this with collector after setting [trace buffer] (https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/4918e96af2447d4b5110d71143eb1a8e1071cb78/exporter/datadogexporter/config.go#L297) in datadog exporter to 1000.
Could also please share the full collector config.
Our full collector config is massive. Here is the config for the Datadog exporter
datadog:
metrics:
resource_attributes_as_tags: true
instrumentation_scope_metadata_as_tags: true
summaries:
mode: noquantiles
traces:
compute_stats_by_span_kind: false
peer_tags_aggregation: false
# Buffer traces to avoid "Payload in channel full" errors
trace_buffer: 10
host_metadata:
enabled: false
sending_queue:
queue_size: 200
@sirianni 👋 QQ
When we send the same workload of traces via the Datadog Agent, we do not see a similar memory leak. I thought this was worth noting since my understanding is that the two use the same library to calculate the trace stats.
What's the version of Datadog Agent that you use? You're right Datadog Agent and Datadog exporter use the same library to calculate the trace stats. We want to double check the versions
We are currently running Datadog Agent 6.45.1
.
Thanks @dineshg13 - I will experiment with the change. Can you help me understand how increasing trace_buffer
would affect memory usage of the DD APM stats?
When we send the same workload of traces via the Datadog Agent, we do not see a similar memory leak.
I'm actually not able to confirm this is the case. We run the DD Agent with larger k8s memory limits so it's possible it's exhibiting a similar issue and we just haven't noticed.
@dineshg13 setting trace_buffer: 1000
did not fix the leak or show any meaningful improvement in memory usage.
@sirianni Are you using Datadog connector ? It would be helpful to have share the full collector config.
@sirianni Are you using Datadog connector ? It would be helpful to have share the full collector config.
Yes we are. The full config is quite complicated because we have applications that dual write traces to both our OTel Collector and the DD Agent. We are doing this to allow for a graceful transition from DD Agent to OTel Collector.
What we are trying to accomplish for OpenCensus is:
Effectively we're doing something like this.
piplelines:
traces/opencensus:
receivers:
- opencensus
exporters:
- datadog/trace_stats_connector
traces/otel:
receivers:
- otlp
exporters:
- datadog
metrics:
receivers:
- datadog/trace_stats_connector
exporters:
- datadog
I noticed that the datadogconnector
does not support any configuration options compute_stats_by_span_kind
or peer_tags_aggregation
@sirianni Thanks for the details. We have fixed datadogconnector to support compute_stats_by_span_kind or peer_tags_aggregation. It should be available in the next release.
Are you using the feature gate in the connector ? We should enable connector.datadogconnector.performance
if you are using connector for performance reasons. Please checkout the README
Do you expect to compute stats in ur traces/otel pipeline ?
Do you expect to compute stats in ur traces/otel pipeline ?
Yes, but these are computed by the datadog
exporter. We're only using the datadogconnector
for OpenCensus traces because we only want APM metrics for those, and not to export the traces themselves.
Are you using the feature gate in the connector ?
No, but the feature gate looks to make things more efficient by encoding metrics more efficiently as raw bytes. To be clear, the symptoms here look like a memory leak and not just "high usage". So while enabling the feature gate may make it take longer before the heap is exhausted, it would not seem to fix the root cause of the leak itself.
Also, in our case the heap was consumed by datadog-agent/pkg/trace/stats.(*Concentrator)
which wouldn't seem to be affected by the featuregate. As I understand it, the featuregate controls the encoding of metrics to downstream pipelines and the Concentrator
is the internal data structure used to maintain the APM stats.
@dineshg13 Are there fixes other than #31026 identified for this issue?
@kamal-narayan, for now this is the only fix that has been shipped. We are still load testing this PR, and will report back here once we have more results.
Connector Not in Trace to Trace Pipeline
@sirianni the stats code (datadog-agent/pkg/trace/stats.(*Concentrator)
) that we see in the profiles you shared will create aggregation keys/buckets based on env
, host
, version
, service
...etc. This means that with a higher cardinality being sent to a collector, we expect a higher memory usage.
We've load tested the collector with multiple cardinalities, but were not able to reproduce the memory leak. The two noteworthy points of the load tests were:
datadog-agent/pkg/trace/stats.(*Concentrator)
) / extendRange
is the largest consumer of live heap memory. We consider this a "leak" vs. a large working set because of the slow growth in heap usage.
During the slow growth in heap usage from the screenshot, were there any new hosts being spun up sending data to that collector ? I'm trying to understand if there was an increase in cardinality during that time period, which would lead to the stats module using more memory. In the graphs that you shared, the collector memory seems to stabilise at some point. At this point, did the cardinality being sent to this collector reach it's peak ?
I'm actually not able to confirm this is the case. We run the DD Agent with larger k8s memory limits so it's possible it's exhibiting a similar issue and we just haven't noticed.
Are you able to run a test with the collector by giving it more memory ?
I'm interested in whether the memory stabilises. If you experience symptoms of a memory leak, would you be able to generate new profiles and also output traces via the file exporter ? This would allow us to run our load tests using the traces that you shared to try to reproduce the memory leak.
Connector IN Trace to Trace Pipeline
In this case, we were able to identify and reproduce a memory leak when the connector is used in the trace to trace pipeline. This was fixed in the following following PR, which will be part of the next collector release.
@sirianni Based on the config you shared, you are not using connector in trace to trace pipeline so you should not be affected by this memory leak.
Still happening here too 😢
@diogotorres97 please see: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/30908#issuecomment-1946012544
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers
. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
This issue has been closed as inactive because it has been stale for 120 days with no activity.
Component(s)
exporter/datadog
What happened?
Description
We are observing a memory leak in the Datadog Exporter which appears to be in the trace/apm stats component.
The pattern we see is that when the collector restarts, memory grows steadily until
GOMEMLIMIT
is reachedWe consider this a "leak" vs. a large working set because of the slow growth in heap usage.
Looking at a
pprof
of the collector process we see the majority of heap is used by thedatadog-agent/trace/stats
module.We have already disabled these flags in the Datadog Exporter.
Collector version
v0.90.1
Additional context
Seems similar to https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/15720
When we send the same workload of traces via the Datadog Agent, we do not see a similar memory leak. I thought this was worth noting since my understanding is that the two use the same library to calculate the trace stats.