diranged opened this issue 5 months ago
Pinging code owners:
exporter/prometheusremotewrite: @Aneurysm9 @rapphil
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Oh, there is a secondary question here - why does the p99 graph pin out at 10s? All of my timeouts are far higher than that. I found that pretty suspicious too.
I believe I have encountered a similar thing. When throttling the output bandwidth of opentelemetry-collector using nftables, I accidentally also applied this to packets being written into a local Prometheus instance.
The otelcontribcol process would rapidly increase in memory usage (before taking down the whole system). It did not happen any more when I disabled the prometheusremotewrite exporter or disabled the firewall rule.
We continue to see this... we're experimenting with a different Prometheus backend, and any time that backend's push latency climbs into the hundreds of milliseconds, the memory footprint on our OTel collector pods triples, they start having to retry, and they finally start throwing out-of-order write errors (because our primary backend is Amazon AMP).
It isn't clear to me if this is a memory leak, or if you are simply getting extremely close to your memory limit and constantly GC'ing. I suspect the memory_limiter's additional triggering of GCs is similar to the thrashing behavior described in https://tip.golang.org/doc/gc-guide#Memory_limit.
For a memory leak, I would expect to see memory growth over long periods of time, ending with an inevitable OOM even when you have very little throughput.
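For reference, the thresholds involved in that thrashing are the memory_limiter's soft and hard limits. A minimal sketch of the relevant settings is below; the numbers are placeholders, not a recommendation, and the comments paraphrase the processor's documented behavior:

```yaml
processors:
  memory_limiter:
    # How often heap usage is sampled.
    check_interval: 1s
    # Hard limit: exceeding it forces a GC (the "Forcing a GC" log lines).
    limit_mib: 1600
    # Soft limit is limit_mib - spike_limit_mib; above it, incoming data is refused.
    spike_limit_mib: 400
```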
We're also seeing this with a very similar setup. Our backend is Thanos. We observe that below a certain incoming request rate of data points on the collector (around 5k/second) the queue does not grow. Once the request rate is above 5k/second the queue starts to grow and we run into out of order errors in Thanos.
Here you can see the requests per OTel collector, the queue size for each pod, and the occurrence of out of order requests on the Thanos ingesters.
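(For context, the queue-size and request-rate series in these graphs come from the collector's internal telemetry metrics, e.g. otelcol_exporter_queue_size. A minimal sketch of exposing them follows; the level and address are generic defaults, not anything specific to this setup:)

```yaml
service:
  telemetry:
    metrics:
      # Exposes internal metrics such as otelcol_exporter_queue_size and
      # otelcol_exporter_sent_metric_points on the collector's Prometheus endpoint.
      level: detailed
      address: 0.0.0.0:8888
```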
In this scenario, we had one pod that was completely stuck in a GC loop, and 7 other pods that were "healthy" in that they were able to keep their queue under control. Here's the memory and CPU for the pods:
The stuck pod was constantly GCing:
```
Showing top 10 nodes out of 80
flat flat% sum% cum cum%
142.84s 17.10% 17.10% 142.84s 17.10% runtime.scanobject /usr/local/go/src/runtime/mgcmark.go:1446
83.31s 9.97% 27.07% 89.78s 10.75% runtime.findObject /usr/local/go/src/runtime/mbitmap.go:1291
66.84s 8.00% 35.07% 66.84s 8.00% runtime.(*mspan).base /usr/local/go/src/runtime/mheap.go:492 (inline)
54.07s 6.47% 41.54% 59.10s 7.07% runtime.(*gcBits).bitp /usr/local/go/src/runtime/mheap.go:2271
40.23s 4.82% 46.36% 40.23s 4.82% runtime.findObject /usr/local/go/src/runtime/mbitmap.go:1279
27.04s 3.24% 49.60% 27.04s 3.24% runtime.(*mspan).heapBitsSmallForAddr /usr/local/go/src/runtime/mbitmap.go:629
25.06s 3.00% 52.60% 25.06s 3.00% runtime.gcDrain /usr/local/go/src/runtime/mgcmark.go:1161
23.31s 2.79% 55.39% 30.10s 3.60% runtime.scanobject /usr/local/go/src/runtime/mgcmark.go:1437
16.53s 1.98% 57.36% 16.53s 1.98% runtime.greyobject /usr/local/go/src/runtime/mgcmark.go:1587
12.52s 1.50% 58.86% 12.52s 1.50% runtime.findObject /usr/local/go/src/runtime/mbitmap.go:1305
```
Healthy pods were fine:
```
Showing top 10 nodes out of 278
flat flat% sum% cum cum%
890ms 7.87% 7.87% 890ms 7.87% github.com/prometheus/prometheus/prompb.(*TimeSeries).Size /usr/src/packages/pkg/mod/github.com/prometheus/prometheus@v0.54.1/prompb/types.pb.go:2203
490ms 4.33% 12.20% 490ms 4.33% github.com/prometheus/prometheus/prompb.(*Label).MarshalToSizedBuffer /usr/src/packages/pkg/mod/github.com/prometheus/prometheus@v0.54.1/prompb/types.pb.go:1712
400ms 3.54% 15.74% 400ms 3.54% runtime.scanobject /usr/local/go/src/runtime/mgcmark.go:1446
340ms 3.01% 18.74% 340ms 3.01% runtime.memclrNoHeapPointers /usr/local/go/src/runtime/memclr_amd64.s:127
300ms 2.65% 21.40% 300ms 2.65% compress/flate.(*compressor).reset /usr/local/go/src/compress/flate/deflate.go:615
280ms 2.48% 23.87% 280ms 2.48% github.com/open-telemetry/opentelemetry-collector-contrib/pkg/translator/prometheusremotewrite.ByLabelName.Swap /usr/src/packages/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/pkg/translator/prometheusremotewrite@v0.111.0/helper.go:61
270ms 2.39% 26.26% 270ms 2.39% internal/runtime/syscall.Syscall6 /usr/local/go/src/internal/runtime/syscall/asm_linux_amd64.s:36
220ms 1.95% 28.21% 230ms 2.03% github.com/prometheus/prometheus/prompb.sovTypes /usr/src/packages/pkg/mod/github.com/prometheus/prometheus@v0.54.1/prompb/types.pb.go:2380
150ms 1.33% 29.53% 150ms 1.33% runtime.futex /usr/local/go/src/runtime/sys_linux_amd64.s:558
140ms 1.24% 30.77% 140ms 1.24% github.com/prometheus/prometheus/prompb.(*TimeSeries).Size /usr/src/packages/pkg/mod/github.com/prometheus/prometheus@v0.54.1/prompb/types.pb.go:2209
```
I then deleted the stuck pod. This caused the HPA to scale in the replicaset, as the average CPU load dropped dramatically without the constant GCing. The moment we scaled in, the requests per pod went up, and we immediately started to see the queues grow again without recovery.
Eventually, as the queue fills up, we seem to hit the memory limit, which triggers the GC cycle, and then that's it - the pod is lost.
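For reference, the queue that fills up here is the exporter's bounded sending queue. A rough sketch of the knobs that control it is below; the endpoint, queue size, and retry values are placeholders rather than anything from this setup:

```yaml
exporters:
  prometheusremotewrite:
    endpoint: https://thanos.example.com/api/v1/receive  # placeholder
    remote_write_queue:
      enabled: true
      # Bounded in-memory queue: once full, new batches are rejected
      # instead of accumulating until the memory limit is hit.
      queue_size: 10000
      num_consumers: 5
    retry_on_failure:
      enabled: true
      initial_interval: 50ms
      max_interval: 30s
      # Give up after this long so retries don't pile up behind a slow backend.
      max_elapsed_time: 2m
```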
Component(s)
exporter/prometheusremotewrite
What happened?
Description
I'm reporting what I think is a memory leak in the prometheusremotewriteexporter that is triggered when the downstream Prometheus endpoint is either slow to respond or failing entirely. This leak ultimately puts the collector into a GC loop that never recovers, impacting all of the work the collector is doing, not just the pipeline with the downstream problems. I've spent the last ~2 days troubleshooting this with AWS and talking through it with @Aneurysm9.
Steps to Reproduce
In my test case, we have a set of OTEL Collectors called metric-aggregators which accept inbound OTLP metric data (generally sourced from Prometheus receivers) and write the data into two different pipelines - call them a production and a debug pipeline. The data going into these pipelines can be the same, or it can be totally unique. In this case, the data is unique... I have data=foo -> production and data=bar -> debug, essentially.
Once the pipeline is humming along, introduce intentional throttling to the Prometheus endpoint on the debug pipeline - I did this by setting resources.requests.cpu=1 and resources.limits.cpu=1 on the Prometheus pod... and we're writing ~50-80k datapoints/sec, so that was enough to introduce throttling.
Expected Result
My expectation is that the debug pipeline will start failing requests (I'd expect to see context deadline exceeded messages), and that data would ultimately be refused by the batch processor, which would in turn refuse data upstream. I expect the production pipeline to continue to operate just fine, because there is no impact to its downstream targets.
Actual Result
Interestingly, we see impact that starts with the debug pipeline, but then spreads to all of the pipelines in the collector. After a period of time (~20-40m), the collectors are completely stuck in a GC loop triggered by the memory_limiter. Data then fails to write to the production pipeline. Additionally, when we un-clog the Prometheus debug endpoint, the collector doesn't recover on its own... it is stuck in this GC loop essentially indefinitely until we restart the pods.
Collector version
0.101.0
Environment information
Environment
OS: BottleRocket 1.19.4
OpenTelemetry Collector configuration
Additional context
Setting the Scene
I think the only way to explain the flow here is to start with a picture, and then talk through the timeline. In this picture, we have 5 graphs that are important to see at the same time.
Metric Datapoints Exported: This is the graph of successfully exported metrics per exporter. The blue and yellow lines are two exporters connected to the metrics/prometheus_prod pipeline. They are sending "production" data that we've validated. The orange line is the metrics/prometheus_beta pipeline, which is sending data we haven't yet validated - but a high volume of it. The green line is a debug output; it can be ignored.
Percentage of Metrics Exported: This is the success-rate graph for each of the exporters described above.
HTTP Response Times - ...amazonaws.com: This is the response-time graph for the prometheusremotewrite/amp exporter (a local AMP endpoint in the same region/account).
HTTP Response Times - ...com: This is the prometheusremotewrite/centralProd exporter, which happens to be an AMP endpoint, but is cross-account and cross-region (going through a proxy).
HTTP Response Times - prometheus-operated: This is the internal prometheusremotewrite/debug endpoint, which is a single internal Prometheus pod attached to the prometheus_beta pipeline that I used to introduce throttling.
Timeline
9:40:00: Everything is roughly humming along just fine...
9:44:00: Throttling is introduced to the prometheusremotewrite/debug exporter by reducing the CPU limits on the Prometheus pod. Latency starts to creep up.
10:07:00: We finally start to see the success rate for the prometheusremotewrite/debug exporter tank. Note that at this point the other two exporters are still operating just fine.
10:33:00: We now see a dip in the success rate for the prometheusremotewrite/centralProd and prometheusremotewrite/amp endpoints.
10:37:00: Success rate now tanks on all exporters other than the debug exporter.
11:12:00: I un-cork the Prometheus pod by removing its CPU limit and letting it restart. We see an immediate response in the latencies for the prometheusremotewrite/debug exporter (though latency does not drop enough, and starts climbing again).
Logs
Obviously we have lots of logs... but here are two graphs that are interesting. First, just the high-level graph of error log lines:
Rather than looking at the logs individually, I started looking at them in terms of two key messages... Forcing a GC and out of order errors:
We can see that at roughly 10:03 the rate of Forcing a GC messages starts climbing, and at 10:07 and 10:36 respectively we see corresponding jumps in the out of order error messages from the upstream Prometheus endpoints. We never see these metrics recover, even after we un-corked the downstream prometheusremotewrite/debug endpoint.
Finally - PPROF...
At @Aneurysm9's suggestion, I grabbed a profile and a few heap dumps from one pod during this time frame: 11:04:00, 11:11:00, 11:14:00, and 11:26:00.
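For anyone wanting to pull similar profiles, they can be served by the collector's pprof extension; a minimal sketch of enabling it (the port shown is the extension's default):

```yaml
extensions:
  pprof:
    # Serves the standard Go pprof endpoints (/debug/pprof/profile, /debug/pprof/heap).
    endpoint: 0.0.0.0:1777

service:
  extensions: [pprof]
```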
We can see in the CPU profile that most of the time is spent in GC:
When we look at the heap, we can see interesting memory usage in the prometheusremotewrite code:
Final thoughts
In this scenario, I expect the batch processor to prevent data from getting into the pipeline after the initial ~8-16k datapoints are collected and fail to send. Once they fail to send, I expect that pressure to push upstream all the way to the receiver. I then expect to see no real memory problems during this outage, just a blockage of the data going to the prometheusremotewrite/debug endpoint.
Instead, I believe we see a memory leak in the prometheusremotewrite code. When that leak happens, it has the downstream impact of eventually tripping the memory_limiter circuit breaker, which then starts forcing GCs... but these GCs can't reclaim the memory, so it just happens over and over again. This cycle then causes impact to the rest of the data pipeline flowing through the collector.
Lastly, I think this memory leak has some critical impact on the data payloads themselves sent to Prometheus, which then causes duplicate or out-of-order samples to be sent that normally would not be, and this further exacerbates the problem.
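For reference, the ~8-16k datapoint figure I'm reasoning about corresponds to batch processor settings along these lines; the values here are illustrative, not our exact configuration:

```yaml
processors:
  batch:
    # Flush once this many datapoints have accumulated...
    send_batch_size: 8192
    # ...or after this much time, whichever comes first.
    timeout: 5s
    # Hard cap on the size of a single outgoing batch.
    send_batch_max_size: 16384
```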