[prometheusremotewriteexporter] memory leak when downstream prometheus endpoint is slow/non-responsive leads to GC and "out of order" errors

diranged commented 5 months ago

Component(s)

exporter/prometheusremotewrite

What happened?

Description

I'm reporting what I think is a memory leak in the prometheusremotewriteexporter that is triggered when the downstream prometheus endpoint is either slow to respond or failing entirely. This leak ultimately puts the collector into a GC loop that never recovers, causing impact to all the work that the collector is doing, not just the pipeline with the downstream problems. I've spent the last ~2 days troubleshooting this with AWS and talking through it with @Aneurysm9.

Steps to Reproduce

In my test case - we have a set of OTEL Collectors called metric-aggregators which accept inbound OTLP Metric data (generally sourced from Prometheus Receivers) and write the data into two different pipelines - call it a production and a debug pipeline. The data going into these pipelines can be the same, or it can be totally unique. In this case, the data is unique... I have data=foo -> production and data=bar -> debug essentially.

Once the pipeline is humming along, introduce intentional throttling to the Prometheus endpoint on the debug pipeline - I did this by setting resources.requests.cpu=1 and resources.limits.cpu=1 ... and we're writing ~50-80k datapoints/sec, so that was enough to introduce throttling.

Expected Result

My expectation is that the debug pipeline will start failing requests (_I'd expect to see context deadline exceeded messages) - and data would ultimately be refused by the batch processor, which would in turn refuse data upstream. I expect the production pipeline to continue to operate just fine because there's no impact to its downstream targets.

Actual Result

Interestingly, we see impact that starts with the debug pipeline, but then spreads to all of the pipelines in the collector. After a period of time (~20-40m), the collectors are completely stuck and are in a GC loop triggered by the memory_limiter. Data then fails to write to the production pipeline. Additionally, when we un-clog the prometheus debug endpoint, the collector doesn't self recover ... it is stuck in this GC loop essentially indefinitely until we restart the pods.

Collector version

0.101.0

Environment information

Environment

OS: BottleRocket 1.19.4

OpenTelemetry Collector configuration

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector-metrics-aggregation
  namespace: otel
spec:
  args:
    feature-gates: +processor.resourcedetection.hostCPUSteppingAsString,+exporter.prometheusremotewritexporter.RetryOn429
  autoscaler:
    behavior:
      scaleDown:
        policies:
        - periodSeconds: 600
          type: Pods
          value: 1
        selectPolicy: Min
        stabilizationWindowSeconds: 900
      scaleUp:
        policies:
        - periodSeconds: 60
          type: Pods
          value: 4
        - periodSeconds: 60
          type: Percent
          value: 100
        selectPolicy: Max
        stabilizationWindowSeconds: 60
    maxReplicas: 24
    minReplicas: 3
    targetCPUUtilization: 60
    targetMemoryUtilization: 65
  config:
    exporters:
      debug:
        sampling_initial: 15
        sampling_thereafter: 60
      debug/verbose:
        sampling_initial: 15
        sampling_thereafter: 60
        verbosity: detailed
      prometheusremotewrite/amp:
        add_metric_suffixes: true
        auth:
          authenticator: sigv4auth
        endpoint: https://.../api/v1/remote_write
        max_batch_size_bytes: "1000000"
        remote_write_queue:
          num_consumers: 5
          queue_size: 50000
        resource_to_telemetry_conversion:
          enabled: true
        retry_on_failure:
          enabled: true
          initial_interval: 200ms
          max_elapsed_time: 60s
          max_interval: 5s
        send_metadata: false
        target_info:
          enabled: false
        timeout: 90s
      prometheusremotewrite/centralProd:
        add_metric_suffixes: true
        endpoint: https://.../api/v1/remote_write
        max_batch_size_bytes: "1000000"
        remote_write_queue:
          num_consumers: 5
          queue_size: 50000
        resource_to_telemetry_conversion:
          enabled: true
        retry_on_failure:
          enabled: true
          initial_interval: 200ms
          max_elapsed_time: 60s
          max_interval: 5s
        send_metadata: false
        target_info:
          enabled: false
        timeout: 90s
        tls:
          ca_file: /tls/ca.crt
          cert_file: /tls/tls.crt
          insecure_skip_verify: true
          key_file: /tls/tls.key
      prometheusremotewrite/debug:
        add_metric_suffixes: true
        endpoint: http://prometheus-operated:9090/api/v1/write
        max_batch_size_bytes: "1000000"
        remote_write_queue:
          num_consumers: 5
          queue_size: 50000
        resource_to_telemetry_conversion:
          enabled: true
        retry_on_failure:
          enabled: true
          initial_interval: 200ms
          max_elapsed_time: 60s
          max_interval: 5s
        send_metadata: false
        target_info:
          enabled: false
        timeout: 90s
        tls:
          insecure: true
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
      pprof:
        endpoint: :1777
      sigv4auth:
        region: us-west-2
    processors:
      attributes/common:
        actions:
        - action: upsert
          key: k8s.cluster.name
          value: test
      batch/prometheus:
        send_batch_max_size: 16384
        send_batch_size: 8192
        timeout: 5s
      filter/drop_istio_metrics:
        error_mode: ignore
        metrics:
          exclude:
            match_type: regexp
            resource_attributes:
            - key: service.name
              value: istio-system/envoy-stats-monitor
      filter/prometheus_beta:
        error_mode: ignore
        metrics:
          include:
            match_type: regexp
            resource_attributes:
            - key: _meta.level
              value: beta
      filter/prometheus_prod:
        error_mode: ignore
        metrics:
          include:
            match_type: regexp
            resource_attributes:
            - key: _meta.level
              value: prod
      memory_limiter:
        check_interval: 1s
        limit_percentage: 85
        spike_limit_percentage: 10
      transform/set_beta_or_prod_flag:
        error_mode: ignore
        metric_statements:
        - context: resource
          statements:
          - set(attributes["_meta.level"], "beta") where attributes["service.name"]
            == "istio-system/envoy-stats-monitor"
          - set(attributes["_meta.level"], "prod") where attributes["service.name"]
            == "... other metrics ..."
    receivers:
      otlp:
        protocols:
          grpc:
            max_recv_msg_size_mib: 128
            tls:
              ca_file: /tls/ca.crt
              cert_file: /tls/tls.crt
              client_ca_file: /tls/ca.crt
              key_file: /tls/tls.key
    service:
      extensions:
      - health_check
      - pprof
      - sigv4auth
      pipelines:
        metrics/prometheus_beta:
          exporters:
          - debug
          - prometheusremotewrite/debug
          processors:
          - memory_limiter
          - transform/set_beta_or_prod_flag
          - filter/prometheus_beta
          - batch/prometheus
          receivers:
          - otlp
        metrics/prometheus_prod:
          exporters:
          - prometheusremotewrite/amp
          - prometheusremotewrite/centralProd
          processors:
          - memory_limiter
          - transform/set_beta_or_prod_flag
          - filter/prometheus_prod
          - attributes/common
          - batch/prometheus
          receivers:
          - otlp
      telemetry:
        logs:
          level: info
        metrics:
          level: detailed
  daemonSetUpdateStrategy: {}
  deploymentUpdateStrategy: {}
  env:
  - name: GOMAXPROCS
    value: "8"
  - name: GOMEMLIMIT
    valueFrom:
      resourceFieldRef:
        divisor: "1"
        resource: limits.memory
  ingress:
    route: {}
  livenessProbe:
    failureThreshold: 30
    initialDelaySeconds: 60
    periodSeconds: 30
  managementState: managed
  mode: statefulset
  observability:
    metrics: {}
  podDisruptionBudget:
    maxUnavailable: 1
  readinessProbe:
    failureThreshold: 3
    periodSeconds: 15
  replicas: 24
  resources:
    limits:
      memory: 3Gi
    requests:
      cpu: "2"
      memory: 3Gi
  serviceAccount: otel-collector-metrics-aggregation
  upgradeStrategy: automatic
  volumeMounts:
  - mountPath: /tls/ca.crt
    name: tls-ca
    readOnly: true
    subPath: ca.crt
  - mountPath: /tls/tls.key
    name: tls
    readOnly: true
    subPath: tls.key
  - mountPath: /tls/tls.crt
    name: tls
    readOnly: true
    subPath: tls.crt
  volumes:
  - name: tls-ca
    secret:
      defaultMode: 420
      items:
      - key: ca.crt
        path: ca.crt
      secretName: otel-collector-cacert
  - name: tls
    secret:
      defaultMode: 420
      items:
      - key: tls.crt
        path: tls.crt
      - key: tls.key
        path: tls.key
      secretName: otel-collector-metrics-v2

Additional context

Setting the Scene

I think the only way to explain the flow here is to start with a picture, and then talk through the timeline. In this picture, we have 5 graphs that are important to see at the same time.

Metric Datapoints Exported: This is the graph of successful exported metrics per exporter. The blue and yellow lines are two exporters connected to the metrics/prometheus_prod pipeline. They are sending "production" data that we've validated. The orange line is the metrics/prometheus_beta pipeline that is sending data we haven't yet validated - but a high volume of it. The green line is a debug output, it can be ignored.
Percentage of Metrics Exported: This is the success-rate graph for each of the exporters described above.
HTTP Response Times - ...amazonaws.com: This is the response time graph for the prometheusremotewrite/amp exporter (a local AMP endpoint in the same region/account)
HTTP Response Times - ...com: This is the exporter prometheusremotewrite/centralProd which happens to be an AMP endpoint, but is cross-account and region (going through a proxy).
HTTP Response Times - prometheus-operated: This is the internal prometheusremotewrite/debug endpoint which is a single internal prometheus pod attached to the prometheus_beta pipeline that I used to introduce throttling.

Screenshot 2024-05-31 at 12 08 22 PM

Timeline

9:40:00: Everything is roughly humming along just fine....
9:44:00: Throttling is introduced to the prometheusremotewrite/debug exporter by reducing the CPU limits on the Prometheus pod. Latency starts to creep up.
10:07:00: We finally start to see the success-rate for the prometheusremotewrite/debug exporter tank. Note at this point the other two exporters are still operating just fine.
10:33:00: We see a dip in the success-rate for the prometheusremotewrite/centralProd and prometheusremotewrite/amp endpoints now.
10:37:00: Success rate tanks on all exporters now other than the debug exporter.
...
11:12:00: I un-cork the Prometheus Pod by unlimiting its CPU and letting it restart. We see immediate response to the latencies for the prometheusremotewrite/debug exporter (though, it does not drop enough, and starts climbing again)
... recovery never happens automatically...

Logs

Obviously we have lots of logs ... but here are two graphs that are interesting. First, just the high level graph of error log lines:

Rather than looking at the logs individually, I started looking at them in terms of two key messages... Forcing a GC and out of order errors:

We can see that roughly at 10:03 we start seeing the Forcing a GC message rates start climbing, and at 10:07 and 10:36 respectively we see corresponding jumps in the out of order error messages from the upstream Prometheus endpoints. We never see an active recovery of these metrics, even after we un-corked the downstream prometheusremotewrite/debug endpoint.

Finally - PPROF...

At @Aneurysm9's suggestion, I grabbed a profile and a few heap dumps from one pod during this time frame:

profile.pb.gz: Taken at 11:04:00
heap.pb.gz: Taken at 11:11:00
heap-after-fix.gz: Taken after I uncorked prometheus at 11:14:00
heap-after-fix-2.gz: Taken at 11:26:00

We can see in the CPU profile that most of the time is spent in GC:

When we look at the heap we can see an interesting memory usage in the prometheusremotewrite code:

Final thoughts

In this scenario, I expect the batch processor to prevent data from getting into the pipeline after the initial ~8-16k datapoints are collected and are failing to send. Once they fail to send, I expect that pressure to push upstream all the way to the receiver. I then expect to not really see any memory problems during this outage, just a blockage of the data going to the prometheusremotewrite/debug endpoint.

Instead I believe we see a memory leak in the prometheusremotewrite code. When that leak happens, it has the downstream impact of eventually tripping the memory_limiter circuit breaker which then starts forcing GCs ... but these GCs can't recover the data, so it just happens over and over and over again. This cycle then causes impact to the rest of the data pipeline flowing through the collector.

Lastly, I think this memory leak has some critical impact to the data payloads themselves sent to Prometheus which then causes duplicate or out-of-order samples to be sent that normally would not be, and this further exasterbates the problem.

github-actions[bot] commented 5 months ago

Pinging code owners:

exporter/prometheusremotewrite: @Aneurysm9 @rapphil

See Adding Labels via Comments if you do not have permissions to add labels yourself.

diranged commented 5 months ago

Oh there is a secondary question here - why does the p99 graph pin out at 10s... all of my timeouts are far higher than that. I found that pretty suspicious too.

github-actions[bot] commented 3 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

exporter/prometheusremotewrite: @Aneurysm9 @rapphil

See Adding Labels via Comments if you do not have permissions to add labels yourself.

rossng commented 3 months ago

I believe I have encountered a similar thing. When throttling the output bandwidth of opentelemetry-collector using nftables, I accidentally also applied this to packets being written into a local Prometheus instance.

The otelcontribcol process would rapidly increase in memory usage (before taking down the whole system). It did not happen any more when I disabled the prometheusremotewrite exporter or disabled the firewall rule.

diranged commented 2 months ago

We continue to see this... we're experimenting with a different Prometheus backend, and any time that backend has its push-latency increase into the 100's of ms, the memory footprint on our otel collector pods triples and then they start having to retry and finally start throwing out-of-order write issues (because our primary backed is Amazon AMP).

dashpole commented 1 month ago

It isn't clear to me if this is a memory leak, or if you are simply getting extremely close to your memory limit and constantly GC'ing. I suspect the memory_limiter's additional triggering of GCs is similar to the thrashing behavior described in https://tip.golang.org/doc/gc-guide#Memory_limit.

For a memory leak, I would expect to see memory growth over long periods of time, ending with an inevitable OOM even when you have very little throughput.

verejoel commented 2 weeks ago

We're also seeing this with a very similar setup. Our backend is Thanos. We observe that below a certain incoming request rate of data points on the collector (around 5k/second) the queue does not grow. Once the request rate is above 5k/second the queue starts to grow and we run into out of order errors in Thanos.

verejoel commented 2 weeks ago

Here you can see the requests per OTel collector, the queue size for each pod, and the occurrence of out of order requests on the Thanos ingesters.

grafik

In this scenario, we had one pod that was completely stuck in a GC loop, and 7 other pods that were "healthy" in that they were able to keep their queue under control. Here's the memory and CPU for the pods:

grafik

The stuck pod was constantly GCing:

Showing top 10 nodes out of 80
      flat  flat%   sum%        cum   cum%
   142.84s 17.10% 17.10%    142.84s 17.10%  runtime.scanobject /usr/local/go/src/runtime/mgcmark.go:1446
    83.31s  9.97% 27.07%     89.78s 10.75%  runtime.findObject /usr/local/go/src/runtime/mbitmap.go:1291
    66.84s  8.00% 35.07%     66.84s  8.00%  runtime.(*mspan).base /usr/local/go/src/runtime/mheap.go:492 (inline)
    54.07s  6.47% 41.54%     59.10s  7.07%  runtime.(*gcBits).bitp /usr/local/go/src/runtime/mheap.go:2271
    40.23s  4.82% 46.36%     40.23s  4.82%  runtime.findObject /usr/local/go/src/runtime/mbitmap.go:1279
    27.04s  3.24% 49.60%     27.04s  3.24%  runtime.(*mspan).heapBitsSmallForAddr /usr/local/go/src/runtime/mbitmap.go:629
    25.06s  3.00% 52.60%     25.06s  3.00%  runtime.gcDrain /usr/local/go/src/runtime/mgcmark.go:1161
    23.31s  2.79% 55.39%     30.10s  3.60%  runtime.scanobject /usr/local/go/src/runtime/mgcmark.go:1437
    16.53s  1.98% 57.36%     16.53s  1.98%  runtime.greyobject /usr/local/go/src/runtime/mgcmark.go:1587
    12.52s  1.50% 58.86%     12.52s  1.50%  runtime.findObject /usr/local/go/src/runtime/mbitmap.go:1305

Healthy pods were fine:

Showing top 10 nodes out of 278
      flat  flat%   sum%        cum   cum%
     890ms  7.87%  7.87%      890ms  7.87%  github.com/prometheus/prometheus/prompb.(*TimeSeries).Size /usr/src/packages/pkg/mod/github.com/prometheus/prometheus@v0.54.1/prompb/types.pb.go:2203
     490ms  4.33% 12.20%      490ms  4.33%  github.com/prometheus/prometheus/prompb.(*Label).MarshalToSizedBuffer /usr/src/packages/pkg/mod/github.com/prometheus/prometheus@v0.54.1/prompb/types.pb.go:1712
     400ms  3.54% 15.74%      400ms  3.54%  runtime.scanobject /usr/local/go/src/runtime/mgcmark.go:1446
     340ms  3.01% 18.74%      340ms  3.01%  runtime.memclrNoHeapPointers /usr/local/go/src/runtime/memclr_amd64.s:127
     300ms  2.65% 21.40%      300ms  2.65%  compress/flate.(*compressor).reset /usr/local/go/src/compress/flate/deflate.go:615
     280ms  2.48% 23.87%      280ms  2.48%  github.com/open-telemetry/opentelemetry-collector-contrib/pkg/translator/prometheusremotewrite.ByLabelName.Swap /usr/src/packages/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/pkg/translator/prometheusremotewrite@v0.111.0/helper.go:61
     270ms  2.39% 26.26%      270ms  2.39%  internal/runtime/syscall.Syscall6 /usr/local/go/src/internal/runtime/syscall/asm_linux_amd64.s:36
     220ms  1.95% 28.21%      230ms  2.03%  github.com/prometheus/prometheus/prompb.sovTypes /usr/src/packages/pkg/mod/github.com/prometheus/prometheus@v0.54.1/prompb/types.pb.go:2380
     150ms  1.33% 29.53%      150ms  1.33%  runtime.futex /usr/local/go/src/runtime/sys_linux_amd64.s:558
     140ms  1.24% 30.77%      140ms  1.24%  github.com/prometheus/prometheus/prompb.(*TimeSeries).Size /usr/src/packages/pkg/mod/github.com/prometheus/prometheus@v0.54.1/prompb/types.pb.go:2209

I then deleted the stuck pod. This caused the HPA to scale in the replicaset, as the average CPU load dropped dramatically without the constant GCing. The moment we scaled in, the requests per pod went up, and we immediately started to see the queues grow again without recovery.

grafik

Eventually as the queue fills up, we seem to hit memeory limit which triggers the GC cycle and then that's it, the pod is lost.

open-telemetry / opentelemetry-collector-contrib