open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

exporter/prometheusremotewrite: wal leads to oom under high load #19363

Open sh0rez opened 1 year ago

sh0rez commented 1 year ago

**Describe the bug**
When running the prometheusremotewrite exporter with the WAL enabled under (very) high load (250k active series), it quickly builds up memory until the kernel OOM-kills otelcol.

**Steps to reproduce**

docker-compose.yml:

```yml
version: '2'
services:
  # generates 250k series to be scraped by otelcol
  avalanche:
    image: quay.io/prometheuscommunity/avalanche:main
    command:
      - --metric-count=1000
      - --series-count=250
      - --label-count=5
      - --series-interval=3600
      - --metric-interval=3600
  otel:
    image: otel/opentelemetry-collector
    volumes: [otel-cfg:/etc/otelcol]
    user: 0:0
    tmpfs:
      - /wal
    depends_on:
      otel-cfg:
        condition: service_completed_successfully
    mem_limit: 8G
    restart: always
  otel-cfg:
    image: alpine
    volumes: [otel-cfg:/etc/otelcol]
    command:
      - sh
      - -c
      - |
        cat - > /etc/otelcol/config.yaml << EOF
        receivers:
          prometheus:
            config:
              scrape_configs:
                - job_name: stress
                  scrape_interval: 15s
                  static_configs:
                    - targets:
                        - avalanche:9001
        processors:
          batch:
        exporters:
          prometheusremotewrite:
            endpoint: http://receiver:9090/api/v1/write
            wal:
              directory: /wal
        service:
          pipelines:
            metrics:
              receivers: [prometheus]
              processors: [batch]
              exporters: [prometheusremotewrite]
        EOF
  # dummy http server to "receive" remote_write samples by always replying with http 200
  receiver:
    image: caddy
    command: sh -c 'echo ":9090" > /tmp/Caddyfile && exec caddy run --config /tmp/Caddyfile'
  # prometheus observing resource usage of otelcol
  prometheus:
    image: prom/prometheus
    ports:
      - 9090:9090
    entrypoint: /bin/sh
    command:
      - -c
      - |
        cat - > prometheus.yml << EOF && /bin/prometheus
        global:
          scrape_interval: 5s
        scrape_configs:
          - job_name: otel
            static_configs:
              - targets:
                  - otel:8888
        EOF
volumes:
  otel-cfg: {}
```

**What did you expect to see?**
Otelcol having a high but (periodically) stable memory usage.

**What did you see instead?**

Otelcol repeatedly builds up memory until it is OOM-killed by the operating system, only to repeat this exact behavior.

[Screenshot from 2023-03-06 23-33-13]

**What version did you use?**
Version: Docker otel/opentelemetry-collector-contrib:0.72.0

**What config did you use?**
See the docker-compose.yml above.

**Environment**

docker info:

```
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  compose: Docker Compose (Docker Inc., 2.13.0)

Server:
 Containers: 31
  Running: 0
  Paused: 0
  Stopped: 31
 Images: 65
 Server Version: 20.10.21
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: false
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 770bd0108c32f3fb5c73ae1264f7e503fe7b2661.m
 runc version:
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.12.10-arch1-1
 Operating System: Arch Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 12
 Total Memory: 15.39GiB
 Name:
 ID:
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Username:
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
```

**Additional context**
This only occurs when WAL mode is enabled. Other Prometheus agents (Grafana Agent, Prometheus in Agent Mode) do not show this behavior on the exact same input data.

datsabk commented 1 year ago

+1 on this. We are observing even stranger behaviour: we have OTel agent pods running on nodes with just 5-7 pods, and memory usage still goes beyond 20GiB on some nodes. There seems to be no relation to what is being scraped; it looks like an uncaught memory leak is causing this.

sh0rez commented 1 year ago

I'm actively looking into this and will propose a fix once I've found the cause.

datsabk commented 1 year ago

We tried removing the memory_limiter processor and introducing the memory_ballast extension instead, to avoid dropping data. Does this look like the right approach to save on memory?

nicolastakashi commented 1 year ago

I don't know the exporter code, but there are some GitHub issues about Prometheus consuming a lot of memory to replay the WAL; might this be related?

frzifus commented 1 year ago

> I don't know the exporter code, but there are some GitHub issues about Prometheus consuming a lot of memory to replay the WAL; might this be related?

@nicolastakashi Can you link it?

**Update** I guess it's this one? https://github.com/prometheus/prometheus/issues/6934 / https://github.com/prometheus/prometheus/issues/10750

Seems it's coming from here:

```
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
github.com/prometheus/prometheus@v0.42.1-0.20230210113933-af1d9e01c7e4/scrape/scrape.go
  Total:           0     4.51GB (flat, cum) 86.96%
   1262            .     4.51GB           ???
```

pprof details: I [enabled pprof](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/e9977e11ea2db9b9ff563673519a903aa44540a0/extension/pprofextension) to see what's going on; the allocations show up under this scrape_loop.

![grafik](https://user-images.githubusercontent.com/10403402/224716750-78bfbe97-b70a-41f4-b839-36a398fcda7a.png)

Here is the profile: [profile.pb.gz](https://github.com/open-telemetry/opentelemetry-collector-contrib/files/10957885/profile.pb.gz). I assume I can continue by the end of this week.

sh0rez commented 1 year ago

@nicolastakashi: the otel-collector prometheusremotewriteexporter implements its own write-ahead log in https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/prometheusremotewriteexporter/wal.go, which is unrelated to any of the WAL implementations found in the Prometheus repositories.

I suspect that renders any Prometheus-specific discussions less relevant to us, sadly.
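
To illustrate what "its own write-ahead log" means here, a rough sketch of the general pattern, with made-up names (this is not the actual wal.go code): the push path appends the serialized remote-write request to an on-disk log and returns, while a separate goroutine drains the log towards the endpoint.

```go
package walsketch

// Rough, hypothetical sketch of an exporter-side write-ahead log -- not the
// actual wal.go implementation. push() persists a serialized write request
// and returns; drain() forwards logged entries to the remote endpoint.

import (
	"context"
	"fmt"
	"os"
	"sync"
)

type walExporter struct {
	mu     sync.Mutex
	log    *os.File      // append-only log file on disk
	notify chan struct{} // wakes the drain loop when new entries arrive
}

// push persists one serialized write request instead of sending it directly;
// the caller returns as soon as the entry has been appended to the log.
func (e *walExporter) push(payload []byte) error {
	e.mu.Lock()
	defer e.mu.Unlock()
	if _, err := e.log.Write(append(payload, '\n')); err != nil {
		return fmt.Errorf("wal append: %w", err)
	}
	// Non-blocking signal so push never waits on the drain loop.
	select {
	case e.notify <- struct{}{}:
	default:
	}
	return nil
}

// drain forwards logged entries to the remote endpoint until ctx is cancelled.
// Reading entries back from the log file is omitted for brevity.
func (e *walExporter) drain(ctx context.Context, send func([]byte) error) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-e.notify:
			_ = send(nil) // placeholder: would read the next batch and send it
		}
	}
}
```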


@frzifus: It surprises me that it shows the scrapeLoop, as I cannot observe any OOM behavior with the WAL disabled (default settings). However, from reading the code, the scrapeLoop should be the same regardless of the WAL setting ...

sh0rez commented 1 year ago

Using a profiler confirmed that the WAL is the culprit: there is a clear leak with the WAL enabled and none without.

Without WAL: [flamegraph screenshot, 2023-03-16 16-14-46]

With WAL: [flamegraph screenshot, 2023-03-16 16-14-58]

(Click the images to view the full flamegraphs.)


However, as observed before, the leaking memory appears to originate from the scrapeLoop of the receiver.

There must be a very non-obvious reason why the WAL keeps hold of that memory. Will keep digging.

gouthamve commented 1 year ago

This is a duplicate of https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/15277 unfortunately.

The following function deadlocks (can't get the mutex):

https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/cd146d550ac893312aa03fc9c7e6534804606498/exporter/prometheusremotewriteexporter/exporter.go#L177-L182

And this leads to the PushMetrics call never returning, so the pipeline backs up and the samples produced by the scrape are retained in memory. This builds up to an OOM.
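
As a minimal sketch of this kind of deadlock (not the actual exporter code; all names are hypothetical): one goroutine holds the WAL mutex while waiting for new entries, and the write path that PushMetrics would go through needs the same mutex to add them, so neither side can progress.

```go
package deadlocksketch

// Hypothetical illustration of the failure mode described above -- NOT the
// actual exporter code. waitForEntries holds the mutex while it waits; add
// (what the PushMetrics path would call) blocks on that mutex forever, the
// pipeline backs up, and scraped samples accumulate until an OOM kill.

import "sync"

type wal struct {
	mu      sync.Mutex
	entries [][]byte
}

// waitForEntries takes the lock and then waits for entries to appear -- while
// still holding the lock, which is the bug being illustrated.
func (w *wal) waitForEntries() [][]byte {
	w.mu.Lock()
	defer w.mu.Unlock()
	for len(w.entries) == 0 {
		// spins forever: add() below can never acquire w.mu
	}
	out := w.entries
	w.entries = nil
	return out
}

// add blocks on the mutex held by waitForEntries, so the exporter and
// everything upstream of it stall.
func (w *wal) add(entry []byte) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.entries = append(w.entries, entry)
}
```

With the push path blocked like this, everything queued upstream of the exporter keeps growing, which matches the profiles showing the leaked memory being allocated in the scrape loop.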

github-actions[bot] commented 1 year ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

Hronom commented 1 year ago

Hey, I think this needs to be resolved.

github-actions[bot] commented 7 months ago

This issue has been closed as inactive because it has been stale for 120 days with no activity.

harishkumarrajasekaran commented 2 months ago

Any updates on this issue? We are also facing the same: with the WAL enabled, metrics are not being sent to the remote destination and the collector gets OOMKilled.
