sh0rez opened this issue 1 year ago
+1 on this. We are observing even stranger behaviour: we have OTel agent pods running on nodes with just 5-7 pods, and memory usage still goes beyond 20 GiB on some nodes. There seems to be no relation to what the agent is scraping; it looks like an uncaught memory leak.
I'm actively looking into this and will propose a fix once I find the cause.
We tried removing the memory_limiter and introducing the memory_ballast extension instead, to avoid dropping data. Does this look like the right approach to save on memory?
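For reference, a minimal sketch of that kind of setup, using the stock memory_ballast extension and memory_limiter processor (the sizes are placeholders, not recommendations; component names follow the upstream collector docs):

```yml
extensions:
  # pre-allocates heap to smooth out Go GC behaviour; it does not cap memory usage
  memory_ballast:
    size_mib: 2048

processors:
  # alternative/complement: refuses data under memory pressure instead of OOMing
  memory_limiter:
    check_interval: 1s
    limit_mib: 4000
    spike_limit_mib: 800

service:
  extensions: [memory_ballast]
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite]
```

Note that a ballast only tunes garbage-collection behaviour; it cannot bound memory that the pipeline actually retains, so it would not prevent the kind of unbounded growth discussed later in this thread.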
I didn't know about the exporter code, but there are some GitHub issues about Prometheus consuming a lot of memory to replay the WAL; may this be related?
@nicolastakashi Can you link it?
Update: I guess it's one of these? https://github.com/prometheus/prometheus/issues/6934, https://github.com/prometheus/prometheus/issues/10750
Seems it's coming from here:
```
github.com/prometheus/prometheus/scrape.(*scrapeLoop).run
github.com/prometheus/prometheus@v0.42.1-0.20230210113933-af1d9e01c7e4/scrape/scrape.go
  Total:      0   4.51GB (flat, cum) 86.96%
   1262       .   4.51GB          ???
```
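For anyone trying to reproduce this: a heap profile like the one above can be captured by enabling the collector's pprof extension (bundled in the contrib distribution). A sketch of the pieces to add to the existing config, assuming the default endpoint:

```yml
extensions:
  pprof:
    endpoint: localhost:1777   # default listen address of the pprof extension

service:
  extensions: [pprof]
  # ... existing pipelines unchanged ...
```

and then pointing `go tool pprof` at `http://localhost:1777/debug/pprof/heap` while the collector is under load.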
@nicolastakashi: the otel-collector prometheusremotewriteexporter implements its own write-ahead log in https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/prometheusremotewriteexporter/wal.go, which is unrelated to any WAL implementation found in the Prometheus repositories. I suspect that makes the Prometheus-specific discussions less relevant to us, sadly.
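For clarity, the WAL under discussion is configured entirely on the exporter side; a sketch of the relevant block, with field names as documented in the exporter README (values shown are illustrative and may differ between versions):

```yml
exporters:
  prometheusremotewrite:
    endpoint: http://receiver:9090/api/v1/write
    wal:                      # omit this block entirely to run without the WAL
      directory: /wal         # where the exporter keeps its own WAL files
      buffer_size: 300        # number of WAL entries read per batch
      truncate_frequency: 1m  # how often the WAL is truncated
```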
@frzifus: It surprises me that it shows the scrapeLoop, as I cannot observe any OOM behavior when the WAL is disabled (default settings). However, from reading the code, the scrapeLoop should be the same regardless of the WAL setting...

Using a profiler confirmed the indication that the WAL is the culprit: there is a clear leak with the WAL enabled and none without.
However, as observed before, the leaking memory appears to originate from the scrapeLoop of the receiver. There must be a very non-obvious reason why the WAL keeps a hold on that memory; will keep digging.
This is a duplicate of https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/15277 unfortunately.
The following function deadlocks (it can never acquire the mutex):

This leads to the PushMetrics call never returning, which means the pipeline gets backed up and we keep retaining the samples (produced from scrape) in the pipeline. This builds up to an OOM.
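To illustrate the mechanism: below is a minimal, self-contained sketch, not the collector's actual WAL code; the names walSketch and pushMetrics are made up for this illustration.

```go
// Sketch of the failure mode: if the export path blocks forever on a lock,
// the consume call never returns, upstream keeps queueing scraped samples,
// and memory grows until the process is OOM-killed.
package main

import (
	"fmt"
	"sync"
	"time"
)

type walSketch struct {
	mu      sync.Mutex
	pending [][]byte // samples waiting to be written out
}

// pushMetrics stands in for the exporter's consume path: it needs the lock
// to append to the WAL. If another goroutine never releases that lock,
// this call never returns.
func (w *walSketch) pushMetrics(batch []byte) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.pending = append(w.pending, batch)
}

func main() {
	w := &walSketch{}

	// Simulate the stuck WAL goroutine: it grabs the lock and never releases it.
	w.mu.Lock()

	done := make(chan struct{})
	go func() {
		w.pushMetrics(make([]byte, 1024)) // blocks forever on w.mu
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("push returned (not expected in this sketch)")
	case <-time.After(2 * time.Second):
		fmt.Println("push never returned: scraped samples keep piling up upstream")
	}
}
```

In the real pipeline the blocked call is PushMetrics, and the retained data is the scraped samples queued behind it, which is exactly the OOM pattern described above.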
This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Hey I think this needs to be resolved
This issue has been closed as inactive because it has been stale for 120 days with no activity.
Any updates on this issue? We are also facing the same: with the WAL enabled, metrics are not being sent to the remote destination and the collector gets OOMKilled.
Describe the bug
When running the prometheusremotewrite exporter in WAL-enabled mode under (very) high load (250k active series), it quickly builds up memory until the kernel OOM-kills otelcol.

Steps to reproduce
docker-compose.yml:

```yml
version: '2'
services:
  # generates 250k series to be scraped by otelcol
  avalanche:
    image: quay.io/prometheuscommunity/avalanche:main
    command:
      - --metric-count=1000
      - --series-count=250
      - --label-count=5
      - --series-interval=3600
      - --metric-interval=3600

  otel:
    image: otel/opentelemetry-collector
    volumes: [otel-cfg:/etc/otelcol]
    user: 0:0
    tmpfs:
      - /wal
    depends_on:
      otel-cfg:
        condition: service_completed_successfully
    mem_limit: 8G
    restart: always

  otel-cfg:
    image: alpine
    volumes: [otel-cfg:/etc/otelcol]
    command:
      - sh
      - -c
      - |
        cat - > /etc/otelcol/config.yaml << EOF
        receivers:
          prometheus:
            config:
              scrape_configs:
                - job_name: stress
                  scrape_interval: 15s
                  static_configs:
                    - targets:
                        - avalanche:9001
        processors:
          batch:
        exporters:
          prometheusremotewrite:
            endpoint: http://receiver:9090/api/v1/write
            wal:
              directory: /wal
        service:
          pipelines:
            metrics:
              receivers: [prometheus]
              processors: [batch]
              exporters: [prometheusremotewrite]
        EOF

  # dummy http server to "receive" remote_write samples by always replying with http 200
  receiver:
    image: caddy
    command: sh -c 'echo ":9090" > /tmp/Caddyfile && exec caddy run --config /tmp/Caddyfile'

  # prometheus observing resource usage of otelcol
  prometheus:
    image: prom/prometheus
    ports:
      - 9090:9090
    entrypoint: /bin/sh
    command:
      - -c
      - |
        cat - > prometheus.yml << EOF && /bin/prometheus
        global:
          scrape_interval: 5s
        scrape_configs:
          - job_name: otel
            static_configs:
              - targets:
                  - otel:8888
        EOF

volumes:
  otel-cfg: {}
```
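To run the reproduction, save the file as docker-compose.yml and start it with `docker compose up -d`; the bundled Prometheus instance on http://localhost:9090 scrapes otelcol's own telemetry endpoint (`otel:8888`), where a metric such as `otelcol_process_memory_rss` (name as exposed by the collector's internal telemetry; verify against your version) shows the repeated climb-and-crash pattern.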
What did you expect to see?
otelcol having a (high) but (periodically) stable memory usage.

What did you see instead?
otelcol repeatedly builds up memory until it is OOM-killed by the operating system, only to repeat this exact behavior.
What version did you use?
Version: Docker, otel/opentelemetry-collector-contrib:0.72.0

What config did you use?
See the docker-compose.yml above.
Environment

docker info:

```
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  compose: Docker Compose (Docker Inc., 2.13.0)

Server:
 Containers: 31
  Running: 0
  Paused: 0
  Stopped: 31
 Images: 65
 Server Version: 20.10.21
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: false
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 770bd0108c32f3fb5c73ae1264f7e503fe7b2661.m
 runc version:
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.12.10-arch1-1
 Operating System: Arch Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 12
 Total Memory: 15.39GiB
 Name:
```
Additional context
This only occurs when enabling WAL mode. Other Prometheus agents (Grafana Agent, Prometheus Agent Mode) do not show this behavior on the exact same input data.