open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0
3.09k stars 2.38k forks source link

`otelcol-contrib` suddenly creates massive disk i/o #36180

Open jcpunk opened 1 week ago

jcpunk commented 1 week ago

Component(s)

No response

What happened?

Description

The otelcol-contrib binary suddenly is doing a whole lot of disk IO with no real reason/cause

Steps to Reproduce

Unknown, it just sorta happens after the container has been up for a week or so since v0.107.0 (first version I ever used)

Expected Result

Actual Result

Total DISK READ :     272.86 M/s | Total DISK WRITE :     240.21 K/s
Actual DISK READ:     272.39 M/s | Actual DISK WRITE:     272.08 M/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND                              
2564304 be/7 monitor    23.19 M/s    0.00 B/s 88.38 % 84.87 % otelcol-contrib --c~tor/otel-config.yaml
2566875 be/7 monitor    22.57 M/s    0.00 B/s 88.26 % 84.77 % otelcol-contrib --c~tor/otel-config.yaml
2566878 be/7 monitor    23.14 M/s    0.00 B/s 87.83 % 84.57 % otelcol-contrib --c~tor/otel-config.yaml
2569174 be/7 monitor    22.75 M/s    0.00 B/s 87.50 % 84.21 % otelcol-contrib --c~tor/otel-config.yaml
2562628 be/7 monitor    22.76 M/s    0.00 B/s 87.55 % 84.12 % otelcol-contrib --c~tor/otel-config.yaml
2564303 be/7 monitor    22.81 M/s    0.00 B/s 87.38 % 83.85 % otelcol-contrib --c~tor/otel-config.yaml
2559375 be/7 monitor    23.09 M/s    0.00 B/s 87.10 % 83.83 % otelcol-contrib --c~tor/otel-config.yaml
2562623 be/7 monitor    22.39 M/s    0.00 B/s 87.50 % 83.78 % otelcol-contrib --c~tor/otel-config.yaml
2562622 be/7 monitor    22.97 M/s    0.00 B/s 87.07 % 83.72 % otelcol-contrib --c~tor/otel-config.yaml
2566879 be/7 monitor    22.53 M/s    0.00 B/s 86.88 % 83.33 % otelcol-contrib --c~tor/otel-config.yaml
2566874 be/7 monitor    22.86 M/s    0.00 B/s 87.11 % 83.21 % otelcol-contrib --c~tor/otel-config.yaml
2569170 be/7 monitor    21.75 M/s    0.00 B/s 85.48 % 82.03 % otelcol-contrib --c~tor/otel-config.yaml
2569175 be/7 monitor    38.29 K/s    0.00 B/s  0.14 %  0.13 % otelcol-contrib --c~tor/otel-config.yaml
2559388 be/7 monitor     6.96 K/s    0.00 B/s  0.07 %  0.07 % otelcol-contrib --c~tor/otel-config.yaml

Collector version

v0.112.0

Environment information

Environment

OS: Almalinux 9 Podman: ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:latest

OpenTelemetry Collector configuration

---
processors:
  batch: {}
  transform/hostname:
    metric_statements:
    - context: datapoint
      statements:
      - set(attributes["nodename"], "mu2e-dcs-01.fnal.gov")
      - set(resource.attributes["nodename"], "mu2e-dcs-01.fnal.gov")
exporters:
  prometheus:
    endpoint: "[::]:9299"
    enable_open_metrics: true
    metric_expiration: 2m
service:
  pipelines:
    metrics:
      receivers:
      - prometheus
      processors:
      - transform/hostname
      - batch
      exporters:
      - prometheus
receivers:
  prometheus:
    config:
      scrape_configs:
      - job_name: node-exporter
        scrape_interval: 45s
        static_configs:
        - targets:
          - localhost:9100
          labels:
            instance: mu2e-dcs-01.fnal.gov:9100
      - job_name: systemd-exporter
        scrape_interval: 45s
        static_configs:
        - targets:
          - localhost:9558
          labels:
            instance: mu2e-dcs-01.fnal.gov:9558

Log output

<30>1 2024-11-04T00:12:42.604170-06:00 mu2e-dcs-01.fnal.gov podman 805977 - -             otel-collector.service               2fbf0ebdf1a2 (systemd-otel-collector)               ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:latest  registry    false
<11>1 2024-11-04T15:24:51.816655-06:00 mu2e-dcs-01.fnal.gov systemd-otel-collector 2559367 - - 2024-11-04T21:24:51.757Z#011warn#011internal/transaction.go:128#011Failed to scrape Prometheus endpoint#011{"kind": "receiver", "name": "prometheus", "data_type": "metrics", "scrape_timestamp": 1730755481211, "target_labels": "{__name__=\"up\", instance=\"mu2e-dcs-01.fnal.gov:9100\", job=\"node-exporter\"}"}

Additional context

[root@mu2e-dcs-01 ~]# cat /etc/containers/systemd/otel-collector.container

[Service]
TimeoutStartSec=900
TimeoutStopSec=30
CPUWeight=30
MemoryMax=256M
IOSchedulingClass=best-effort
IOSchedulingPriority=7
IOWeight=30
Restart=always

[Container]
AutoUpdate=registry
DropCapability=ALL
User=5219
Group=8247
HostName=%H
LogDriver=journald
NoNewPrivileges=true
Pull=missing
ReadOnly=true
Volume=/etc/otel-collector:/etc/otel-collector:ro,rslave,z
Exec=--config /etc/otel-collector/otel-config.yaml

Image=ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:latest
Network=host
PublishPort=[::]:9299:9299

[Install]
WantedBy=default.target

strace of process for little bit:

[root@mu2e-dcs-01 ~]# strace -f -p 2559369
strace: Process 2559369 attached with 53 threads
[pid 2571898] futex(0xc002400848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569175] futex(0xc00129d248, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569174] futex(0xc004000f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569173] futex(0xc003f81648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569172] futex(0xc003f01648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569171] futex(0xc003e81648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2569169] futex(0xc003d01648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569168] futex(0xc00320d648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569167] futex(0xc003e00848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566881] futex(0xc001381d48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566880] futex(0xc00129c448, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566879] futex(0xc004000848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566878] futex(0xc003f80f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566877] futex(0xc003f00f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566876] futex(0xc003e80f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566875] futex(0xc003d80f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566874] futex(0xc003d00f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566873] futex(0xc00320cf48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564304] futex(0xc003f80848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564303] futex(0xc003f00848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564302] futex(0xc003e80848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564301] futex(0xc003d80848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564300] futex(0xc003d00848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564299] futex(0xc00320c148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564298] futex(0xc001202448, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562628] futex(0xc004000148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562627] futex(0xc003f80148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562626] futex(0xc003f00148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562625] futex(0xc003e80148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562624] futex(0xc003e00148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562623] futex(0xc003d80148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562622] futex(0xc003d00148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559396] futex(0x1000f5b8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559390] futex(0xc002600148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559389] futex(0xc002480148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559387] futex(0xc002400148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559386] futex(0xc002380848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559385] futex(0xc002380148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559384] futex(0xc000147948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559383] futex(0xc001400848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559381] futex(0xc001400148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559380] futex(0xc001300848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559379] futex(0xc001380148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559378] futex(0xc001300148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559377] futex(0xc001280148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559376] futex(0x1000f860, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559375] futex(0xc000180148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559374] futex(0xc000144f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559373] futex(0xc000144848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 2559369] futex(0xffdae20, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... restart_syscall resumed>) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190216951} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 190, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2569170] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2559388] <... futex resumed>)      = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=189637237} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] <... futex resumed>)      = 0
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190156915} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2559388] <... futex resumed>)      = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] epoll_pwait(4,  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2559388] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190100964} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2559388] <... futex resumed>)      = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190091615} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] <... futex resumed>)      = 0
[pid 2569170] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2559388] epoll_pwait(4,  <unfinished ...>
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190280753} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2559388] <... futex resumed>)      = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190496487} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] <... futex resumed>)      = 0
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190536362} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2559388] <... futex resumed>)      = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] epoll_pwait(4,  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2559388] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190177556} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2559388] <... futex resumed>)      = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190050015} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559388] <... futex resumed>)      = 0
[pid 2559388] futex(0xc00129a148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] futex(0xffe4680, FUTEX_WAIT_PRIVATE, 0, {tv_sec=0, tv_nsec=190192720} <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 199, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... futex resumed>)      = -1 ETIMEDOUT (Connection timed out)
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 1, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc00129a148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2569170] epoll_pwait(4,  <unfinished ...>

strace a little later

[root@mu2e-dcs-01 ~]# strace -f -p 2559369
strace: Process 2559369 attached with 53 threads
[pid 2571898] futex(0xc002400848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569173] futex(0xc003f81648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569172] futex(0xc003f01648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569171] futex(0xc003e81648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2569169] futex(0xc003d01648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569168] futex(0xc00320d648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569167] futex(0xc003e00848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566881] futex(0xc001381d48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566880] futex(0xc00129c448, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566879] futex(0xc004000848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566877] futex(0xc003f00f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566876] futex(0xc003e80f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2566873] futex(0xc00320cf48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564302] futex(0xc003e80848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564301] futex(0xc003d80848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564300] futex(0xc003d00848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564299] futex(0xc00320c148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2564298] futex(0xc001202448, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562627] futex(0xc003f80148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562626] futex(0xc003f00148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562624] futex(0xc003e00148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2562622] futex(0xc003d00148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559396] futex(0x1000f5b8, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559390] futex(0xc002600148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559389] futex(0xc002480148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559387] futex(0xc002400148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559386] futex(0xc002380848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559385] futex(0xc002380148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559384] futex(0xc000147948, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559383] futex(0xc001400848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559381] futex(0xc001400148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559380] futex(0xc001300848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559379] futex(0xc001380148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559378] futex(0xc001300148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559377] futex(0xc001280148, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559376] futex(0x1000f860, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559374] futex(0xc000144f48, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559373] futex(0xc000144848, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] restart_syscall(<... resuming interrupted read ...> <unfinished ...>
[pid 2559369] futex(0xffdae20, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 0, NULL, 0) = 0
[pid 2569170] epoll_pwait(4,  <unfinished ...>
[pid 2559372] <... restart_syscall resumed>) = 0
[pid 2559372] getpid()                  = 1
[pid 2559372] tgkill(1, 5, SIGURG)      = 0
[pid 2559372] getpid()                  = 1
[pid 2559375] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2559372] tgkill(1, 24, SIGURG <unfinished ...>
[pid 2559375] rt_sigreturn({mask=[]} <unfinished ...>
[pid 2559372] <... tgkill resumed>)     = 0
[pid 2562625] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2559375] <... rt_sigreturn resumed>) = 825556758272
[pid 2559372] getpid( <unfinished ...>
[pid 2562625] rt_sigreturn({mask=[]} <unfinished ...>
[pid 2559372] <... getpid resumed>)     = 1
[pid 2562625] <... rt_sigreturn resumed>) = 825193762800
[pid 2559372] tgkill(1, 52, SIGURG)     = 0
[pid 2569175] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2559372] getpid( <unfinished ...>
[pid 2569175] rt_sigreturn({mask=[]} <unfinished ...>
[pid 2559372] <... getpid resumed>)     = 1
[pid 2569175] <... rt_sigreturn resumed>) = 825300078528
[pid 2559372] tgkill(1, 33, SIGURG)     = 0
[pid 2559372] getpid()                  = 1
[pid 2559372] tgkill(1, 34, SIGURG)     = 0
[pid 2559372] getpid( <unfinished ...>
[pid 2564304] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2559372] <... getpid resumed>)     = 1
[pid 2564304] rt_sigreturn({mask=[]} <unfinished ...>
[pid 2559372] tgkill(1, 27, SIGURG <unfinished ...>
[pid 2564304] <... rt_sigreturn resumed>) = 139719454742824
[pid 2559372] <... tgkill resumed>)     = 0
[pid 2559372] getpid()                  = 1
[pid 2559372] tgkill(1, 37, SIGURG)     = 0
[pid 2559372] getpid( <unfinished ...>
[pid 2566875] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2559372] <... getpid resumed>)     = 1
[pid 2566875] rt_sigreturn({mask=[]} <unfinished ...>
[pid 2559372] tgkill(1, 36, SIGURG <unfinished ...>
[pid 2566875] <... rt_sigreturn resumed>) = 139719506348232
[pid 2559372] <... tgkill resumed>)     = 0
[pid 2559372] getpid()                  = 1
[pid 2559372] tgkill(1, 17, SIGURG)     = 0
[pid 2559372] getpid()                  = 1
[pid 2559372] tgkill(1, 22, SIGURG)     = 0
[pid 2559372] nanosleep({tv_sec=0, tv_nsec=10000000},  <unfinished ...>
[pid 2569170] <... epoll_pwait resumed>[], 128, 6, NULL, 0) = 0
[pid 2569170] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2569170] futex(0xc003e00148, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 2562624] <... futex resumed>)      = 0
[pid 2562624] epoll_pwait(4, [], 128, 0, NULL, 0) = 0
[pid 2562624] epoll_pwait(4,  <unfinished ...>
[pid 2564303] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2564303] rt_sigreturn({mask=[]})   = 139719544750728
[pid 2566874] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2566874] rt_sigreturn({mask=[]} <unfinished ...>
[pid 2562623] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2562623] rt_sigreturn({mask=[]} <unfinished ...>
[pid 2559388] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2562623] <... rt_sigreturn resumed>) = 826248941504
[pid 2559388] rt_sigreturn({mask=[]} <unfinished ...>
[pid 2566874] <... rt_sigreturn resumed>) = 825087975040
[pid 2559388] <... rt_sigreturn resumed>) = 826291204000
[pid 2562628] --- SIGURG {si_signo=SIGURG, si_code=SI_TKILL, si_pid=1, si_uid=5219} ---
[pid 2562628] rt_sigreturn({mask=[]})   = 827566583712
[pid 2569170] futex(0xc003d81648, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2559372] <... nanosleep resumed>NULL) = 0
[pid 2559372] getpid()                  = 1
[pid 2559372] tgkill(1, 5, SIGURG)      = 0
[pid 2559372] getpid()                  = 1
[pid 2559372] tgkill(1, 24, SIGURG)     = 0
[pid 2559372] getpid()                  = 1
[pid 2559372] tgkill(1, 52, SIGURG)     = 0

lsof for dedicated user:

[root@mu2e-dcs-01 ~]# lsof -u monitor
COMMAND       PID    USER   FD      TYPE    DEVICE  SIZE/OFF      NODE NAME
otelcol-c 2559369 monitor  cwd       DIR     0,117        84  33554591 /
otelcol-c 2559369 monitor  rtd       DIR     0,117        84  33554591 /
otelcol-c 2559369 monitor  txt       REG     0,117 263930008 100672926 /otelcol-contrib
otelcol-c 2559369 monitor  mem       REG     253,2           100672926 /otelcol-contrib (stat: No such file or directory)
otelcol-c 2559369 monitor    0u      CHR       1,3       0t0         5 /dev/null
otelcol-c 2559369 monitor    1w     FIFO      0,13       0t0 108577923 pipe
otelcol-c 2559369 monitor    2w     FIFO      0,13       0t0 108577924 pipe
otelcol-c 2559369 monitor    3u     IPv4 108567751       0t0       TCP localhost:ddi-tcp-1 (LISTEN)
otelcol-c 2559369 monitor    4u  a_inode      0,14         0      2079 [eventpoll:3,5,6,7,8]
otelcol-c 2559369 monitor    5u  a_inode      0,14         0      2079 [eventfd:584]
otelcol-c 2559369 monitor    6u     IPv6 145586142       0t0       TCP localhost:56780->localhost:jetdirect (ESTABLISHED)
otelcol-c 2559369 monitor    7u     IPv6 108595225       0t0       TCP *:9299 (LISTEN)
otelcol-c 2559369 monitor    8u     IPv6 108595350       0t0       TCP localhost:33310->localhost:9558 (ESTABLISHED)

podman ps:

[root@mu2e-dcs-01 ~]# podman ps
CONTAINER ID  IMAGE                                                                                           COMMAND               CREATED     STATUS      PORTS       NAMES
353d29fa2f5d  quay.io/prometheuscommunity/systemd-exporter:latest                                             --systemd.collect...  3 days ago  Up 3 days               systemd-prometheus-systemd-exporter
77768f62f72e  quay.io/prometheus/node-exporter:latest                                                         --path.procfs=/ho...  3 days ago  Up 3 days               systemd-prometheus-node-exporter
2fbf0ebdf1a2  ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:latest  --config /etc/ote...  3 days ago  Up 3 days               systemd-otel-collector
atoulme commented 6 days ago

Do you use swap on your machine?

jcpunk commented 6 days ago

The system does have a linux swap partition, but I'm not showing any memory pressure or pages swapped out.

daniel-hoefer commented 4 days ago

@jcpunk According to the iotop usage help a SWAPIN of 88% means that 88% of time was spent on swapping in memory pages from disk. If not from memory pressure then what other explanation could there be for swapped out memory pages?

jcpunk commented 4 days ago

I wonder if it hit the pod memory limit and started swapping rather than oom kill? I'll add the memory_limiter to my config to see if that makes a difference. Not that I can trigger this bug on purpose...