grafana / alloy

OpenTelemetry Collector distribution with programmable pipelines
https://grafana.com/oss/alloy
Apache License 2.0

Alloy memory utilization peaks by several GB for a few minutes in "node_exporter/collector.(*filesystemCollector).GetStats" #1485

Open pulchart opened 2 months ago

pulchart commented 2 months ago

What's wrong?

Hello,

I see huge memory utilization spikes (a few extra GB) from the Alloy service at random times in my environment. Sometimes it kills servers with a lower amount of RAM (~4-8 GB).

I see this pattern in container_memory_rss: [screenshot: container_memory_rss graph]

According to Pyroscope, the "issue" is in github.com/prometheus/node_exporter/collector.(*filesystemCollector).GetStats: [screenshot: memory alloc_space flame graph]
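For anyone who wants to capture a similar profile, one option is to have Alloy scrape its own pprof endpoint and forward the profiles to Pyroscope. A minimal sketch, assuming Alloy's HTTP server listens on the default 127.0.0.1:12345 and a Pyroscope instance is reachable at http://pyroscope.example:4040 (both placeholders, not taken from this issue):

pyroscope.write "profiles" {
  endpoint {
    // Placeholder Pyroscope URL; adjust for your environment.
    url = "http://pyroscope.example:4040"
  }
}

pyroscope.scrape "alloy_self" {
  // Alloy exposes /debug/pprof on its own HTTP server (default 127.0.0.1:12345).
  targets = [{
    "__address__"  = "127.0.0.1:12345",
    "service_name" = "alloy",
  }]
  forward_to = [pyroscope.write.profiles.receiver]
}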

I do not see

Steps to reproduce

Run Alloy as a systemd service with the prometheus.exporter.unix component.

System information

CentOS Stream 9 with upstream Linux kernels 6.9.y and 6.10.y.

Software version

Grafana Alloy 1.3

Configuration

prometheus.exporter.unix "node_exporter_system_15s" {
  set_collectors = [
    "btrfs",
    "conntrack",
    "cpu",
    "diskstats",
    "filesystem",
    "loadavg",
    "meminfo",
    "netclass",
    "netdev",
    "nfs",
    "uname",
    "pressure",
    "processes",
    "stat",
    "os",
    "vmstat",
  ]
  include_exporter_metrics = false
  disk {
    device_include = "^((h|s|v|xv)d[a-z]+|nvme\\d+n\\d+)$"
  }
  netclass {
    ignored_devices = "^(cali\\S+|tap\\S+)$"
  }
  netdev {
    device_include = "^(lo|eth\\d+|en\\S+|bond\\d+(|\\.\\S+)|em\\d+|p\\d+p\\d+|br\\S+|k8s\\S+|vxlan\\S+)$"
  }
  filesystem {
    fs_types_exclude = "^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs|nfs[0-9]*|tmpfs)$"
  }
}

prometheus.scrape "node_exporter_system" {
  forward_to = [prometheus.remote_write.mimir.receiver]
  targets = prometheus.exporter.unix.node_exporter_system_15s.targets
  scrape_interval = "15s"
}

Logs

n/a
pulchart commented 2 months ago

The node_filesystem_* metrics were not collected during the problematic period. Could a mount point have stalled?

I see a fix in node_exporter: https://github.com/prometheus/node_exporter/pull/3063

"fix filesystem mountTimeout not working"

Could it be related? It looks like the mount_timeout option does not work.
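If mount_timeout really is ineffective, one possible mitigation until an upstream fix lands is to set it explicitly and exclude the suspect mount points from the filesystem collector. A minimal sketch of a dedicated component (the "10s" value and the /mnt/slow-nfs path are illustrative placeholders, not taken from this issue):

prometheus.exporter.unix "filesystem_only" {
  set_collectors           = ["filesystem"]
  include_exporter_metrics = false
  filesystem {
    // mount_timeout bounds how long a stat() on a mount point may block;
    // the default is 5s, 10s here is only an example value.
    mount_timeout        = "10s"
    // Hypothetical exclusion of a mount point suspected of stalling.
    mount_points_exclude = "^/mnt/slow-nfs($|/)"
  }
}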

pulchart commented 2 months ago

I was able to find a configuration which helps with (or works around?) the memory utilization peaks.

I moved the filesystem collector into its own component and scrape its metrics less often (15s -> 60s):

prometheus.exporter.unix "node_exporter_system_15s" {
  set_collectors = [
    "btrfs",
    "conntrack",
    "cpu",
    "diskstats",
    "loadavg",
    "meminfo",
    "netclass",
    "netdev",
    "nfs",
    "pressure",
    "processes",
    "stat",
    "vmstat",
  ]
  include_exporter_metrics = false
  disk {
    device_include = "^((h|s|v|xv)d[a-z]+|nvme\\d+n\\d+)$"
  }
  netclass {
    ignored_devices = "^(cali\\S+|tap\\S+)$"
  }
  netdev {
    device_include = "^(lo|eth\\d+|en\\S+|bond\\d+(|\\.\\S+)|em\\d+|p\\d+p\\d+|br\\S+|k8s\\S+|vxlan\\S+)$"
  }
}

prometheus.exporter.unix "node_exporter_system_60s" {
  set_collectors = [
    "filesystem",
    "uname",
    "os",
  ]
  include_exporter_metrics = false
  filesystem {
    fs_types_exclude = "^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs|nfs[0-9]*|tmpfs)$"
  }
}

prometheus.scrape "node_exporter_system_15s" {
  forward_to = [prometheus.remote_write.mimir.receiver]
  targets = prometheus.exporter.unix.node_exporter_system_15s.targets
  scrape_interval = "15s"
}

prometheus.scrape "node_exporter_system_60s" {
  forward_to = [prometheus.remote_write.mimir.receiver]
  targets = prometheus.exporter.unix.node_exporter_system_60s.targets
  scrape_interval = "60s"
}
github-actions[bot] commented 4 weeks ago

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it. If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue. The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity. Thank you for your contributions!