pulchart opened this issue 2 months ago
The node_filesystem metrics were not collected during the problematic period. Could the mount point have stalled?
I see a bug fix in node_exporter: https://github.com/prometheus/node_exporter/pull/3063 ("fix filesystem mountTimeout not working"). Could it be related? It looks like the mount_timeout option does not work.
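If a stalled mount is the trigger, bounding the stat() call is the usual mitigation. A minimal sketch of what that could look like in Alloy, assuming the filesystem block exposes a mount_timeout attribute and that the fix from the linked PR is present in the bundled node_exporter; the component label and timeout value are illustrative:

prometheus.exporter.unix "fs_only" {
  set_collectors = ["filesystem"]

  filesystem {
    // Give up on a mount point that does not answer stat() within this window
    // instead of letting the collector block on it. Illustrative value.
    mount_timeout = "5s"
  }
}

If the linked bug means this timeout is not honored, a hung NFS or FUSE mount would keep the collector goroutine and its buffers alive, which could match the memory growth described in the issue.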
I was able to find a configuration that helps with (or works around?) the memory utilization peaks: I moved the filesystem collector out of the main exporter and scrape its metrics less often (15s -> 60s):
prometheus.exporter.unix "node_exporter_system_15s" {
set_collectors = [
"btrfs",
"conntrack",
"cpu",
"diskstats",
"loadavg",
"meminfo",
"netclass",
"netdev",
"nfs",
"pressure",
"processes",
"stat",
"vmstat",
]
include_exporter_metrics = false
disk {
device_include = "^((h|s|v|xv)d[a-z]+|nvme\\d+n\\d+)$"
}
netclass {
ignored_devices = "^(cali\\S+|tap\\S+)$"
}
netdev {
device_include = "^(lo|eth\\d+|en\\S+|bond\\d+(|\\.\\S+)|em\\d+|p\\d+p\\d+|br\\S+|k8s\\S+|vxlan\\S+)$"
}
}
prometheus.exporter.unix "node_exporter_system_60s" {
set_collectors = [
"filesystem",
"uname",
"os",
]
include_exporter_metrics = false
filesystem {
fs_types_exclude = "^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs|nfs[0-9]*|tmpfs)$"
}
}
prometheus.scrape "node_exporter_system_15s" {
forward_to = [prometheus.remote_write.mimir.receiver]
targets = prometheus.exporter.unix.node_exporter_system_15s.targets
scrape_interval = "15s"
}
prometheus.scrape "node_exporter_system_60s" {
forward_to = [prometheus.remote_write.mimir.receiver]
targets = prometheus.exporter.unix.node_exporter_system_60s.targets
scrape_interval = "60s"
}
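For completeness: the snippet forwards to prometheus.remote_write.mimir.receiver, which is not shown here. It would be declared elsewhere in the configuration along these lines (the component label matches the references above; the endpoint URL and any credentials are placeholders):

prometheus.remote_write "mimir" {
  endpoint {
    // Placeholder; point this at your Mimir or Prometheus-compatible push endpoint.
    url = "https://mimir.example.com/api/v1/push"
  }
}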
This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!
What's wrong?
Hello,
I see huge memory utilization spikes (a few extra GB) in the Alloy service at random times in my environment. Sometimes it kills servers with a lower amount of RAM (~4-8 GB).
I see this pattern in container_memory_rss:
According to Pyroscope, the "issue" is in github.com/prometheus/node_exporter/collector.(*filesystemCollector).GetStats.
I do not see
Steps to reproduce
Run Alloy as a systemd service with the prometheus.exporter.unix component.
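A minimal configuration exercising the component might look like the sketch below; this is an illustrative setup for reproduction, not the reporter's actual configuration (component labels and the push URL are placeholders):

prometheus.exporter.unix "node" { }

prometheus.scrape "node" {
  targets         = prometheus.exporter.unix.node.targets
  forward_to      = [prometheus.remote_write.default.receiver]
  scrape_interval = "15s"
}

prometheus.remote_write "default" {
  endpoint {
    // Placeholder push endpoint.
    url = "http://localhost:9009/api/v1/push"
  }
}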
System information
CentOS 9 Stream with upstream Linux kernel 6.9.y, 6.10.y
Software version
Grafana Alloy 1.3
Configuration
Logs