prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

Bug: collector.arp.netlink is super slow to retrieve metrics #3075

Open · maxime1907 opened this issue 1 month ago

maxime1907 commented 1 month ago

Host operating system: output of uname -a

Linux node 6.1.96-flatcar #1 SMP PREEMPT_DYNAMIC Mon Jul  1 23:29:55 -00 2024 x86_64 AMD EPYC 9654 96-Core Processor AuthenticAMD GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 1.8.1 (branch: HEAD, revision: 400c3979931613db930ea035f39ce7b377cdbb5b)
  build user:       root@7afbff271a3f
  build date:       20240521-18:36:22
  go version:       go1.22.3
  platform:         linux/amd64
  tags:             unknown

node_exporter command line flags

            - '--path.procfs=/host/proc'
            - '--path.sysfs=/host/sys'
            - '--path.rootfs=/host/root'
            - '--path.udev.data=/host/root/run/udev/data'
            - '--web.listen-address=[$(HOST_IP)]:9100'
            - '--collector.textfile.directory=/var/lib/node_exporter'
            - '--no-collector.ipvs'
            - '--collector.netclass.netlink'

node_exporter log output

ts=2024-07-17T14:32:56.130Z caller=node_exporter.go:193 level=info msg="Starting node_exporter" version="(version=1.8.1, branch=HEAD, revision=400c3979931613db930ea035f39ce7b377cdbb5b)"
ts=2024-07-17T14:32:56.131Z caller=node_exporter.go:194 level=info msg="Build context" build_context="(go=go1.22.3, platform=linux/amd64, user=root@7afbff271a3f, date=20240521-18:36:22, tags=unknown)"
ts=2024-07-17T14:32:56.133Z caller=filesystem_common.go:111 level=info collector=filesystem msg="Parsed flag --collector.filesystem.mount-points-exclude" flag=^/(dev|proc|run/credentials/.+|sys|var/lib/docker/.+|var/lib/containers/storage/.+)($|/)
ts=2024-07-17T14:32:56.133Z caller=filesystem_common.go:113 level=info collector=filesystem msg="Parsed flag --collector.filesystem.fs-types-exclude" flag=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$
ts=2024-07-17T14:32:56.135Z caller=diskstats_common.go:111 level=info collector=diskstats msg="Parsed flag --collector.diskstats.device-exclude" flag=^(z?ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:111 level=info msg="Enabled collectors"
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=arp
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=bcache
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=bonding
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=btrfs
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=conntrack
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=cpu
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=cpufreq
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=diskstats
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=dmi
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=edac
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=entropy
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=fibrechannel
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=filefd
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=filesystem
ts=2024-07-17T14:32:56.138Z caller=node_exporter.go:118 level=info collector=hwmon
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=infiniband
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=loadavg
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=mdadm
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=meminfo
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=netclass
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=netdev
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=netstat
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=nfs
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=nfsd
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=nvme
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=os
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=powersupplyclass
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=pressure
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=rapl
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=schedstat
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=selinux
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=sockstat
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=softnet
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=stat
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=tapestats
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=textfile
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=thermal_zone
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=time
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=timex
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=udp_queues
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=uname
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=vmstat
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=watchdog
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=xfs
ts=2024-07-17T14:32:56.139Z caller=node_exporter.go:118 level=info collector=zfs
ts=2024-07-17T14:32:56.182Z caller=tls_config.go:313 level=info msg="Listening on" address=[::]:9100
ts=2024-07-17T14:32:56.182Z caller=tls_config.go:316 level=info msg="TLS is disabled." http2=false address=[::]:9100

Are you running node_exporter in Docker?

No, we use containerd directly

What did you do that produced an error?

Run node_exporter, which uses the netlink-based ARP collector (collector.arp.netlink) by default.

What did you expect to see?

Normal scrape times of around 300 ms

What did you see instead?

Scrape times of more than 20 seconds

Here is a graph from before and after disabling the ARP netlink collector with this flag:

            - '--no-collector.arp.netlink'

[graph: scrape duration before and after disabling the ARP netlink collector]
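
node_exporter also exposes its own per-collector timings, so the slow collector can be confirmed straight from the metrics endpoint (a sketch; localhost:9100 stands in for our actual --web.listen-address):

    # Per-collector scrape duration; the arp entry stands out on the affected nodes.
    curl -s http://localhost:9100/metrics | grep 'node_scrape_collector_duration_seconds{collector="arp"}'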

Here is the pprof output, captured with:

go tool pprof -svg http://localhost:9100/debug/pprof/profile > node_exporter.svg

[attachment: node_exporter.svg]

chengjoey commented 1 month ago

Since you're not running in Docker, is the --path.procfs setting correct? Is there any error in the log?

Maybe the dial itself is what costs so much time: https://github.com/prometheus/node_exporter/blob/b9d0932179a0c5b3a8863f3d6cdafe8584cedc8e/collector/arp_linux.go#L79-L83
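
A quick way to check would be to time the same two rtnetlink calls outside of node_exporter (just a sketch, assuming the github.com/jsimonetti/rtnetlink package the collector imports is available in your module):

    // arp_netlink_timing.go: time the netlink dial and the neighbor dump separately.
    // Sketch only; mirrors the calls the arp collector makes, but is not taken from it.
    package main

    import (
        "fmt"
        "log"
        "time"

        "github.com/jsimonetti/rtnetlink"
    )

    func main() {
        start := time.Now()
        conn, err := rtnetlink.Dial(nil) // same dial the collector performs
        if err != nil {
            log.Fatalf("dial: %v", err)
        }
        defer conn.Close()
        fmt.Printf("dial took %s\n", time.Since(start))

        start = time.Now()
        neighbors, err := conn.Neigh.List() // dump the full neighbor table
        if err != nil {
            log.Fatalf("neigh list: %v", err)
        }
        fmt.Printf("listed %d neighbors in %s\n", len(neighbors), time.Since(start))
    }

If the dial is the slow part, that would point away from the size of the neighbor table and toward the netlink socket setup itself.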

dswarbrick commented 3 weeks ago

The pprof SVG appears to be truncated, and only shows the collection taking 1.84s (which is admittedly still kinda slow). I can't think of a reason off the top of my head for the rtnetlink variant of the arp collector to be slow. If anything, it should be faster than parsing /proc/net/arp, especially on systems with a large number of ARP entries.
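
It would also help to know how large the neighbor table is on that host. Something like this would show it (a sketch; run it on the host, or adjust the path to your mounted /host/proc):

    # Number of neighbor entries the collector has to enumerate
    # (the first line of the proc file is a header).
    wc -l /proc/net/arp
    ip neigh show | wc -l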

Can you please attach a complete pprof, so we can see where the arpCollector is spending that ~1.8 s?
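
For example, a longer profiling window while scrapes are running should catch the whole collection (a sketch; adjust the address, duration, and output name):

    # Capture a 60-second CPU profile and render it as SVG.
    go tool pprof -svg -seconds 60 http://localhost:9100/debug/pprof/profile > node_exporter_full.svg

    # Meanwhile, trigger a few scrapes so the arp collector actually runs during the window.
    for i in 1 2 3; do curl -s -o /dev/null http://localhost:9100/metrics; sleep 5; done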

The gaps in your Grafana graphs also suggest that you're getting the occasional scrape timeout. Is anything unusual logged when that happens?