prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0
ethtool: node_ethtool_received_bytes_nic / node_ethtool_transmitted_bytes_nic cause errors about wrong help being logged constantly #2893

Closed frittentheke closed 7 months ago

frittentheke commented 10 months ago

Host operating system: output of uname -a

Linux machinename 6.2.0-39-generic #40~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 16 10:53:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

# node_exporter --version

node_exporter, version 1.7.0 (branch: HEAD, revision: 7333465abf9efba81876303bb57e6fadb946041b)
  build user:       root@35918982f6d8
  build date:       20231112-23:53:35
  go version:       go1.21.4
  platform:         linux/amd64
  tags:             netgo osusergo static_build

node_exporter command line flags

node_exporter \
    --collector.ethtool \
    --collector.ethtool.device-exclude="^(brq|tap|veth|vxlan|virbr|usb).*$" \
    --collector.filesystem \
    --collector.filesystem.fs-types-exclude="^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|ramfs|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$" \
    --collector.ksmd \
    --collector.meminfo_numa \
    --collector.network_route \
    --collector.nvme \
    --collector.netdev \
    --collector.netdev.device-exclude="^(brq|tap|veth|vxlan|virbr|usb).*$" \
    --collector.netclass \
    --collector.netclass.ignored-devices="^(brq|tap|veth|vxlan|virbr|usb).*$" \
    --collector.processes \
    --collector.slabinfo \
    --collector.systemd \
    --collector.systemd.enable-restarts-metrics \
    --collector.systemd.unit-exclude=".+\.(device|automount|mount|path|scope|slice|socket|target)|user.*@.*\.service|.*@tty.*\.service|getty.*\.service|ubuntu-advantage.service|systemd-fsck@.*\.service|ifup@.*\.service|modprobe@.*\.service|motd-news.service|postfix@.*\.service|apt.*\.service|man-db.service|grub-.*\.service|dpkg.*\.service|blk-availability.service|acpid.service|dm-event.service" \
    --collector.textfile \
    --collector.textfile.directory="/var/lib/node_exporter/textfile_collector" \
    --collector.zoneinfo \
    --web.listen-address=0.0.0.0:9100 \
    --web.config.file=/etc/node-exporter/web-config.yml

node_exporter log output

Jan 05 12:09:11 machinename node_exporter[3776]: ts=2024-01-05T12:09:11.160Z caller=stdlib.go:105 level=error msg="error gathering metrics: 4 error(s) occurred:\n* [from Gatherer #2] collected metric node_ethtool_received_bytes_nic label:{name:\"device\"  value:\"eno1\"}  untyped:{value:0} has help \"Network interface rx_bytes_nic\" but should have \"Network interface rx_bytes.nic\"\n* [from Gatherer #2] collected metric node_ethtool_transmitted_bytes_nic label:{name:\"device\"  value:\"eno1\"}  untyped:{value:0} has help \"Network interface tx_bytes_nic\" but should have \"Network interface tx_bytes.nic\"\n* [from Gatherer #2] collected metric node_ethtool_received_bytes_nic label:{name:\"device\"  value:\"eno2\"}  untyped:{value:0} has help \"Network interface rx_bytes_nic\" but should have \"Network interface rx_bytes.nic\"\n* [from Gatherer #2] collected metric node_ethtool_transmitted_bytes_nic label:{name:\"device\"  value:\"eno2\"}  untyped:{value:0} has help \"Network interface tx_bytes_nic\" but should have \"Network interface tx_bytes.nic\""
Jan 05 12:09:18 machinename node_exporter[3776]: ts=2024-01-05T12:09:18.662Z caller=stdlib.go:105 level=error msg="error gathering metrics: 4 error(s) occurred:\n* [from Gatherer #2] collected metric node_ethtool_received_bytes_nic label:{name:\"device\"  value:\"ens2f0np0\"}  untyped:{value:1.6361091e+07} has help \"Network interface rx_bytes.nic\" but should have \"Network interface rx_bytes_nic\"\n* [from Gatherer #2] collected metric node_ethtool_transmitted_bytes_nic label:{name:\"device\"  value:\"ens2f0np0\"}  untyped:{value:13381} has help \"Network interface tx_bytes.nic\" but should have \"Network interface tx_bytes_nic\"\n* [from Gatherer #2] collected metric node_ethtool_received_bytes_nic label:{name:\"device\"  value:\"ens2f1np1\"}  untyped:{value:2.06521801e+08} has help \"Network interface rx_bytes.nic\" but should have \"Network interface rx_bytes_nic\"\n* [from Gatherer #2] collected metric node_ethtool_transmitted_bytes_nic label:{name:\"device\"  value:\"ens2f1np1\"}  untyped:{value:3.80138134e+08} has help \"Network interface tx_bytes.nic\" but should have \"Network interface tx_bytes_nic\""

Are you running node_exporter in Docker?

no

What did you do that produced an error?

I simply ran the exporter as documented above

discordianfish commented 9 months ago

Uhm no idea how this could happen. @SuperQ any ideas?

SuperQ commented 9 months ago

It's very possible that weird data is coming back from the ethtool syscall. The data is vendor driver dependent, so you never know what they're going to spit out.

We would need a debug dump to see what's going wrong.

frittentheke commented 9 months ago

@SuperQ @discordianfish just let me know what exactly you need / I should gather. This issue reminds me of the counter spikes (https://github.com/prometheus/node_exporter/issues/1849, see the last few comments there). I observe both issues on machines running Intel E810 NICs.

Maybe there actually is something fishy with their ice kernel module? I already am in contact with an Intel dev. Any more info might help them find issues as well.

SuperQ commented 9 months ago

Can you run a debugging exporter and scrape it manually with curl to see what it does?

Something like this:

node_exporter \
    --log.level=debug \
    --collector.ethtool \
    --collector.ethtool.device-exclude="^(brq|tap|veth|vxlan|virbr|usb).*$" \
    --collector.disable-defaults \
    --web.listen-address=127.0.0.1:9101

frittentheke commented 9 months ago

@SuperQ @discordianfish this is what I got ...

In the meantime I updated the machine to Linux kernel 6.5.0 (Ubuntu HWE), but the issue is still there:

Linux mymachine 6.5.0-14-generic #14~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Nov 20 18:15:30 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Node-Exporter 1.7.0 output:

# ./node_exporter     --log.level=debug     --collector.ethtool     --collector.ethtool.device-exclude="^(brq|tap|veth|vxlan|virbr|usb).*$"     --collector.disable-defaults     --web.listen-address=127.0.0.1:9101
ts=2024-01-30T12:29:50.483Z caller=node_exporter.go:192 level=info msg="Starting node_exporter" version="(version=1.7.0, branch=HEAD, revision=7333465abf9efba81876303bb57e6fadb946041b)"
ts=2024-01-30T12:29:50.483Z caller=node_exporter.go:193 level=info msg="Build context" build_context="(go=go1.21.4, platform=linux/amd64, user=root@35918982f6d8, date=20231112-23:53:35, tags=netgo osusergo static_build)"
ts=2024-01-30T12:29:50.483Z caller=node_exporter.go:195 level=warn msg="Node Exporter is running as root user. This exporter is designed to run as unprivileged user, root is not required."
ts=2024-01-30T12:29:50.483Z caller=node_exporter.go:198 level=debug msg="Go MAXPROCS" procs=1
ts=2024-01-30T12:29:50.484Z caller=node_exporter.go:110 level=info msg="Enabled collectors"
ts=2024-01-30T12:29:50.484Z caller=node_exporter.go:117 level=info collector=ethtool
ts=2024-01-30T12:29:50.484Z caller=tls_config.go:274 level=info msg="Listening on" address=127.0.0.1:9101
ts=2024-01-30T12:29:50.484Z caller=tls_config.go:277 level=info msg="TLS is disabled." http2=false address=127.0.0.1:9101
ts=2024-01-30T12:29:51.921Z caller=node_exporter.go:78 level=debug msg="collect query:" filters="unsupported value type"
ts=2024-01-30T12:29:51.959Z caller=ethtool_linux.go:398 level=debug collector=ethtool msg="ethtool link info error" err="operation not supported" device=lo errno=95
ts=2024-01-30T12:29:51.959Z caller=ethtool_linux.go:415 level=debug collector=ethtool msg="ethtool driver info error" err="operation not supported" device=lo errno=95
ts=2024-01-30T12:29:51.959Z caller=ethtool_linux.go:431 level=debug collector=ethtool msg="ethtool stats error" err="operation not supported" device=lo errno=95
ts=2024-01-30T12:29:51.959Z caller=ethtool_linux.go:431 level=debug collector=ethtool msg="ethtool stats error" err="operation not supported" device=bond0 errno=95
ts=2024-01-30T12:29:51.972Z caller=ethtool_linux.go:431 level=debug collector=ethtool msg="ethtool stats error" err="operation not supported" device=vlan112 errno=95
ts=2024-01-30T12:29:51.972Z caller=ethtool_linux.go:431 level=debug collector=ethtool msg="ethtool stats error" err="operation not supported" device=vlan113 errno=95
ts=2024-01-30T12:29:51.972Z caller=ethtool_linux.go:431 level=debug collector=ethtool msg="ethtool stats error" err="operation not supported" device=cni-podman0 errno=95
ts=2024-01-30T12:29:51.973Z caller=collector.go:173 level=debug msg="collector succeeded" name=ethtool duration_seconds=0.049086687
ts=2024-01-30T12:29:51.975Z caller=stdlib.go:105 level=error msg="error gathering metrics: 4 error(s) occurred:\n* [from Gatherer #2] collected metric node_ethtool_received_bytes_nic label:{name:\"device\"  value:\"eno1\"}  untyped:{value:0} has help \"Network interface rx_bytes_nic\" but should have \"Network interface rx_bytes.nic\"\n* [from Gatherer #2] collected metric node_ethtool_transmitted_bytes_nic label:{name:\"device\"  value:\"eno1\"}  untyped:{value:0} has help \"Network interface tx_bytes_nic\" but should have \"Network interface tx_bytes.nic\"\n* [from Gatherer #2] collected metric node_ethtool_received_bytes_nic label:{name:\"device\"  value:\"eno2\"}  untyped:{value:0} has help \"Network interface rx_bytes_nic\" but should have \"Network interface rx_bytes.nic\"\n* [from Gatherer #2] collected metric node_ethtool_transmitted_bytes_nic label:{name:\"device\"  value:\"eno2\"}  untyped:{value:0} has help \"Network interface tx_bytes_nic\" but should have \"Network interface tx_bytes.nic\""
ts=2024-01-30T12:30:07.106Z caller=node_exporter.go:78 level=debug msg="collect query:" filters="unsupported value type"
ts=2024-01-30T12:30:07.131Z caller=ethtool_linux.go:431 level=debug collector=ethtool msg="ethtool stats error" err="operation not supported" device=vlan113 errno=95
ts=2024-01-30T12:30:07.132Z caller=ethtool_linux.go:431 level=debug collector=ethtool msg="ethtool stats error" err="operation not supported" device=bond0 errno=95
ts=2024-01-30T12:30:07.147Z caller=ethtool_linux.go:398 level=debug collector=ethtool msg="ethtool link info error" err="operation not supported" device=lo errno=95
ts=2024-01-30T12:30:07.147Z caller=ethtool_linux.go:415 level=debug collector=ethtool msg="ethtool driver info error" err="operation not supported" device=lo errno=95
ts=2024-01-30T12:30:07.147Z caller=ethtool_linux.go:431 level=debug collector=ethtool msg="ethtool stats error" err="operation not supported" device=lo errno=95
ts=2024-01-30T12:30:07.147Z caller=ethtool_linux.go:431 level=debug collector=ethtool msg="ethtool stats error" err="operation not supported" device=vlan112 errno=95
ts=2024-01-30T12:30:07.147Z caller=ethtool_linux.go:431 level=debug collector=ethtool msg="ethtool stats error" err="operation not supported" device=cni-podman0 errno=95
ts=2024-01-30T12:30:07.173Z caller=collector.go:173 level=debug msg="collector succeeded" name=ethtool duration_seconds=0.063994981
ts=2024-01-30T12:30:07.175Z caller=stdlib.go:105 level=error msg="error gathering metrics: 4 error(s) occurred:\n* [from Gatherer #2] collected metric node_ethtool_received_bytes_nic label:{name:\"device\"  value:\"ens2f1np1\"}  untyped:{value:3.353744201736e+12} has help \"Network interface rx_bytes.nic\" but should have \"Network interface rx_bytes_nic\"\n* [from Gatherer #2] collected metric node_ethtool_transmitted_bytes_nic label:{name:\"device\"  value:\"ens2f1np1\"}  untyped:{value:3.343486696742e+12} has help \"Network interface tx_bytes.nic\" but should have \"Network interface tx_bytes_nic\"\n* [from Gatherer #2] collected metric node_ethtool_received_bytes_nic label:{name:\"device\"  value:\"ens2f0np0\"}  untyped:{value:1.4361789929e+10} has help \"Network interface rx_bytes.nic\" but should have \"Network interface rx_bytes_nic\"\n* [from Gatherer #2] collected metric node_ethtool_transmitted_bytes_nic label:{name:\"device\"  value:\"ens2f0np0\"}  untyped:{value:12300} has help \"Network interface tx_bytes.nic\" but should have \"Network interface tx_bytes_nic\""

Here are the metrics curl receives: node-exporter_ethtool_metrics.txt

This is the list of interfaces:

# ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eno1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether ac:1f:6b:bb:75:c0 brd ff:ff:ff:ff:ff:ff
    altname enp1s0f0
    altname ens14f0
3: eno2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether ac:1f:6b:bb:75:c1 brd ff:ff:ff:ff:ff:ff
    altname enp1s0f1
    altname ens14f1
4: ens2f0np0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether d2:d4:ae:c0:bf:f7 brd ff:ff:ff:ff:ff:ff permaddr 50:7c:6f:55:a8:62
    altname enp2s0f0np0
5: ens2f1np1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP mode DEFAULT group default qlen 1000
    link/ether d2:d4:ae:c0:bf:f7 brd ff:ff:ff:ff:ff:ff permaddr 50:7c:6f:55:a8:63
    altname enp2s0f1np1
6: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether d2:d4:ae:c0:bf:f7 brd ff:ff:ff:ff:ff:ff
7: vlan112@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether d2:d4:ae:c0:bf:f7 brd ff:ff:ff:ff:ff:ff
8: vlan113@bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether d2:d4:ae:c0:bf:f7 brd ff:ff:ff:ff:ff:ff
9: cni-podman0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 66:62:8d:36:ba:b5 brd ff:ff:ff:ff:ff:ff
10: veth69dd5bb5@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master cni-podman0 state UP mode DEFAULT group default 
    link/ether 3e:b8:e3:f0:1c:a2 brd ff:ff:ff:ff:ff:ff link-netns cni-13c89489-546a-d189-f509-8de01280bb25

Here please also find lspci and hwinfo (of the network cards):

If there is anything else I could provide, please let me know.

frittentheke commented 9 months ago

@SuperQ @discordianfish may I nag you once again about this issue?

Is there any more info I could provide?

discordianfish commented 9 months ago

I'm really clueless right now.

Here is the more human readable error:

collected metric node_ethtool_received_bytes_nic label:{name:"device" value:"eno1"} untyped:{value:0} has help "Network interface rx_bytes_nic" but should have "Network interface rx_bytes.nic"

...and the metric in question:

# HELP node_ethtool_received_bytes_nic Network interface rx_bytes.nic
# TYPE node_ethtool_received_bytes_nic untyped
node_ethtool_received_bytes_nic{device="ens2f0np0"} 1.4365457353e+10
node_ethtool_received_bytes_nic{device="ens2f1np1"} 3.372554746075e+12

@SuperQ: I assume the help string is somehow set to Network interface rx_bytes_nic on an earlier collection and then changes to Network interface rx_bytes.nic? Or is there anything else that could cause this error?
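
To make the error message easier to reason about, here is a minimal Go sketch of a help-consistency check with the same shape as the logged error. The `sample` type and the `checkHelp` function are hypothetical and are not the client library's actual implementation. Pairing the two spellings with different devices follows one reading of the log, which the thread does not confirm: whichever help text is seen first for a metric name becomes the expected one, so the "has ... but should have ..." direction flips with collection order.

```go
package main

import "fmt"

// sample is a hypothetical, simplified stand-in for one collected metric.
type sample struct {
	name, help, device string
}

// checkHelp reports every sample whose help text differs from the help text
// first seen for the same metric name. It mirrors the shape of the logged
// error only; it is not how the real registry is implemented.
func checkHelp(samples []sample) []string {
	firstHelp := make(map[string]string)
	var errs []string
	for _, s := range samples {
		want, seen := firstHelp[s.name]
		if !seen {
			// The first help text seen for a name becomes the expected one.
			firstHelp[s.name] = s.help
			continue
		}
		if s.help != want {
			errs = append(errs, fmt.Sprintf(
				"%s{device=%q} has help %q but should have %q",
				s.name, s.device, s.help, want))
		}
	}
	return errs
}

func main() {
	// Same metric name, two different help strings (taken from the log).
	samples := []sample{
		{"node_ethtool_received_bytes_nic", "Network interface rx_bytes.nic", "ens2f0np0"},
		{"node_ethtool_received_bytes_nic", "Network interface rx_bytes_nic", "eno1"},
	}
	for _, e := range checkHelp(samples) {
		fmt.Println(e)
	}
}
```

Swapping the two entries in `samples` reverses which device is reported, which matches the way the two log lines above blame different interfaces on different scrapes.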

frittentheke commented 8 months ago

@discordianfish @SuperQ can I help at all with this issue?

There is (was actually) another issue with the Intel E8xx (ice) driver causing spikes (https://github.com/prometheus/node_exporter/issues/1849), which I communicated to Intel for them to fix, which they now did: https://github.com/prometheus/node_exporter/issues/1849#issuecomment-1968647830

If this issue here is also related to data received from the driver / kernel modules, I'd love to have Intel take a look as well.

SuperQ commented 8 months ago

The output of ethtool -S <device> would be useful.

frittentheke commented 8 months ago

The output of ethtool -S <device> would be useful.

Certainly @SuperQ ... here you go:

ethtool_S__ens2f0np0.txt ethtool_S__ens2f1np1.txt

SuperQ commented 8 months ago

Very strange, it doesn't differ in those files.

I think the only thing we can try to do is apply the same sanitizer rules to the help text.
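
A rough Go sketch of that idea, assuming a simple "replace every disallowed character with an underscore" rule: the help text is built from the sanitized stat name instead of the raw one. The `sanitize` and `describe` helpers and the exact character set are hypothetical and are not node_exporter's real code. The rx to received renaming visible in the log is left out for brevity.

```go
package main

import (
	"fmt"
	"regexp"
)

// invalidChars matches anything this sketch does not allow in a metric name.
// The exact rule set is an assumption, not node_exporter's own.
var invalidChars = regexp.MustCompile(`[^a-zA-Z0-9_]`)

// sanitize is a hypothetical helper that replaces every disallowed
// character with an underscore.
func sanitize(stat string) string {
	return invalidChars.ReplaceAllString(stat, "_")
}

// describe returns the metric name and help text for one ethtool stat.
// Both are derived from the sanitized stat, so two spellings that collapse
// to the same metric name also end up with the same help text.
func describe(stat string) (name, help string) {
	clean := sanitize(stat)
	return "node_ethtool_" + clean, "Network interface " + clean
}

func main() {
	for _, stat := range []string{"rx_bytes.nic", "rx_bytes_nic"} {
		name, help := describe(stat)
		fmt.Printf("%s: %s\n", name, help)
	}
}
```

With this approach both spellings yield identical name and help pairs, so a consistency check on help text has nothing to complain about. The trade-off is that the help text no longer shows the driver's original spelling.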