prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0
11.3k stars 2.38k forks source link

Proposal: Extend network bonding metrics #1604

Open hoffie opened 4 years ago

hoffie commented 4 years ago

node_exporter currently exposes details about network bonding, which is great. To be able to monitor more failure cases, we would need additional metrics which we haven't found in node_exporter yet:

Essentially, we would need the following metrics:

# HELP node_bonding_slave_info Contains association between bonding master device and associated slaves. Value is always 1.
# TYPE node_bonding_slave_info GAUGE
## from /sys/class/net/<master>/bonding/slaves
# - alert: LinuxPlatformBondMixedPortSpeeds
#   expr: min(node_bonding_slave_info * on(instance, device) group_left() node_network_speed_bytes) by (instance, master) != max(node_bonding_slave_info * on(instance, device) group_left() node_network_speed_bytes) by (instance, master) and on(instance, master) node_bonding_active == node_bonding_slaves
node_bonding_slave_info{master="bond0",device="eno1"} 1
node_bonding_slave_info{master="bond0",device="eno2"} 1

# HELP node_bonding_ad_num_ports Number of visible LACP/AD ports
# TYPE node_bonding_ad_num_ports GAUGE'
## from /sys/devices/virtual/net/<master>/bonding/ad_num_ports
## - alert: BondLacpPortsDegraded
##   expr: node_bonding_ad_num_ports != node_bonding_slaves
node_bonding_ad_num_ports{master="bond0"} 2

It could also make sense to add some more information in the same go. So far, we haven't required these in our alerting, but they may be useful nevertheless:

# HELP node_bonding_mode Kind of bonding (e.g. 1 for active-backup, 4 for lacp/ad)
# TYPE node_bonding_mode GAUGE
## from /sys/class/net/<master>/bonding/mode
## might be useful to have different monitoring for active-backup, lacp, etc.
node_bonding_mode{master="bond0"} 4

# HELP node_bonding_ad_info Contains the LACP/AD partner_mac. Value is always 1.
# TYPE node_bonding_ad_info GAUGE
## from /sys/devices/virtual/net/<master>/bonding/ad_partner_mac)
## might be useful for detecting bad lacp configs (zero MAC) or building stats about switch MACs
node_bonding_ad_info{master="bond0",partner_mac="00:00:00:00:00"} 1

# HELP node_bonding_ad_port_partner_oper_port_state Bitfield denoting LACP/AD port state for the partner
# TYPE node_bonding_ad_port_partner_oper_port_state GAUGE
## from /sys/class/net/bond0/slave_eno1/bonding_slave/ad_partner_oper_port_state
node_bonding_ad_port_partner_oper_port_state{master="bond0",device="eno1"} 61
node_bonding_ad_port_partner_oper_port_state{master="bond0",device="eno2"} 61

# HELP node_bonding_ad_port_actor_oper_port_state Bitfield denoting LACP/AD port state for the actor
# TYPE node_bonding_ad_port_actor_oper_port_state GAUGE
node_bonding_ad_port_actor_oper_port_state{master="bond0",device="eno1"} 61
node_bonding_ad_port_actor_oper_port_state{master="bond0",device="eno2"} 61

We currently use a shell script and node_exporter's textfile collector to fill this gap. However, I think it would be useful to support these metrics out-of-the box, especially since only /sys files need to be read.

I'd volunteer to work on PRs against procfs and node_exporter. I would suggest adding this to the existing bonding_linux.go as it is closely related. Looks like this would also imply converting the existing bonding collector to procfs.

Related: #841

@SuperQ @discordianfish @pgier What do you think? Does it make sense in general? Should we only implement the first two metrics or the more generic approach? Do the suggested names make sense?

SuperQ commented 4 years ago

Most of that makes sense to me. The only thing that I would probably leave out is the node_bonding_ad_info. The mac address stuff might cause cardinality issues.

One other minor labeling suggestion would be to call it master_device to avoid ambiguity.

It might also be useful to make node_bonding_mode an enum type

node_bonding_mode{master_device="bond0",mode="balance-rr"} 0
node_bonding_mode{master_device="bond0",mode="active-backup"} 1
node_bonding_mode{master_device="bond0",mode="802.3ad"} 0
AlexZzz commented 4 years ago

We're using bonding too, it would be great to add Aggregator ID for every link as a gauge(/sys/class/net/bond0/lower_eth3/bonding_slave/ad_aggregator_id): node_bonding_aggregator_id{master_device="bond0", device="eth3"} 1 Different aggregator ID's might be, for example, if switch is misconfigured and has only one link in aggregate. And sometimes it happens due to software failures. Might also be good to add active aggregator ID (/sys/class/net/bond2/bonding/ad_aggregator): node_bonding_active_aggregator_id{master_device="bond0"} 1

nacc commented 4 years ago

Hi, we also use bonding heavily in our infrastructure and I would be very interested in exposing some of these metrics as well. What is the status of this work?

discordianfish commented 4 years ago

I don't think anything was done in that regard. Feel free to submit a PR. But please note that procfs interactions, if there needs to be added/change anything, should go into : https://github.com/prometheus/procfs

hoffie commented 4 years ago

Hi, we also use bonding heavily in our infrastructure and I would be very interested in exposing some of these metrics as well. What is the status of this work?

I'm very sorry, but I still didn't get around to finishing the work on this. I started moving the existing parsing into procfs. I'm sharing my work-in-progress here, but it's far from complete, needs rebasing unto more recent changes and the procfs changes are hacked into vendor/ instead of making them in the appropriate project: https://github.com/hoffie/node_exporter/tree/bonding

If anyone wants to pick this up, feel free to (maybe leave a short comment here). I have still interest in this, but cannot promise when I'll be able to finish it.

frenkye commented 2 years ago

Came across this issue on our side today, would be nice if the info about aggregator ids could be implemented.

The iface with id 3 is out of bond aggregation.

$ cat /sys/class/net/bond0/bonding/slaves
enp96s0f0 enp216s0f0
$ cat /sys/class/net/bond0/bonding/ad_aggregator
2
$ cat /sys/class/net/enp96s0f0/bonding_slave/ad_aggregator_id
2
$ cat /sys/class/net/enp216s0f0/bonding_slave/ad_aggregator_id
3
bewing commented 2 years ago

I came across this today looking into monitoring switch-side misconfigurations (LACP bond with no active members on the host side). I think I have some time next week to look at extending the procfs module to collect these statistics and the bonding collector to export them.

rexagod commented 6 months ago

@bewing Are you working on this? If not, I can take this up.

bewing commented 6 months ago

I started on this a year ago, and got as far as opening some issues in related projects, identifying the need to flesh out test fixtures in https://github.com/prometheus/node_exporter/pull/2347 so as to improve procps via https://github.com/prometheus/procfs/pull/439 to make the metrics available.

The working directories are lost to the ethos, and surely the ground under them has moved. It's back to a fresh start at this point, but maybe the actual code changes in the procps pull are still useful. I have not had time to work on this, and encourage others to make the attempt if they can.