hoffie opened this issue 4 years ago
Most of that makes sense to me. The only thing that I would probably leave out is node_bonding_ad_info: the MAC address stuff might cause cardinality issues.
One other minor labeling suggestion would be to call it master_device
to avoid ambiguity.
It might also be useful to make node_bonding_mode an enum type:
node_bonding_mode{master_device="bond0",mode="balance-rr"} 0
node_bonding_mode{master_device="bond0",mode="active-backup"} 1
node_bonding_mode{master_device="bond0",mode="802.3ad"} 0
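As a minimal sketch of the enum-metric pattern suggested above: one series per known bonding mode, with the active mode set to 1 and all others to 0. The function name `enumValues` and the mode list are illustrative, not part of any existing node_exporter code (the mode names mirror those reported by the Linux bonding driver).

```go
package main

import "fmt"

// bondingModes lists the mode names used by the Linux bonding driver,
// as reported in /sys/class/net/<bond>/bonding/mode.
var bondingModes = []string{
	"balance-rr", "active-backup", "balance-xor",
	"broadcast", "802.3ad", "balance-tlb", "balance-alb",
}

// enumValues returns one value per known mode: 1 for the active
// mode, 0 for all others, matching the enum-metric pattern above.
func enumValues(active string) map[string]int {
	vals := make(map[string]int, len(bondingModes))
	for _, m := range bondingModes {
		if m == active {
			vals[m] = 1
		} else {
			vals[m] = 0
		}
	}
	return vals
}

func main() {
	// Render the series in exposition-format style for bond0.
	for _, m := range bondingModes {
		fmt.Printf("node_bonding_mode{master_device=%q,mode=%q} %d\n",
			"bond0", m, enumValues("active-backup")[m])
	}
}
```

An exporter would emit all of these series so that alerting rules can match on the mode label regardless of which mode is currently active.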
We're using bonding too; it would be great to add the aggregator ID for every link as a gauge (/sys/class/net/bond0/lower_eth3/bonding_slave/ad_aggregator_id):
node_bonding_aggregator_id{master_device="bond0", device="eth3"} 1
Differing aggregator IDs can occur, for example, if the switch is misconfigured and has only one link in the aggregate; sometimes it also happens due to software failures.
It might also be good to add the active aggregator ID (/sys/class/net/bond2/bonding/ad_aggregator):
node_bonding_active_aggregator_id{master_device="bond0"} 1
Hi, we also use bonding heavily in our infrastructure and I would be very interested in exposing some of these metrics as well. What is the status of this work?
I don't think anything was done in that regard. Feel free to submit a PR. But please note that procfs interactions, if anything needs to be added or changed there, should go into https://github.com/prometheus/procfs
> Hi, we also use bonding heavily in our infrastructure and I would be very interested in exposing some of these metrics as well. What is the status of this work?
I'm very sorry, but I still haven't gotten around to finishing the work on this. I started moving the existing parsing into procfs. I'm sharing my work-in-progress here, but it's far from complete, needs rebasing onto more recent changes, and the procfs changes are hacked into vendor/ instead of being made in the appropriate project: https://github.com/hoffie/node_exporter/tree/bonding
If anyone wants to pick this up, feel free to (maybe leave a short comment here). I'm still interested in this, but cannot promise when I'll be able to finish it.
Came across this issue on our side today; it would be nice if the info about aggregator IDs could be implemented.
The interface with aggregator ID 3 is out of the bond aggregation:
$ cat /sys/class/net/bond0/bonding/slaves
enp96s0f0 enp216s0f0
$ cat /sys/class/net/bond0/bonding/ad_aggregator
2
$ cat /sys/class/net/enp96s0f0/bonding_slave/ad_aggregator_id
2
$ cat /sys/class/net/enp216s0f0/bonding_slave/ad_aggregator_id
3
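The failure mode shown above (a slave whose ad_aggregator_id differs from the bond's active aggregator) is exactly what the proposed metrics would let you alert on. A minimal sketch of the check, with an illustrative function name and the values mirroring the sysfs readings above:

```go
package main

import "fmt"

// slavesOutOfAggregate returns the slaves whose ad_aggregator_id differs
// from the bond's active aggregator ID; in an 802.3ad bond those links
// carry no traffic. Name and signature are illustrative only.
func slavesOutOfAggregate(activeID int, slaveIDs map[string]int) []string {
	var out []string
	for name, id := range slaveIDs {
		if id != activeID {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	// Active aggregator is 2; enp216s0f0 reports 3 and is therefore
	// out of the aggregation, as in the shell session above.
	out := slavesOutOfAggregate(2, map[string]int{
		"enp96s0f0":  2,
		"enp216s0f0": 3,
	})
	fmt.Println(out)
}
```

With node_bonding_aggregator_id and node_bonding_active_aggregator_id exported as gauges, the same comparison could be written directly as a PromQL alerting rule instead of host-side code.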
I came across this today looking into monitoring switch-side misconfigurations (LACP bond with no active members on the host side). I think I have some time next week to look at extending the procfs module to collect these statistics and the bonding collector to export them.
@bewing Are you working on this? If not, I can take this up.
I started on this a year ago, and got as far as opening some issues in related projects, identifying the need to flesh out test fixtures in https://github.com/prometheus/node_exporter/pull/2347 so as to improve procfs via https://github.com/prometheus/procfs/pull/439 to make the metrics available.
The working directories are lost to the ether, and surely the ground under them has moved. It's back to a fresh start at this point, but maybe the actual code changes in the procfs pull are still useful. I have not had time to work on this, and encourage others to make the attempt if they can.
node_exporter currently exposes details about network bonding, which is great. To be able to monitor more failure cases, we would need additional metrics which we haven't found in node_exporter yet:
Essentially, we would need the following metrics:
It could also make sense to add some more information in the same go. So far, we haven't required these in our alerting, but they may be useful nevertheless:
We currently use a shell script and node_exporter's textfile collector to fill this gap. However, I think it would be useful to support these metrics out of the box, especially since only /sys files need to be read.
I'd volunteer to work on PRs against procfs and node_exporter. I would suggest adding this to the existing bonding_linux.go as it is closely related. Looks like this would also imply converting the existing bonding collector to procfs.
Related: #841
@SuperQ @discordianfish @pgier What do you think? Does it make sense in general? Should we only implement the first two metrics or the more generic approach? Do the suggested names make sense?