prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

Add the ethtool counters related to RDMA/ROCE #3137

Open gangxie112 opened 1 month ago

gangxie112 commented 1 month ago

Hi,

It seems that some important ethtool metrics related to RDMA/ROCE are not supported, such as tx.pause.ctrl.phy, rx.prio5.pause, etc. Those counters are very important in ROCE networks and are included in the physical/priority port counters.

So, do we have any plan to support them?

discordianfish commented 1 month ago

Dunno how ethtool retrieves them, but if there is a way to retrieve them that doesn't require privileges, we're open to a PR for that.

dswarbrick commented 1 month ago

Is tx_pause_ctrl_phy vendor or model specific? The only reference to it I can find is for the Mellanox ConnectX series of NICs which use the mlx5 driver, https://www.kernel.org/doc/html/latest/networking/device_drivers/ethernet/mellanox/mlx5/counters.html

In addition to the basic set of ethtool counters which are mature and implemented by pretty much every NIC, there are also quite a few vendor-specific ethtool stats / options.

gangxie112 commented 1 month ago

Yes, those metrics are proprietary to specific NIC vendors. But since some NICs are widely used, we should at least consider some other way to support them, such as adding a plugin framework. At this time, users have to develop an agent to gather and push the metrics. This is the typical approach adopted by many cloud providers, as far as I know.

dswarbrick commented 1 month ago

The textfile collector feature is arguably the "plugin framework" in node_exporter.
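As a rough illustration of the textfile approach, the sketch below parses output in the style of `ethtool -S <device>` and renders the selected counters in the Prometheus text exposition format. The metric name `node_ethtool_vendor_stat`, the function names, and the sample values are all hypothetical and chosen only for this example; they are not something node_exporter defines.

```python
# Hypothetical metric name, used only for this sketch.
METRIC = "node_ethtool_vendor_stat"

# Made-up sample in the style of `ethtool -S <device>` output.
SAMPLE = """\
NIC statistics:
     rx_packets: 1200
     tx_pause_ctrl_phy: 7
     rx_prio5_pause: 42
"""


def parse_ethtool_stats(text):
    """Parse `ethtool -S` style output into a {counter_name: int} dict."""
    stats = {}
    for line in text.splitlines():
        # Split on the last colon; the header line has no numeric value.
        name, _, value = line.strip().rpartition(":")
        value = value.strip()
        if name and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def escape_label(value):
    """Escape a label value for the Prometheus text format."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_textfile(device, stats, wanted=None):
    """Render counters as text-format lines; `wanted` optionally filters names."""
    lines = [
        f"# HELP {METRIC} Vendor-specific ethtool counter (illustrative).",
        f"# TYPE {METRIC} untyped",
    ]
    for name in sorted(stats):
        if wanted is not None and name not in wanted:
            continue
        labels = f'device="{escape_label(device)}",stat="{escape_label(name)}"'
        lines.append(f"{METRIC}{{{labels}}} {stats[name]}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    wanted = {"tx_pause_ctrl_phy", "rx_prio5_pause"}
    print(render_textfile("eth0", parse_ethtool_stats(SAMPLE), wanted), end="")
```

In a real setup, the text would come from running `ethtool -S` on the actual interface (for example from a cron job), and the result would be written atomically (write to a temporary file, then rename) to a `*.prom` file in the directory passed to node_exporter via `--collector.textfile.directory`.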

Implementing support natively for vendor- / hardware-specific counters is tricky without having access to said hardware for testing. I would suggest either attempting to implement this yourself (assuming that you have access to such hardware, and are a reasonably proficient Go developer), or loan some hardware to a developer who is willing to do the work.