prometheus / procfs

procfs provides functions to retrieve system, kernel and process metrics from the pseudo-filesystem proc.
Apache License 2.0

Add new sysfs class for Amazon Elastic Fabric Adapter #515

Open perifaws opened 1 year ago

perifaws commented 1 year ago

This change adds a new sysfs class for reading metrics from the Amazon Elastic Fabric Adapter (EFA). It is based on the existing InfiniBand class.

EFA is supported on a variety of Amazon EC2 instance types (list here) and is relevant for HPC and distributed ML training applications in the same way as InfiniBand.

I also wrote an associated node_exporter collector to validate this change. Happy to provide sample output on request. Thanks!

Related to the Prometheus Google Groups thread: https://groups.google.com/g/prometheus-developers/c/MEal59mDebs/m/ZQBU1f0hCAAJ

matthiasr commented 1 year ago

Can you please add some unit tests with examples of what the /sys structure looks like? Otherwise this code will be impossible to maintain with confidence.
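For illustration, a fixture-based check of the kind requested could look roughly like the sketch below. It is not the code from this PR. It assumes the counters live under `class/infiniband/<device>/ports/<port>/hw_counters/`, and the device name `rdmap0s6`, the counter names `rx_bytes` and `tx_bytes`, and the helpers `writeFixture`, `readCounters`, and `fixtureCounters` are all hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// writeFixture creates a small fake sysfs tree under root.
// Keys are paths relative to root, values are file contents.
func writeFixture(root string, files map[string]string) error {
	for rel, content := range files {
		p := filepath.Join(root, rel)
		if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
			return err
		}
		if err := os.WriteFile(p, []byte(content), 0o644); err != nil {
			return err
		}
	}
	return nil
}

// readCounters parses every regular file in dir as an unsigned decimal counter.
func readCounters(dir string) (map[string]uint64, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	out := make(map[string]uint64, len(entries))
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		raw, err := os.ReadFile(filepath.Join(dir, e.Name()))
		if err != nil {
			return nil, err
		}
		v, err := strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
		if err != nil {
			return nil, fmt.Errorf("parse %s: %w", e.Name(), err)
		}
		out[e.Name()] = v
	}
	return out, nil
}

// fixtureCounters builds the (assumed) layout in a temp dir and reads it back.
func fixtureCounters() (map[string]uint64, error) {
	root, err := os.MkdirTemp("", "sysfs-fixture")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)

	// Hypothetical device and counter names, for illustration only.
	counters := filepath.Join("class", "infiniband", "rdmap0s6", "ports", "1", "hw_counters")
	err = writeFixture(root, map[string]string{
		filepath.Join(counters, "rx_bytes"): "1024\n",
		filepath.Join(counters, "tx_bytes"): "2048\n",
	})
	if err != nil {
		return nil, err
	}
	return readCounters(filepath.Join(root, counters))
}

func main() {
	got, err := fixtureCounters()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("rx_bytes:", got["rx_bytes"], "tx_bytes:", got["tx_bytes"])
}
```

A real test would compare the parsed values against the fixture in the same way, which also documents the expected /sys layout for future maintainers.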

dcbw commented 1 year ago

What's EFA specific about the collector? I can't see anywhere that it checks the PCI device ID or something like that for an Amazon VID/PID. Looks like it just looks in the normal infiniband directories?

E.g., if I have a random Mellanox IB device, will this collector ignore it?
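One way such a check could be done is sketched below, as an assumption rather than what the PR implements. It reads the `device/vendor` file that sysfs exposes for a PCI-backed device and compares it with Amazon's PCI vendor ID (0x1d0f), so a Mellanox device (vendor 0x15b3) would be skipped. The helper names `isAmazonVendor` and `efaDevices` are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// amazonPCIVendorID is the PCI vendor ID registered to Amazon.
const amazonPCIVendorID = 0x1d0f

// isAmazonVendor reports whether the contents of a sysfs "vendor" file
// (for example "0x1d0f\n") identify an Amazon PCI device.
func isAmazonVendor(raw string) bool {
	// Base 0 lets ParseUint accept the "0x" prefix used by sysfs.
	id, err := strconv.ParseUint(strings.TrimSpace(raw), 0, 16)
	return err == nil && id == amazonPCIVendorID
}

// efaDevices lists the entries of <sysfsRoot>/class/infiniband whose
// underlying PCI device reports Amazon as its vendor.
func efaDevices(sysfsRoot string) ([]string, error) {
	classDir := filepath.Join(sysfsRoot, "class", "infiniband")
	entries, err := os.ReadDir(classDir)
	if err != nil {
		return nil, err
	}
	var devs []string
	for _, e := range entries {
		raw, err := os.ReadFile(filepath.Join(classDir, e.Name(), "device", "vendor"))
		if err != nil {
			continue // no readable vendor file: treat as not EFA
		}
		if isAmazonVendor(string(raw)) {
			devs = append(devs, e.Name())
		}
	}
	return devs, nil
}

func main() {
	devs, err := efaDevices("/sys")
	if err != nil {
		fmt.Println("cannot read infiniband class directory:", err)
		return
	}
	fmt.Println("EFA devices:", devs)
}
```

Filtering on the vendor ID alone would accept any Amazon device registered under the infiniband class; matching the PCI device ID as well would be stricter.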