mej / nhc

LBNL Node Health Check

check_hw_ib gives an unhelpful message when the adapter is missing #111

Open OleHolmNielsen opened 2 years ago

OleHolmNielsen commented 2 years ago

We're running the NHC 1.4.3 RC1 RPM lbnl-nhc-1.4.3-1.el8.noarch on ~100 AlmaLinux 8.5 systems. These servers have Cornelis (Intel) Omni-Path 100 Gbit adapters, and I check them with this rule in nhc.conf:

d*.nifl.fysik.dtu.dk  || check_hw_ib 100

As part of some hardware testing I removed the adapter, and NHC now rightly reports an error, albeit a strange one:

ERROR:  nhc:  Health check failed:  check_hw_ib:  Version mismatch between kernel OFED drivers and userspace OFED libraries.

I wonder if a more informative error message could be issued, such as "missing network interface" or similar? Thanks, Ole

OleHolmNielsen commented 2 years ago

I propose a patch to lbnl_hw.nhc: lbnl_hw.nhc.diff.txt

With the patch, when no IB device is present, NHC prints this error:

# nhc
ERROR:  nhc:  Health check failed:  check_hw_ib:  No Infiniband device was found.
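For readers who want to see what such a presence test could look like, here is a minimal sketch. It is not the content of lbnl_hw.nhc.diff.txt and not NHC's own code. The function name ib_device_present is hypothetical. The sketch assumes the kernel lists the adapter under /sys/class/infiniband, which is where RDMA-class devices, including Omni-Path hfi1 adapters, normally appear once their driver is loaded.

```shell
# Hypothetical helper, NOT the code from the proposed patch.
# Returns 0 if the given sysfs directory contains at least one device
# entry, and 1 if the directory is empty or does not exist.
ib_device_present() {
    # Assumed default location; pass another directory to override it.
    local sysfs_dir="${1:-/sys/class/infiniband}"
    local entry
    for entry in "$sysfs_dir"/*; do
        # An unmatched glob stays literal, so confirm the entry exists.
        [ -e "$entry" ] && return 0
    done
    return 1
}
```

A check could call this first, for example `ib_device_present || echo "No Infiniband device was found."`, and only then go on to compare driver and library versions. That way a missing adapter is reported as such instead of as a version mismatch.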