mej / nhc

LBNL Node Health Check

lbnl_hw: Fixes/speedups for procfs file reads #121

Closed by mej 1 year ago

mej commented 1 year ago

This branch/changeset is a possible solution/improvement for the /proc file I/O problem (e.g., #30). The plan is to extensively refactor lbnl_hw.nhc to:

These changes are intended to fix #30, #39, #43, #47, and #118 as well as some older LANL-internal issues with Trinity (our Haswell/KNL-based, nineteen-thousand-node HPE/Cray XC40).

And with respect to Trinity, I would be remiss were I to fail to express my sincere thanks to @grahamvh, my colleague at @lanl and one of the main sysadmins for that system, who helped me immensely in brainstorming, devising potential solutions, testing, and providing critical feedback en route toward finally getting this problem licked!

Feedback on this approach is much appreciated!

mej commented 1 year ago

These changes have passed testing here at LANL with flying colors, so in the absence of any further feedback, this will be merged in.