This branch/changeset extensively refactors lbnl_hw.nhc as a possible solution/improvement for the /proc file I/O problem (e.g., #30). Specifically, the plan is to:
Split nhc_hw_gather_data() into distinct functions per section, allowing more fine-grained control over which procfs and/or sysfs files actually need to be read and parsed in the first place;
Convert the path+filename for each file being parsed into a configuration variable that users can customize as needed;
Alter the way NHC pulls data from each file, using a single read invocation, to avoid the lseek()/rebuild problems described by @mattmix in #43;
Add support for using a cached copy of the original file (instead of reading directly from /proc or /sys) to avoid the poor performance that direct reads can incur;
Allow for the aforementioned configuration variables to specify a process substitution expression (see bash(1) for details) rather than a path+filename. This, in turn, permits the user to, for example, grep out unused lines to minimize parsing time. (For a specific example and rationale, see this comment from @NateCrawford with relevant performance comparisons.)
Improve the unit tests for this module; now that the hardware checks are capable of reading from a user-defined location, the unit tests feed auto-generated data into the checks via the /dev/stdin "file" (really a shell variable). Not only does this verify the new "user-specified custom data source" code, it also expands the "code coverage" of the unit tests. The old tests just directly assigned the test data to the right NHC-internal variables; the new tests cover the parsing code as well, not just the checks themselves.
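To make the list above a bit more concrete, here is a minimal sketch of the general idea, not the actual code in this branch. The names MEMINFO_SRC, slurp_source, and mem_total_kb are hypothetical and do not come from lbnl_hw.nhc. The use of eval to expand a process substitution stored in a variable is likewise an assumption about one way this could be done. The sketch slurps a data source with one read call, parses it from memory, and takes that source from a variable that can hold a path, a cached copy, a process substitution, or /dev/stdin.

```bash
#!/usr/bin/env bash
# Illustrative sketch only. All names here are hypothetical and are NOT
# taken from lbnl_hw.nhc.

# Data source for memory info. Defaults to procfs; a site could instead set
# it to a cached copy or to a process substitution expression, e.g.:
#   MEMINFO_SRC='<(grep -E "^(Mem|Swap)" /proc/meminfo)'
MEMINFO_SRC="${MEMINFO_SRC:-/proc/meminfo}"

# Slurp an entire data source into the variable named by $1 using a single
# `read` call. $2 is a path (without spaces) or a '<(...)' expression.
slurp_source() {
    local __var="$1" __src="$2" __data=''
    # eval lets bash expand a '<(...)' expression stored in the variable
    # (an assumption of this sketch; the setting must come from a trusted
    # config). `read -d ''` consumes input up to EOF and returns non-zero
    # there even when data was read, hence the `|| true`.
    eval "IFS= read -r -d '' __data < $__src" || true
    printf -v "$__var" '%s' "$__data"
    [[ -n $__data ]]    # succeed only if something was read
}

# Extract the MemTotal value (in kB) from already-slurped text.
# This parses from memory, so it triggers no further file I/O.
mem_total_kb() {
    local re='MemTotal:[[:space:]]+([0-9]+)'
    [[ $1 =~ $re ]] && echo "${BASH_REMATCH[1]}"
}

# Demo in the style of the new unit tests: feed sample data through
# /dev/stdin (here backed by a here-string) instead of reading /proc.
sample=$'MemTotal:       16384256 kB\nMemFree:         1024000 kB'
slurp_source demo_data /dev/stdin <<< "$sample"
mem_total_kb "$demo_data"
```

In real use the call would look like `slurp_source data "$MEMINFO_SRC"`. Because the parser only ever sees a string, the same code path handles live procfs data, a cached file, a grep-filtered stream, and test data alike.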
These changes are intended to fix #30, #39, #43, #47, and #118, as well as some older LANL-internal issues with Trinity (our Haswell/KNL-based, nineteen-thousand-node HPE/Cray XC40).
And with respect to Trinity, I would be remiss were I to fail to express my sincere thanks to @grahamvh, my colleague at @lanl and one of the main sysadmins for that system, who helped me immensely in brainstorming, devising potential solutions, testing, and providing critical feedback en route toward finally getting this problem licked!
Feedback on this approach is much appreciated!