mej / nhc

LBNL Node Health Check

lbnl_cmd.nhc: Remove line numbers from dmesg check #144

Closed mej closed 1 year ago

mej commented 1 year ago

When check_cmd_dmesg() (as written in scripts/lbnl_cmd.nhc) is used directly with a negated match string, the default error reporting of check_cmd_output() (which check_cmd_dmesg() wraps) puts into the "Reason" field not only the match string that was found (and shouldn't have been) but also the line number where the match was found. For dmesg output, the line number is almost completely useless; moreover, it prevents Slurm and other schedulers/RMs from grouping all the affected nodes together, because the line numbers almost always differ!

Granted, users/admins can override the default failure message generation behavior (via -M entries, all of which are passed directly to check_cmd_output()). But in the specific case of check_cmd_dmesg(), I think the default behavior should suppress the line numbers and use a simpler, more concise message instead.

This changeset does exactly that by adding a bit of pre-processing to the command-line arguments passed to check_cmd_dmesg() before passing them on to check_cmd_output(). Each match string (-m argument) that doesn't already have a corresponding message (-M argument) overriding the default is given a new default that omits the extraneous information. In other words, any -m mstr that already has a matching -M message is passed on to check_cmd_output() exactly as it is; any -m mstr that lacks a corresponding -M message (or that has an empty message as a placeholder) is assigned a new -M message, which is passed to check_cmd_output() without any line number or other dynamic information.
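For illustration only, the pre-processing described above could be sketched roughly as follows. This is not the code from scripts/lbnl_cmd.nhc: the function name `fill_default_messages`, the array `FILLED_ARGS`, and the wording of the default message are all hypothetical. The sketch also assumes that the Nth -M message pairs with the Nth -m match string and that options arrive in the separated form (`-m STR`, `-M MSG`).

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the argument pre-processing; NOT the actual NHC code.
# Assumptions: the Nth -M pairs with the Nth -m, and options use the
# separated form ("-m STR" / "-M MSG"). All names here are made up.

# Rewrites its arguments into the global array FILLED_ARGS so that every
# -m match string is followed by a non-empty -M message.
fill_default_messages() {
    local -a matches=() messages=() rest=()
    local i msg

    # Split the arguments into match strings, messages, and everything else.
    while (( $# > 0 )); do
        case "$1" in
            -m) matches+=("${2-}"); shift ;;
            -M) messages+=("${2-}"); shift ;;
            *)  rest+=("$1") ;;
        esac
        shift
    done

    FILLED_ARGS=()
    for (( i = 0; i < ${#matches[@]}; i++ )); do
        msg="${messages[i]-}"
        # A missing or empty message gets a static default: no line number
        # or other per-node detail. The wording is hypothetical.
        if [[ -z "$msg" ]]; then
            msg="dmesg check failed for match string \"${matches[i]}\""
        fi
        FILLED_ARGS+=(-m "${matches[i]}" -M "$msg")
    done

    # Remaining options (for example a timeout) are passed through unchanged.
    FILLED_ARGS+=("${rest[@]}")
}

# Example: the first match has only an empty placeholder message,
# the second already carries a custom one.
fill_default_messages -m '!/Oops/' -m '/ready/' -M '' -M 'custom message'
printf '%s\n' "${FILLED_ARGS[@]}"
```

In this sketch the first match string receives the static default while the second keeps its custom message, so every node failing the same check reports an identical "Reason" string that a scheduler can group on.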

Fixes #143.