mej / nhc

LBNL Node Health Check

check_cmd_dmesg() Reason Strings Cause Problems #143

Closed mej closed 10 months ago

mej commented 10 months ago

When check_cmd_dmesg() is used directly (as written in scripts/lbnl_cmd.nhc) with a negated match string, the default error reporting of check_cmd_output() (which check_cmd_dmesg() wraps) puts into the "Reason" field not only the match string that was found (and shouldn't have been) but also the line number where the match was found. For dmesg output, that line number is almost completely useless. Worse, it prevents Slurm and other schedulers/RMs from grouping all the affected nodes together, because the line numbers almost always differ!

Granted, users/admins can override the default failure message generation behavior (via -M entries, all of which are passed directly to check_cmd_output()). In the specific case of check_cmd_dmesg(), though, I think the default behavior should suppress the line numbers and use a simpler, more concise message instead.
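
To make the grouping problem concrete, here is a minimal sketch in plain bash. It does not call any NHC code: `build_reason` and `count_distinct` are hypothetical helpers, the sample dmesg text is made up, and the message format is only an illustration, not NHC's actual wording. It shows how the same match landing on different lines yields different reason strings per node, while the line-free form yields one shared string.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of NHC): build a failure reason for the first
# line of the text in $2 that contains the fixed string in $1. Passing
# "with_line" as $3 appends the line number of the match.
build_reason() {
    local match="$1" text="$2" mode="${3:-}"
    local lineno
    lineno=$(printf '%s\n' "$text" | grep -n -F -m 1 -- "$match" | cut -d: -f1)
    [ -n "$lineno" ] || return 1    # no match, so there is nothing to report
    if [ "$mode" = "with_line" ]; then
        printf 'dmesg: matched "%s" on line %s\n' "$match" "$lineno"
    else
        printf 'dmesg: matched "%s"\n' "$match"
    fi
}

# Hypothetical helper: count how many distinct strings were passed in.
count_distinct() {
    printf '%s\n' "$@" | sort -u | wc -l | tr -d ' '
}

# Made-up dmesg output from two nodes; the same error sits on different lines.
node_a=$'boot ok\nmce: Hardware Error'
node_b=$'boot ok\neth0 up\nnfs mounted\nmce: Hardware Error'

with_a=$(build_reason 'Hardware Error' "$node_a" with_line)
with_b=$(build_reason 'Hardware Error' "$node_b" with_line)
plain_a=$(build_reason 'Hardware Error' "$node_a")
plain_b=$(build_reason 'Hardware Error' "$node_b")

# With line numbers, each node gets its own reason; without them, both share one.
echo "distinct reasons with line numbers:    $(count_distinct "$with_a" "$with_b")"
echo "distinct reasons without line numbers: $(count_distinct "$plain_a" "$plain_b")"
```

A scheduler that groups down nodes by reason string would see two groups in the first case and a single group in the second, which is the behavior being requested as the default here.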