szhengac opened this issue 1 year ago
Did you configure slurm.conf to call NHC? We use the line: HealthCheckProgram=/usr/sbin/nhc
Yes, this was configured. I can see that nhc was called by Slurm, since slurmd.log has the following lines:
[2023-11-08T08:14:23.445] error: health_check failed: rc:1 output:ERROR: nhc: Health check failed: Check check_xid_errors returned 1
[2023-11-08T08:19:23.705] error: health_check failed: rc:1 output:ERROR: nhc: Health check failed: Check check_xid_errors returned 1
[2023-11-08T08:24:23.953] error: health_check failed: rc:1 output:ERROR: nhc: Health check failed: Check check_xid_errors returned 1
[2023-11-08T08:29:24.274] error: health_check failed: rc:1 output:ERROR: nhc: Health check failed: Check check_xid_errors returned 1
[2023-11-08T08:34:23.492] error: health_check failed: rc:1 output:ERROR: nhc: Health check failed: Check check_xid_errors returned 1
I've never seen a check named "check_xid_errors"; I wonder where that came from. Did you define it in your nhc.conf file?
Yes, this is from https://github.com/NVIDIA/deepops. I added this line to nhc.conf:

ib-vm-25 || check_xid_errors

The check is defined as:
check_xid_errors() {
    # XID codes to ignore, as an extended regex (94 here)
    excluded_xid='94'
    # Collect the distinct NVIDIA XID codes logged during the last hour.
    # Note: grep -x matches whole lines, so excluding 94 does not also
    # drop codes that merely contain "94" (e.g. 194 or 940).
    xid_list=$(journalctl -b 0 --since "1 hour ago" --no-pager 2>/dev/null \
        | grep "NVRM: Xid" | sed 's/^.*\] \(.*\)/\1/' | awk '{print $9}' \
        | sed 's/,//' | sort -n | uniq \
        | grep -v -x -E "${excluded_xid}" | paste -s -d,)
    if [ -n "$xid_list" ]; then
        echo "Found XID errors: $xid_list"
        return 1
    fi
    return 0
}
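
(To verify the check fires outside of Slurm, you can run nhc by hand and watch its output; a minimal sketch, assuming a default install where -d turns on nhc's debug tracing:)

sudo nhc -d
echo $?   # non-zero when a check fails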
I don't know about this check. You could try to configure a "fake" check in nhc.conf on the node, such as a check_hw_physmem check with limits that are definitely wrong. This should cause slurmd to mark the node offline the next time it calls NHC. Make sure to configure all the NHC parameters in slurm.conf, for example:

HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=3600
HealthCheckNodeState=ANY

The default value of HealthCheckInterval is 0, which disables NHC!
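As a concrete example of a deliberately failing line in nhc.conf (a sketch; check_hw_physmem takes minimum and maximum physical RAM in kB, and these bounds are intentionally impossible so the check always fails):

* || check_hw_physmem 999999999999 9999999999999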
BTW, which version of Slurm do you run?
I am using HealthCheckNodeState=IDLE. Do I need to use ANY? I have HealthCheckInterval=300 in my slurm.conf.
I am using Slurm 23.02.4.
I tried the standard check check_hw_cpuinfo in nhc but still had no luck. The helper scripts are still not called.
[2023-11-08T20:44:23.753] error: health_check failed: rc:1 output:ERROR: nhc: Health check failed: check_hw_cpuinfo: Actual CPU thread count (176) does not match expected (1760).
@OleHolmNielsen I think the helper script should be run by nhc rather than by Slurm, right? Based on the log, nhc is definitely being executed by Slurm.
I can now confirm that it is a bug in nhc 1.4.3. Reinstalling 1.4.3 does not help, but reinstalling 1.4.2 fixes the problem.
I can confirm 1.4.3 doesn't run the /usr/libexec/nhc/node-mark-offline script to drain a node with failing checks. I downgraded to 1.4.2 as @szhengac suggested, and it works fine. I tried debugging a bit but didn't find the cause.
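(For reference, a downgrade on an RPM-based system might look like the line below; the lbnl-nhc package name and the local RPM filename are assumptions based on the upstream release artifacts, so adjust for your distro and install method:)

rpm -Uvh --oldpackage lbnl-nhc-1.4.2-1.el8.noarch.rpm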
Addendum: the node-mark-offline script itself works just fine.
Hello,
FWIW, I had this problem in 1.4.3 because scontrol was not in the PATH, so the resource-manager auto-detection in nhcmain_find_rm didn't work. Setting NHC_RM in /etc/sysconfig/nhc worked for me.
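A minimal sketch of that workaround (the PATH line is an assumption for installs where scontrol lives outside nhc's default PATH; adjust the Slurm bin directory to your site):

# /etc/sysconfig/nhc
NHC_RM=slurm
# Ensure scontrol is findable by nhc and its helpers (example path):
PATH="$PATH:/opt/slurm/bin"
export PATH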
I believe we're being affected by this issue as well. Any movement on this? I'm experiencing exactly the same behavior as @szhengac, and I'm at my wit's end.
Hi,
I am testing nhc with Slurm to automatically drain nodes with uncorrectable ECC errors. The nhc log shows that the health check fails on the problematic node, but no helper scripts are executed to put the node into drain state. If I manually call the helper script, like sudo NHC_RM=slurm bash /usr/libexec/nhc/node-mark-offline ib-vm-25, the node is put into drain state. How can I get Slurm and nhc to call the helper scripts automatically when a node fails the health check? Thanks!

/var/log/nhc.log: