mej / nhc

LBNL Node Health Check

Helper scripts are not called when the node fails the health check with Slurm #147

Open szhengac opened 8 months ago

szhengac commented 8 months ago

Hi,

I am testing nhc with Slurm to automatically drain nodes with uncorrectable ECC errors. The nhc log shows that the health check fails on the problematic node, but no helper script is executed to put the node into the drain state. If I call the helper script manually, e.g. sudo NHC_RM=slurm bash /usr/libexec/nhc/node-mark-offline ib-vm-25, the node is put into the drain state. How can I get Slurm and nhc to call the helper scripts automatically when a node fails the health check? Thanks!

/var/log/nhc.log:

Found XID errors: 63
Node Health Check failed.  Check check_xid_errors returned 1
ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed.  Check check_xid_errors returned 1
ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed.  Check check_xid_errors returned 1
ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed.  Check check_xid_errors returned 1
ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed.  Check check_xid_errors returned 1
ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1
Found XID errors: 63
Node Health Check failed.  Check check_xid_errors returned 1
ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1
OleHolmNielsen commented 8 months ago

Did you configure slurm.conf to call NHC? We use the line: HealthCheckProgram=/usr/sbin/nhc

szhengac commented 8 months ago

Yes, this was configured. I can see that nhc was called by Slurm, since slurmd.log has the following lines:

[2023-11-08T08:14:23.445] error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1

[2023-11-08T08:19:23.705] error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1

[2023-11-08T08:24:23.953] error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1

[2023-11-08T08:29:24.274] error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1

[2023-11-08T08:34:23.492] error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  Check check_xid_errors returned 1
OleHolmNielsen commented 8 months ago

I've never seen a check named "check_xid_errors". Where did it come from? Did you define it in your nhc.conf file?

szhengac commented 8 months ago

Yes, this is from https://github.com/NVIDIA/deepops. I added this line to nhc.conf: ib-vm-25 || check_xid_errors

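check_xid_errors() {
        # Xid codes to ignore (used below as an extended regex)
        excluded_xid='94'
        # Collect the distinct Xid codes logged by the NVIDIA driver during the
        # last hour of the current boot, as a comma-separated list
        xid_list=$(journalctl -b 0  --since "1 hour ago" --no-pager 2> /dev/null | grep "NVRM: Xid" | sed 's/^.*\] \(.*\)/\1/' | awk '{print $9}' | sed 's/,//' | sort -n | uniq | grep -v -E "${excluded_xid}" | paste -s -d,)
        # A non-empty list means at least one non-excluded Xid was found: fail the check
        if [ x"$xid_list" != x"" ]; then
                echo "Found XID errors: $xid_list"
                return 1
        fi
        return 0
}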
OleHolmNielsen commented 8 months ago

I don't know about this check. You could try to configure a "fake" check in nhc.conf on the node, like adding a check of check_hw_physmem for values that are definitely wrong. This should cause slurmd to mark the node offline next time it calls NHC. Make sure to configure all the NHC parameters in slurm.conf, for example:

HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=3600
HealthCheckNodeState=ANY

The default value of HealthCheckInterval is 0 which disables NHC!

BTW, which version of Slurm do you run?
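
The slurm.conf settings discussed above can be sanity-checked with a short shell sketch. The function name check_healthcheck_settings, the default path /etc/slurm/slurm.conf, and the warning text are assumptions made for illustration; none of them come from Slurm or NHC. The sketch only reads the file and assumes one setting per line with no spaces around "=".

```shell
# Hypothetical helper: list the HealthCheck* settings in a slurm.conf-style
# file and warn when HealthCheckInterval is missing or 0.
check_healthcheck_settings() {
    local conf="${1:-/etc/slurm/slurm.conf}"   # assumed path; adjust for your site
    local interval

    # Show every uncommented HealthCheck* line, matched case-insensitively.
    grep -i -E '^[[:space:]]*HealthCheck(Program|Interval|NodeState)=' "$conf"

    # Value of the last HealthCheckInterval line, if any
    # (strip the key, then anything after the first space or '#').
    interval=$(grep -i -E '^[[:space:]]*HealthCheckInterval=' "$conf" \
        | tail -n 1 | sed -e 's/^[^=]*=//' -e 's/[[:space:]#].*$//')

    if [ -z "$interval" ] || [ "$interval" = "0" ]; then
        echo "WARNING: HealthCheckInterval is unset or 0; slurmd will not run the health check periodically"
        return 1
    fi
    return 0
}
```

On a running cluster, scontrol show config | grep -i HealthCheck should show the values Slurm is actually using, which may differ from the file if the daemons have not been reconfigured since the last edit.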

szhengac commented 8 months ago

I am using HealthCheckNodeState=IDLE and HealthCheckInterval=300 in my slurm.conf. Do I need to use ANY?

I am using Slurm 23.02.4.

szhengac commented 8 months ago

I tried the standard nhc check check_hw_cpuinfo but still had no luck. The helper scripts are still not called.

[2023-11-08T20:44:23.753] error: health_check failed: rc:1 output:ERROR:  nhc:  Health check failed:  check_hw_cpuinfo:  Actual CPU thread count (176) does not match expected (1760).
szhengac commented 8 months ago

@OleHolmNielsen I think the helper script should be run by nhc rather than Slurm? Based on the log, nhc is definitely executed by Slurm.

szhengac commented 8 months ago

I can now confirm that it is a bug in nhc 1.4.3. Reinstalling 1.4.3 does not help, but installing 1.4.2 instead fixes the problem.

KasperSkytte commented 6 months ago

I can confirm 1.4.3 doesn't run the /usr/libexec/nhc/node-mark-offline script to drain a node with failing checks. I downgraded to 1.4.2 as @szhengac suggested, and it works fine. I tried debugging a bit, but didn't find the cause.

KasperSkytte commented 6 months ago

To add: the node-mark-offline script itself works just fine.

jbd commented 5 months ago

Hello,

FWIW, I had the problem in 1.4.3 because scontrol was not in the PATH, so the resource manager auto-detection in nhcmain_find_rm didn't work. Setting NHC_RM in /etc/sysconfig/nhc worked for me.
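
A minimal sketch of how one might check for the situation described in this comment: it looks for scontrol on a given search path and, if it is missing, prints the NHC_RM line that the comment reports as a workaround. The function name suggest_nhc_rm and its messages are hypothetical. Whether slurmd runs nhc with the same PATH as your interactive shell is an assumption you would need to verify on your own system.

```shell
# Hypothetical helper: report whether scontrol is visible on a given PATH.
suggest_nhc_rm() {
    local search_path="${1:-$PATH}"
    local found

    # Look up scontrol in a subshell so the caller's PATH is left untouched.
    if found=$(PATH="$search_path"; command -v scontrol); then
        echo "scontrol found at $found"
        return 0
    fi

    echo "scontrol not found on PATH=$search_path"
    echo "Consider adding this line to /etc/sysconfig/nhc: NHC_RM=slurm"
    return 1
}
```

Calling suggest_nhc_rm with the PATH that slurmd actually uses, rather than the login shell's PATH, is the more meaningful test, since a binary installed under a non-standard prefix may be visible interactively but not to the daemon.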

Zoidmania commented 4 months ago

I believe we're being affected by this issue as well. Any movement on this? I'm experiencing exactly the same behavior as @szhengac, and I'm at my wit's end.