mej / nhc

LBNL Node Health Check

Manually drained node got resumed after watchdog timer was unable to terminate hung NHC process #134

Open indermuehle-unibe opened 1 year ago

indermuehle-unibe commented 1 year ago

Hi everyone

We are running lbnl-nhc-1.4.2-1.el7 on CentOS 7 with SLURM 20.11.9 and recently had a problematic node that we manually drained. Suddenly we noticed that it was failing a bunch of jobs and was back in the alloc state again. We then found this in the log:

20230429 16:10:57 [slurm] /usr/libexec/nhc/node-mark-online wn-1-30.local
/usr/libexec/nhc/node-mark-online:  Not onlining wn-1-30:  Down for non-NHC reason (ADMIN SHUTDOWN)
NHC watchdog timer 16264 (30 secs) has expired.  Signaling NHC:  kill -s ALRM -- -16263
NHC watchdog timer 16264 terminating NHC:  kill -s TERM -- -16263
NHC watchdog timer 16264 terminating NHC with extreme prejudice:  kill -s KILL -- -16263
ERROR:  nhc:  Health check failed:  Watchdog timer unable to terminate hung NHC process 16263.
20230429 16:16:31 /usr/libexec/nhc/node-mark-offline wn-1-30.local Watchdog timer unable to terminate hung NHC process 16263.
/usr/libexec/nhc/node-mark-offline:  Marking  wn-1-30 offline:  NHC: Watchdog timer unable to terminate hung NHC process 16263.
20230429 16:20:57 [slurm] /usr/libexec/nhc/node-mark-online wn-1-30.local
/usr/libexec/nhc/node-mark-online:  Marking wn-1-30 online and clearing note (NHC: Watchdog timer unable to terminate hung NHC process 16263.)
20230429 16:25:57 [slurm] /usr/libexec/nhc/node-mark-online wn-1-30.local
/usr/libexec/nhc/node-mark-online:  Node wn-1-30 is already online.

As you can see from these logs, the node got resumed by NHC: because the watchdog timer was unable to terminate the hung NHC process, NHC overwrote the manual non-NHC note with its own watchdog-timer note. On the next run it no longer found anything wrong with the node, so it cleared the note (which now appeared to belong to NHC) and brought the node back online.
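To illustrate the note-ownership convention at play, here is a rough sketch inferred from the messages in the log above; it is not the actual node-mark-online code, and the scontrol parsing in particular is an assumption:

#!/bin/bash
# Rough sketch of the "NHC: " note-ownership convention, inferred from the log above.
# NOT the real NHC helper script; the Reason parsing below is an assumption.

NODE="$1"
# Pull the current drain/down reason from Slurm (output format assumed).
NOTE="$(scontrol show node "$NODE" | sed -n 's/.*Reason=\(.*\) \[.*\]$/\1/p')"

if [[ -z "$NOTE" || "$NOTE" == "NHC: "* ]]; then
    # No note, or a note NHC itself wrote earlier: safe to bring the node back online.
    scontrol update NodeName="$NODE" State=RESUME
else
    # Anything else was set by an admin or another tool: leave the node alone.
    echo "Not onlining $NODE:  Down for non-NHC reason ($NOTE)"
fi

Under a convention like that, once the watchdog failure replaced "ADMIN SHUTDOWN" with an "NHC: ..." note, the next healthy run was free to clear the note and resume the node.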

I read through the code, and something like this seemingly isn't anticipated. Is this considered an anomaly that shouldn't happen?

mej commented 1 year ago

I've been looking at this, and something very odd is going on here.

Before I get into that, though, I would highly recommend that you upgrade NHC, at least to the 1.4.3 release. The 1.4.2 release lacks numerous improvements, enhancements, and fixes; the 1.4.3 release came out just over a year ago and has been used reliably at numerous large centers, including LANL in particular. If you're brave (or have a testbed to play with), I'd encourage you to at least try the most recent code or the primary development branch for the upcoming 1.5 release. (The former is just an unmerged branch off the latter; they're almost identical, except that the unmerged branch contains a fix -- I hope -- for #104.) I obviously can't guarantee upgrading will fix the issue you're seeing, but I can tell you it has a lot more robust code for handling Slurm and its various node states.
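For anyone following along, grabbing a newer build is usually straightforward; the release URL and asset name below are assumptions modeled on the 1.4.2 package name in the report, so check the project's releases page for the real ones:

# One way to try 1.4.3 on an RPM-based system like the CentOS 7 node above
# (release URL and asset name are assumptions; verify against the releases page):
yum install https://github.com/mej/nhc/releases/download/1.4.3/lbnl-nhc-1.4.3-1.el7.noarch.rpm

# To experiment with the newer development code on a testbed instead:
git clone https://github.com/mej/nhc.git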

As for the situation above... Was anything left out of the log, or is that a complete, unedited excerpt? I ask because the 7th line shows /usr/libexec/nhc/node-mark-offline being called with the hostname of wn-1-30.local, but the next line shows a hostname of just wn-1-30. It's possible I'm just missing something, but I don't see any code in that script that could've done that.

In fact, I make it a point to try to do as little hostname manipulation as possible; it's very easy to wind up with a decidedly confused Slurm and/or NHC! 😁 I see other lines with that same mismatch, so I'm not sure what's going on. And it may not be significant anyway...just something that stood out.

I also noticed that the 8th line has another anomaly: Notice the extra space between "Marking" and the hostname? That message comes from node-mark-offline:82, so somehow the $STATUS variable is now blank -- despite the fact that the same $STATUS variable had a value just before that. It had to; line 82 would never have been reached unless $STATUS matched one of these node states (line 66). But as far as I can tell from the code itself, that should be impossible.
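Purely to make the contradiction concrete, here is a hypothetical sketch of the shape of that flow; it is not the actual node-mark-offline source, and the state patterns are illustrative only:

# Hypothetical sketch only -- not the real node-mark-offline code.
case "$STATUS" in
    alloc*|comp*|drain*|idle*|mix*)
        # $STATUS matched one of the expected node states here, so it had a value...
        echo "$0:  Marking $STATUS $HOSTNAME offline:  $NOTE"
        scontrol update NodeName="$HOSTNAME" State=DRAIN Reason="$NOTE"
        ;;
esac
# ...yet the doubled space in the logged message implies $STATUS was empty by the
# time the echo ran, which is exactly the contradiction described above.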

So you see why I'm so confused! 😄

If you're able to shed any light on the above, that might help me figure out what transpired. Or, if you're willing, try one of the upgrade recommendations above. Barring that (and assuming you can easily reproduce the above behavior!), if you could change node-mark-offline and node-mark-online so that they each have #!/bin/bash -x as their first line, that might help clear up some of the mysterious behavior I noted above. Running nhc -x might also help.
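A minimal way to apply those suggestions, assuming the helper paths shown in the log above (the sed one-liners are just one way to flip the shebang), would be:

# Enable Bash trace mode in the two helper scripts (paths taken from the log above):
sed -i '1s|^#!/bin/bash.*|#!/bin/bash -x|' /usr/libexec/nhc/node-mark-offline
sed -i '1s|^#!/bin/bash.*|#!/bin/bash -x|' /usr/libexec/nhc/node-mark-online

# Then, the next time the node misbehaves, run NHC itself with trace output as well:
nhc -x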

Good luck; I hope this helps!

indermuehle-unibe commented 1 year ago

Hi @mej

Thanks for the reply. Yes, this is indeed an unedited excerpt taken directly from /var/log/nhc.log.

I can't explain the hostname mismatch either. I just looked through the code, and HOSTNAME_S is seemingly almost never used, and our nhc.conf isn't overriding it in any way either.

The issue is that this has been a one-time occurrence so far. I can't really reproduce it, because with regular NHC errors it doesn't overwrite the note; see here:

20230502 14:45:27 [slurm] /usr/libexec/nhc/node-mark-online wn-1-30.local
/usr/libexec/nhc/node-mark-online:  Not onlining wn-1-30:  Down for non-NHC reason (ADMIN SHUTDOWN)
ERROR:  nhc:  Health check failed:  check_fs_mount:  /grid/lustre not mounted
20230502 14:47:40 /usr/libexec/nhc/node-mark-offline wn-1-30.local check_fs_mount:  /grid/lustre not mounted
/usr/libexec/nhc/node-mark-offline:  Marking drain wn-1-30 offline:  ADMIN SHUTDOWN
20230502 14:47:47 [slurm] /usr/libexec/nhc/node-mark-online wn-1-30.local
/usr/libexec/nhc/node-mark-online:  Not onlining wn-1-30:  Down for non-NHC reason (ADMIN SHUTDOWN)

But even here, there's the hostname mismatch issue you mentioned.

Overall I mostly just wanted to ask here whether this is something that has happened before, and, if it hasn't, whether it's actually a bug in the script itself or a weird oddity in our setup. I'm happy with your advice about just upgrading NHC.

mej commented 1 year ago

The issue is that this has been a one-time occurrence so far. I can't really reproduce it, because with regular NHC errors it doesn't overwrite the note; see here:

20230502 14:45:27 [slurm] /usr/libexec/nhc/node-mark-online wn-1-30.local
/usr/libexec/nhc/node-mark-online:  Not onlining wn-1-30:  Down for non-NHC reason (ADMIN SHUTDOWN)
ERROR:  nhc:  Health check failed:  check_fs_mount:  /grid/lustre not mounted
20230502 14:47:40 /usr/libexec/nhc/node-mark-offline wn-1-30.local check_fs_mount:  /grid/lustre not mounted
/usr/libexec/nhc/node-mark-offline:  Marking drain wn-1-30 offline:  ADMIN SHUTDOWN
20230502 14:47:47 [slurm] /usr/libexec/nhc/node-mark-online wn-1-30.local
/usr/libexec/nhc/node-mark-online:  Not onlining wn-1-30:  Down for non-NHC reason (ADMIN SHUTDOWN)

Does it happen reliably when the watchdog timer times out? If so, reducing the timeout or increasing the checks' runtime should be able to trigger it.
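One way to attempt that deliberately, if you want to exercise the watchdog path, might be to drop a check that simply hangs into the directory NHC loads extra checks from and run with a short timeout; the file name, check name, and hostname pattern below are made up for illustration:

# /etc/nhc/scripts/zz_hangtest.nhc  (hypothetical file and check name)
function check_hang_forever() {
    sleep 600    # hang far longer than any sane watchdog timeout
}

# /etc/nhc/nhc.conf -- run the hanging check only on the node under test (pattern assumed)
wn-1-30* || check_hang_forever

# Invoke NHC with a deliberately short watchdog timeout (-t sets the timeout in seconds):
nhc -t 5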

Barring that, the good news is that whatever alteration is going on, it appears to be contained within the node-mark-offline helper. That potentially limits the observational scope of the troubleshooting process.

Unless we can find a scenario that reproduces the problem reliably, chances are the only way to track this thing down is going to be adding the -x (Bash trace mode) flag to node-mark-offline, as I mentioned above. That can generate a whole lot of output to the log file, which in turn can create its own delightful smorgasbord of wacky mishaps....

But even here, there's the issue that you mention of the hostname mismatch.

Yes, very strange.

Overall I mostly just wanted to ask here whether this is something that has happened before, and, if it hasn't, whether it's actually a bug in the script itself or a weird oddity in our setup.

I have never observed this behavior myself, nor has anyone else reported it to me, to the best of my recollection. As far as what is to blame, I'm really not sure. In theory, the node-mark-offline helper script is responsible for making sure existing node notes are not overwritten, so I would tend to place the blame there. But as I said yesterday, it doesn't seem to jibe with the code....

I'm happy with your advice about just upgrading NHC.

I think it's worth a try. For one thing, I just rewrote a major chunk of the watchdog timer code, so troubleshooting the old code would be of less value in the long term than testing out the newer, more robust (hopefully) code. I know it can be a lot to ask -- replacing stable, well-seasoned code with newer, less tested code -- so if that's not an option, I totally understand. 😃 And I'm not one of those "run the latest code or I can't help you" type of people, so don't worry about that!

Additionally, I already have an open Issue, #129, for figuring out how to improve the hostname situation. One of the most common problems I do hear about is handling mismatched hostnames -- where the canonical hostname in the kernel UTS namespace (i.e., /proc/sys/kernel/hostname) differs from its Nodename value in slurm.conf -- in a better, simpler way. It's not clear to me that you have that issue, but even if you don't, I think more direct and explicit handling of the node hostname might just help with your situation as well.
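If it helps with that, a quick comparison of the hostname views involved (node name taken from the logs above; the grep is just one way to pull the field out) can be done with standard tools:

# Compare the kernel's idea of the hostname with what Slurm has configured:
cat /proc/sys/kernel/hostname                                # kernel UTS hostname, e.g. wn-1-30.local
hostname -s                                                  # short form, e.g. wn-1-30
scontrol show node wn-1-30 | grep -o 'NodeHostName=[^ ]*'    # the NodeHostName Slurm maps the node to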