Closed jpecar closed 1 year ago
I also see the timeout staying at 30s even when my nhc.conf clearly has TIMEOUT=60
Mar 29 07:40:05 rtx-04 slurmd[1157807]: slurmd: debug: attempting to run health_check [/usr/local/bin/node_monitor]
Mar 29 07:40:05 rtx-04 systemd[1]: Starting system activity accounting tool...
Mar 29 07:40:05 rtx-04 systemd[1]: sysstat-collect.service: Succeeded.
Mar 29 07:40:05 rtx-04 systemd[1]: Started system activity accounting tool.
Mar 29 07:40:16 rtx-04 xinetd[3080]: START: nrpe pid=41501 from=::ffff:172.21.21.45
Mar 29 07:40:16 rtx-04 sudo[41505]: nrpe : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/bin/ipmitool -I open sdr elist
Mar 29 07:40:18 rtx-04 xinetd[3080]: EXIT: nrpe status=0 pid=41501 duration=2(sec)
Mar 29 07:40:35 rtx-04 nhc[41547]: Health check failed: Script timed out while executing "check_ps_service -u root -S sshd".
This is nhc 1.4.2
Try setting this in /etc/sysconfig/nhc. I've never gotten it to work from nhc.conf, but it works for me from /etc/sysconfig/nhc.
griznog
Hi Paul!
I also see the timeout staying at 30s even when my nhc.conf clearly has TIMEOUT=60
@griznog is correct. I tried to explain it in the documentation, but it's easy to miss. :)
There are certain variables, of which TIMEOUT is one, whose values get used by NHC prior to the execution of the instructions in nhc.conf. In order to alter the values of such variables, the assignment must occur in one of 3 places:
1. Settings in /etc/sysconfig/nhc are loaded very early in the execution process, so you can set TIMEOUT here. One word of caution, though: this file affects all contexts of NHC, not just the default one. (If you don't use separate NHC contexts, you can ignore this part.)
2. Appending TIMEOUT=60 to the end of the nhc invocation (e.g., nhc -a TIMEOUT=60) will work too.
3. For TIMEOUT in particular, there is a corresponding command line argument for setting this value, so you also have the option of appending -t 60 to your launch command.

Any of these 3 choices will allow you to set your desired 60-second timeout.
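To illustrate why the ordering matters, here is a minimal sketch in plain shell. It is not NHC's actual code: the function run_sketch, the variable armed_timeout, and the temporary files are all hypothetical, and the 30 s default is simply the value observed in this thread. The sketch mimics a script that reads an early settings file (standing in for /etc/sysconfig/nhc), captures TIMEOUT for its watchdog, and only afterwards reads the main config (standing in for nhc.conf).

```bash
# Hypothetical sketch, NOT NHC source code.
# The function body runs in a subshell, so nothing leaks into the caller.
run_sketch() (
    early_file=$1   # stands in for /etc/sysconfig/nhc
    late_file=$2    # stands in for nhc.conf
    TIMEOUT=30      # assumed built-in default (the value seen in the logs)

    # Step 1: early settings are loaded before anything uses TIMEOUT.
    if [ -r "$early_file" ]; then . "$early_file"; fi

    # Step 2: the watchdog captures whatever TIMEOUT holds right now.
    armed_timeout=$TIMEOUT

    # Step 3: the main config is read afterwards; a TIMEOUT assignment
    # here changes the variable but not the already-captured value.
    if [ -r "$late_file" ]; then . "$late_file"; fi

    echo "$armed_timeout"
)

tmp=$(mktemp -d)
printf 'TIMEOUT=60\n' > "$tmp/set60"
: > "$tmp/empty"

run_sketch "$tmp/empty" "$tmp/set60"   # prints 30: set too late
run_sketch "$tmp/set60" "$tmp/empty"   # prints 60: set early enough
rm -rf "$tmp"
```

In this sketch, only the assignment read before step 2 reaches the watchdog, which matches the behavior described above for /etc/sysconfig/nhc versus nhc.conf.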
Thanks. I will change to set -t 60 in my nhc calls.
Based on testing and feedback, #121 has addressed this issue sufficiently to warrant its closure; however, if your own testing or deployment experience(s) differ, please do reopen this one, or open a new one, at your discretion! 😃
Hi, we recently deployed some 2x 64c EPYC servers with all 256 threads enabled. I was surprised to discover that nhc always times out on these machines. After some poking around, I found that the /proc/cpuinfo parsing in nhc_hw_gather_data() takes 31.5s to finish. I tested it on different machines: it takes 0.6s on a 16c/32t node, 9.5s on a 64c/128t node and, as I said, 31.5s on a 128c/256t node. Funnily enough, the watchdog always kicks in at exactly 30s, no matter what I set TIMEOUT to. That's another thing I have to look into. But for now, are there any pure bash options to speed up that loop? For now I violated the pure bash approach of nhc and replaced that whole loop with simple