mej / nhc

LBNL Node Health Check

nhc_hw_gather_data() too slow on large core count #118

Closed: jpecar closed this issue 1 year ago

jpecar commented 1 year ago

Hi, we recently deployed some 2x 64-core EPYC servers with all 256 threads enabled. I was surprised to discover that nhc always times out on these machines. With some poking around I found that the /proc/cpuinfo parsing in nhc_hw_gather_data() takes 31.5s to finish. I tested it on different machines: it takes 0.6s on a 16c/32t node, 9.5s on a 64c/128t node, and, as I said, 31.5s on a 128c/256t node. Funnily enough, the watchdog always kicks in at exactly 30s, no matter what I set TIMEOUT to; that's another thing I have to look into. But for now, are there any pure-bash options to speed up that loop? As a stopgap I violated the pure-bash approach of nhc and replaced the whole loop with this simple alternative:

# lscpu -be columns: 1=CPU (logical), 3=SOCKET, 4=CORE; grep -v CPU drops the header row
HW_SOCKETS=$(lscpu -be | grep -v CPU | awk '{ print $3 }' | sort | uniq | wc -l)
HW_CORES=$(lscpu -be | grep -v CPU | awk '{ print $4 }' | sort | uniq | wc -l)
HW_THREADS=$(lscpu -be | grep -v CPU | awk '{ print $1 }' | sort | uniq | wc -l)
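
For anyone who wants to stay pure-bash, here is a minimal single-pass sketch of my own (not the nhc_hw_gather_data() code), assuming /proc/cpuinfo exposes the usual "processor", "physical id", and "core id" fields and that bash 4+ is available for associative arrays:

# Count threads, sockets, and physical cores in one pass over /proc/cpuinfo.
declare -A SEEN_SOCKET SEEN_CORE
HW_THREADS=0
SOCK_ID=""
while IFS=: read -r KEY VAL ; do
    KEY="${KEY%"${KEY##*[![:space:]]}"}"   # strip trailing whitespace (tabs) from the field name
    VAL="${VAL# }"                         # strip the leading space from the value
    case "$KEY" in
        processor)      HW_THREADS=$((HW_THREADS+1)) ;;
        "physical id")  SOCK_ID="$VAL" ; SEEN_SOCKET["$VAL"]=1 ;;
        "core id")      SEEN_CORE["$SOCK_ID:$VAL"]=1 ;;   # core ids repeat across sockets, so key on socket:core
    esac
done < /proc/cpuinfo
HW_SOCKETS=${#SEEN_SOCKET[@]}
HW_CORES=${#SEEN_CORE[@]}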
paulraines68 commented 1 year ago

I also see the timeout staying at 30s even when my nhc.conf clearly has TIMEOUT=60

Mar 29 07:40:05 rtx-04 slurmd[1157807]: slurmd: debug: attempting to run health_check [/usr/local/bin/node_monitor]
Mar 29 07:40:05 rtx-04 systemd[1]: Starting system activity accounting tool...
Mar 29 07:40:05 rtx-04 systemd[1]: sysstat-collect.service: Succeeded.
Mar 29 07:40:05 rtx-04 systemd[1]: Started system activity accounting tool.
Mar 29 07:40:16 rtx-04 xinetd[3080]: START: nrpe pid=41501 from=::ffff:172.21.21.45
Mar 29 07:40:16 rtx-04 sudo[41505]: nrpe : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/bin/ipmitool -I open sdr elist
Mar 29 07:40:18 rtx-04 xinetd[3080]: EXIT: nrpe status=0 pid=41501 duration=2(sec)
Mar 29 07:40:35 rtx-04 nhc[41547]: Health check failed: Script timed out while executing "check_ps_service -u root -S sshd".

This is nhc 1.4.2

griznog commented 1 year ago

Try setting TIMEOUT in /etc/sysconfig/nhc; I've never gotten it to work from nhc.conf, but it works for me from /etc/sysconfig/nhc.
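
For example, a minimal /etc/sysconfig/nhc would just be the assignment below (this assumes the stock packaging, where the nhc driver sources /etc/sysconfig/nhc as a shell fragment before it reads nhc.conf, which is why the setting takes effect here but not in nhc.conf):

# /etc/sysconfig/nhc -- sourced by the nhc driver script at startup,
# before nhc.conf is read, so pre-config variables like TIMEOUT stick.
TIMEOUT=60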


mej commented 1 year ago

Hi Paul!

I also see the timeout staying at 30s even when my nhc.conf clearly has TIMEOUT=60

@griznog is correct. I tried to explain it in the documentation, but it's easy to miss. :)

There are certain variables, of which TIMEOUT is one, whose values get used by NHC prior to the execution of the instructions in nhc.conf. In order to alter the values of such variables, the assignment must occur in one of 3 places:

1. on the nhc command line (e.g., the -t option for the timeout);
2. in the environment passed to nhc by whatever invokes it; or
3. in /etc/sysconfig/nhc, which is sourced before nhc.conf is read.

Any of these 3 choices will allow you to set your desired 60-second timeout.
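
Concretely, the first two look like the sketch below (the -t flag is the documented timeout option; the environment form assumes your health-check wrapper script exports the variable before calling nhc); the third is the /etc/sysconfig/nhc assignment griznog showed above:

# 1. On the nhc command line, e.g. in the script slurmd runs as its HealthCheckProgram:
nhc -t 60

# 2. In the environment of whatever launches nhc:
export TIMEOUT=60
nhc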

paulraines68 commented 1 year ago

Thanks. I will switch to setting -t 60 in my nhc calls.

mej commented 1 year ago

Based on testing and feedback, #121 has addressed this issue sufficiently to warrant its closure; however, if your own testing or deployment experience(s) differ, please do reopen this one, or a new one, at your discretion! 😃