mej / nhc

LBNL Node Health Check

nhc_hw_gather_data() too slow on large core count #118

Closed: jpecar closed this issue 1 year ago

jpecar commented 1 year ago

Hi, we recently deployed some 2x 64-core EPYC servers with all 256 threads enabled. I was surprised to discover that nhc always times out on these machines. With some poking around I found that the /proc/cpuinfo parsing in nhc_hw_gather_data() takes 31.5s to finish. I tested it on different machines: it takes 0.6s on a 16c/32t node, 9.5s on a 64c/128t node, and, as I said, 31.5s on a 128c/256t node. Funnily enough, the watchdog always kicks in at exactly 30s, no matter what I set TIMEOUT to; that's another thing I have to look into. But for now, are there any pure-bash options to speed up that loop? As a stopgap I violated the pure-bash approach of nhc and replaced the whole loop with this simple alternative:

# lscpu -be columns: 1=CPU (logical), 3=SOCKET, 4=CORE; grep -v CPU drops the header row
HW_SOCKETS=$(lscpu -be | grep -v CPU | awk '{ print $3 }' | sort | uniq | wc -l)
HW_CORES=$(lscpu -be | grep -v CPU | awk '{ print $4 }' | sort | uniq | wc -l)
HW_THREADS=$(lscpu -be | grep -v CPU | awk '{ print $1 }' | sort | uniq | wc -l)
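
For anyone who wants to stay pure-bash, here is a minimal single-pass sketch of my own (not the nhc_hw_gather_data() code), assuming /proc/cpuinfo exposes the usual "processor", "physical id", and "core id" fields and that bash 4+ is available for associative arrays:

# Count threads, sockets, and physical cores in one pass over /proc/cpuinfo.
declare -A SEEN_SOCKET SEEN_CORE
HW_THREADS=0
SOCK_ID=""
while IFS=: read -r KEY VAL ; do
    KEY="${KEY%"${KEY##*[![:space:]]}"}"   # strip trailing whitespace (tabs) from the field name
    VAL="${VAL# }"                         # strip the leading space from the value
    case "$KEY" in
        processor)      HW_THREADS=$((HW_THREADS+1)) ;;
        "physical id")  SOCK_ID="$VAL" ; SEEN_SOCKET["$VAL"]=1 ;;
        "core id")      SEEN_CORE["$SOCK_ID:$VAL"]=1 ;;   # core ids repeat across sockets, so key on socket:core
    esac
done < /proc/cpuinfo
HW_SOCKETS=${#SEEN_SOCKET[@]}
HW_CORES=${#SEEN_CORE[@]}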
paulraines68 commented 1 year ago

I also see the timeout staying at 30s even when my nhc.conf clearly has TIMEOUT=60

Mar 29 07:40:05 rtx-04 slurmd[1157807]: slurmd: debug: attempting to run health_check [/usr/local/bin/node_monitor]
Mar 29 07:40:05 rtx-04 systemd[1]: Starting system activity accounting tool...
Mar 29 07:40:05 rtx-04 systemd[1]: sysstat-collect.service: Succeeded.
Mar 29 07:40:05 rtx-04 systemd[1]: Started system activity accounting tool.
Mar 29 07:40:16 rtx-04 xinetd[3080]: START: nrpe pid=41501 from=::ffff:172.21.21.45
Mar 29 07:40:16 rtx-04 sudo[41505]: nrpe : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/bin/ipmitool -I open sdr elist
Mar 29 07:40:18 rtx-04 xinetd[3080]: EXIT: nrpe status=0 pid=41501 duration=2(sec)
Mar 29 07:40:35 rtx-04 nhc[41547]: Health check failed: Script timed out while executing "check_ps_service -u root -S sshd".

This is nhc 1.4.2

griznog commented 1 year ago

Try setting TIMEOUT in /etc/sysconfig/nhc; I've never gotten it to work from nhc.conf, but it works for me from /etc/sysconfig/nhc.
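
For example, a minimal /etc/sysconfig/nhc would just be the assignment below (this assumes the stock packaging, where the nhc driver sources /etc/sysconfig/nhc as a shell fragment before it reads nhc.conf, which is why the setting takes effect here but not in nhc.conf):

# /etc/sysconfig/nhc -- sourced by the nhc driver script at startup,
# before nhc.conf is read, so pre-config variables like TIMEOUT stick.
TIMEOUT=60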


mej commented 1 year ago

Hi Paul!

I also see the timeout staying at 30s even when my nhc.conf clearly has TIMEOUT=60

@griznog is correct. I tried to explain it in the documentation, but it's easy to miss. :)

There are certain variables, of which TIMEOUT is one, whose values get used by NHC prior to the execution of the instructions in nhc.conf. In order to alter the values of such variables, the assignment must occur in one of 3 places:

1. on the nhc command line (e.g., the -t option for the timeout);
2. in the environment passed to nhc by whatever invokes it; or
3. in /etc/sysconfig/nhc, which is sourced before nhc.conf is read.

Any of these 3 choices will allow you to set your desired 60-second timeout.
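
Concretely, the first two look like the sketch below (the -t flag is the documented timeout option; the environment form assumes your health-check wrapper script exports the variable before calling nhc); the third is the /etc/sysconfig/nhc assignment griznog showed above:

# 1. On the nhc command line, e.g. in the script slurmd runs as its HealthCheckProgram:
nhc -t 60

# 2. In the environment of whatever launches nhc:
export TIMEOUT=60
nhc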

paulraines68 commented 1 year ago

Thanks. I will switch to setting -t 60 in my nhc calls.

mej commented 1 year ago

Based on testing and feedback, #121 has addressed this issue sufficiently to warrant its closure; however, if your own testing or deployment experience(s) differ, please do reopen this one, or a new one, at your discretion! 😃