mej / nhc

LBNL Node Health Check

scripts/lbnl_hw.nhc: nhc_hw_gather_data is slow at parsing /proc/cpuinfo #149

Open sdiak opened 3 months ago

sdiak commented 3 months ago

Hello,

On a cluster that I installed for a client, the function nhc_hw_gather_data() spends a lot of time parsing /proc/cpuinfo:

[root@<redacted> ~]# time nhc

real    0m9.824s
user    0m1.532s
sys     0m8.232s

With the proposed patch, the run time is reduced by a factor of more than 5:

[root@<redacted>~]# time nhc

real    0m1.692s
user    0m1.563s
sys     0m0.105s

The patch works by caching the content of /proc/cpuinfo in a local variable.
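To illustrate the caching idea, here is a minimal shell sketch. It is not the attached patch and not the actual NHC code: the function name `count_cpus_cached` and its optional path argument are hypothetical, introduced only so the example can be run against a test file. The file is read once into a variable, and the line-by-line parsing then runs against that in-memory copy instead of the file itself.

```shell
# Hypothetical helper (not part of NHC): count the "processor" entries in a
# cpuinfo-style file, reading the file once and parsing the cached copy.
count_cpus_cached() {
    # The path argument is an assumption added for testing; default is /proc/cpuinfo.
    local cpuinfo_file="${1:-/proc/cpuinfo}"
    local cpuinfo key value
    local processors=0

    # Read the file once; everything below works on the variable.
    cpuinfo=$(cat "$cpuinfo_file")

    while IFS=':' read -r key value; do
        # Strip trailing whitespace (the tab before the colon) from the key.
        key="${key%"${key##*[![:space:]]}"}"
        case "$key" in
            processor) processors=$((processors + 1)) ;;
        esac
    done <<EOF
$cpuinfo
EOF

    echo "$processors"
}
```

Calling `count_cpus_cached` with no argument parses /proc/cpuinfo on the local node; passing a path lets the same logic be checked against a saved copy of the file.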

Information:

0001-Makes-proc-cpuinfo-parsing-faster-by-caching-the-con.patch

OleHolmNielsen commented 3 months ago

We have similar nodes, and NHC is somewhat faster:

time nhc

real    0m3.018s
user    0m0.253s
sys     0m2.084s

OS: Rocky Linux 8.9 (Green Obsidian)
Kernel: 4.18.0-513.18.1.el8_9.x86_64
CPU: Dual-socket AMD EPYC 9474F 48-Core Processor

sdiak commented 3 months ago

This is running on an idle node where the CPUs are set to powersave.

sdiak commented 3 months ago

This is a problem at this particular site because the CPU throttling that occurs when a power supply is lost makes Slurm drain the node: