lausser / check_nwc_health

nwc = network component. This plugin checks many aspects of routers, switches, WLAN controllers, firewalls, etc.
http://labs.consol.de/nagios/check_nwc_health
GNU General Public License v2.0

mode "cpu-load" for Linux hosts : thresholds modification for load averages not behaving as expected? #252

Open Zitun opened 3 years ago

Zitun commented 3 years ago

Hi Gerhard,

First of all, thank you for your great work on this check_nwc_health plugin!

I was wondering whether you could find some time to let me know your thoughts on the following:

I might be missing something, but in mode "cpu-load" on Linux hosts, modifying the thresholds for the load averages does not give me the expected behaviour:

1) If I set the load average thresholds to 0, it doesn't disable alerting for those load averages:

./check_nwc_health --hostname 127.0.0.1 --community myCommunity --mode cpu-load --warning 95 --critical 99 --warningx load-1=0 --criticalx load-1=0 --warningx load-5=0 --criticalx load-5=0 --warningx load-15=0 --criticalx load-15=0

CRITICAL - load-1 is 21.67 (1 min Load Average too high (= 21.67)), load-5 is 24.43 (5 min Load Average too high (= 24.43)), load-15 is 23.53 (15 min Load Average too high (= 23.53)), cpu (total): 71.20%, user: 61.25%, system: 9.85%, nice: 0.00%, wait: 0.10%, kernel: 0.00%, interrupt: 0.00% | 'cpu_usage'=71.20%;95;99;0;100 'user_usage'=61.25%;95;99;0;100 'system_usage'=9.85%;95;99;0;100 'nice_usage'=0%;95;99;0;100 'wait_usage'=0.10%;95;99;0;100 'kernel_usage'=0%;95;99;0;100 'interrupt_usage'=0%;95;99;0;100 'load-1'=21.67;0;0;; 'load-5'=24.43;0;0;; 'load-15'=23.53;0;0;;

2) If I set the load average thresholds to a value greater than 12 (40, for example), any load average greater than 12 still triggers an alert, even though it is lower than the configured threshold of 40:

./check_nwc_health --hostname 127.0.0.1 --community myCommunity --mode cpu-load --warning 95 --critical 99 --warningx load-1=40 --criticalx load-1=40 --warningx load-5=40 --criticalx load-5=40 --warningx load-15=40 --criticalx load-15=40

CRITICAL - load-1 is 27.41 (1 min Load Average too high (= 27.41)), load-5 is 24.75 (5 min Load Average too high (= 24.75)), load-15 is 23.89 (15 min Load Average too high (= 23.89)), cpu (total): 71.53%, user: 61.87%, system: 9.60%, nice: 0.00%, wait: 0.05%, kernel: 0.00%, interrupt: 0.00% | 'cpu_usage'=71.53%;95;99;0;100 'user_usage'=61.87%;95;99;0;100 'system_usage'=9.60%;95;99;0;100 'nice_usage'=0%;95;99;0;100 'wait_usage'=0.05%;95;99;0;100 'kernel_usage'=0%;95;99;0;100 'interrupt_usage'=0%;95;99;0;100 'load-1'=27.41;40;40;; 'load-5'=24.75;40;40;; 'load-15'=23.89;40;40;;

=> It looks like there is a kind of hardcoded upper limit for the load average thresholds, which is 12 in my case.

Note: this Linux host VM is running with 6 CPU cores, so maybe there is a correlation, such as a hardcoded load average limit of twice the number of cores?

3) However, if I set the load average thresholds to a value greater than 0 but lower than 12 (the "hardcoded limit" mentioned in point 2), the configured thresholds are enforced normally (i.e., a load average lower than 12 but greater than the configured threshold correctly triggers an alert).
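
To make the three observations above easier to compare, here is a minimal Python sketch that models them. It is not taken from the plugin's source. The function names, the plain "value > threshold" comparison, and the cap of 12 are assumptions for illustration only; 12 is simply the limit observed on this 6-core host. Under that plain comparison, a threshold of 0 would alert on any positive load rather than disable the check, which would match point 1.

```python
import re

# Limit observed by the reporter on a 6-core VM; an assumption,
# not a value confirmed from the plugin or the SNMP agent.
OBSERVED_CAP = 12.0


def parse_load_perfdata(perfdata):
    """Return {label: (value, warn, crit)} for the 'load-N' perfdata entries."""
    pattern = re.compile(r"'(load-\d+)'=([\d.]+);([\d.]+);([\d.]+);")
    return {
        m.group(1): (float(m.group(2)), float(m.group(3)), float(m.group(4)))
        for m in pattern.finditer(perfdata)
    }


def expected_alert(value, threshold):
    """Behaviour one would expect: alert only above the configured threshold."""
    return value > threshold


def observed_alert(value, threshold, cap=OBSERVED_CAP):
    """Behaviour described in this issue: an extra, fixed cap also alerts."""
    return value > threshold or value > cap


# Load entries copied from the output of the second command (thresholds = 40).
perfdata = "'load-1'=27.41;40;40;; 'load-5'=24.75;40;40;; 'load-15'=23.89;40;40;;"

for label, (value, warn, crit) in parse_load_perfdata(perfdata).items():
    print(label, value, "expected:", expected_alert(value, crit),
          "observed:", observed_alert(value, crit))
```

With thresholds of 40, the expected result is "no alert" for all three load averages, while the modelled cap reproduces the CRITICAL state reported above. With a threshold between 0 and 12, both functions agree, which matches point 3.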

Thank you in advance for your feedback on this.

Regards, Olivier