naemon / naemon-core

Networks, Applications and Event Monitor
http://www.naemon.io/
GNU General Public License v2.0

Hosts checked once and never after with module gearman #154

Closed jframeau closed 5 years ago

jframeau commented 7 years ago

Context:

When gearman is asked to distribute host checks, each host check is executed once and never again.

Logs

[1477839880.245919] [016.0] [pid=8264] ** Running async check of host 'oxox'...
[1477839880.246028] [016.2] [pid=8264] Adjusting check attempt number for host 'oxox': current attempt=1/3, state=0, state type=1
[1477839880.246033] [016.2] [pid=8264] New check attempt number = 1
[1477839880.246075] [2320.2] [pid=8264] Raw Command Input: $USER1$/check_icmp -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
[1477839880.246082] [2320.2] [pid=8264] Expanded Command Output: $USER1$/check_icmp -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
[1477839880.249005] [016.0] [pid=8264] Check of host 'oxox' (id=1) was overridden by a module
[1477839881.261262] [016.2] [pid=8264] Processing check result for host 'oxox'
[1477839881.261296] [016.1] [pid=8264] ** Handling check result for host 'oxox' from 'Mod-Gearman Worker @ centos1'...
[1477839881.261302] [016.2] [pid=8264] Check Type: Active
[1477839881.261318] [016.2] [pid=8264] Check Options: 0
[1477839881.261322] [016.2] [pid=8264] Scheduled Check?: Yes
[1477839881.261325] [016.2] [pid=8264] Exited OK?: Yes
[1477839881.261328] [016.2] [pid=8264] Exec Time: 0.019
[1477839881.261337] [016.2] [pid=8264] Latency: 0.000
[1477839881.261341] [016.2] [pid=8264] Return Status: 0
[1477839881.261344] [016.2] [pid=8264] Output: OK - 192.168.1.1: rta 2.252ms, lost 0%|rta=2.252ms;3000.000;5000.000;0; pl=0%;80;100;; rtmax=2.467ms;;;; rtmin=2.073ms;;;;
[1477839881.261395] [016.2] [pid=8264] Parsing check output...
[1477839881.261401] [016.2] [pid=8264] Short Output: OK - 192.168.1.1: rta 2.252ms, lost 0%
[1477839881.261404] [016.2] [pid=8264] Long Output: NULL
[1477839881.261407] [016.2] [pid=8264] Perf Data: rta=2.252ms;3000.000;5000.000;0; pl=0%;80;100;; rtmax=2.467ms;;;; rtmin=2.073ms;;;;
[1477839881.261412] [016.2] [pid=8264] Adjusting check attempt number for host 'oxox': current attempt=1/3, state=0, state type=1
[1477839881.261416] [016.2] [pid=8264] New check attempt number = 1
[1477839881.261419] [016.1] [pid=8264] HOST: oxox, ATTEMPT=1/3, CHECK TYPE=ACTIVE, STATE TYPE=HARD, OLD STATE=0, NEW STATE=0
[1477839881.261423] [016.1] [pid=8264] Host was UP.
[1477839881.261426] [016.1] [pid=8264] Host is still UP.
[1477839881.261429] [016.1] [pid=8264] Pre-handle_host_state() Host: oxox, Attempt=1/3, Type=HARD, Final State=0 (UP)
[1477839881.261461] [016.2] [pid=8264] Raw host performance file output: DATATYPE::HOSTPERFDATA TIMET::$TIMET$ HOSTNAME::$HOSTNAME$ HOSTPERFDATA::$HOSTPERFDATA$ HOSTCHECKCOMMAND::$HOSTCHECKCOMMAND$ HOSTSTATE::$HOSTSTATE$ HOSTSTATETYPE::$HOSTSTATETYPE$
[1477839881.261499] [016.2] [pid=8264] Processed host performance data file output: DATATYPE::HOSTPERFDATA TIMET::1477839881 HOSTNAME::oxox HOSTPERFDATA::rta=2.252ms;3000.000;5000.000;0; pl=0%;80;100;; rtmax=2.467ms;;;; rtmin=2.073ms;;;; HOSTCHECKCOMMAND::check-host-alive HOSTSTATE::UP HOSTSTATETYPE::HARD
[1477839881.261525] [016.1] [pid=8264] Post-handle_host_state() Host: oxox, Attempt=1/3, Type=HARD, Final State=0 (UP)
[1477839881.261532] [016.1] [pid=8264] Checking host 'oxox' for flapping...
[1477839881.261536] [016.2] [pid=8264] LFT=5.00, HFT=20.00, CPC=0.00, PSC=0.00%
[1477839881.261543] [016.1] [pid=8264] Host is not flapping (0.00% state change).
[1477839881.261554] [016.1] [pid=8264] ** Async check result for host 'oxox' handled: new state=0

Then, one minute later (check_interval = 1):

[1477839940.247067] [016.0] [pid=8264] ** Running async check of host 'oxox'...
[1477839940.247072] [016.1] [pid=8264] A check of this host is already being executed, so we'll pass for the moment...

The problem affects only hosts; services are fine.

If I disable host check distribution, host checks are scheduled correctly.

Looking at the source, I think the problem is the is_executing attribute: it is never reset in the handle_async_host_check_result function, unlike in handle_async_service_check_result.

Have a look at line 402 of https://github.com/naemon/naemon-core/commit/9589e1be07ac7f420eb6d3c374e33a7adced5c49. The statement

if (queued_check_result->check_type == CHECK_TYPE_ACTIVE) temp_host->is_executing = FALSE;

has disappeared (perhaps for a good reason, but it seems to have a side effect).

jfr

nook24 commented 7 years ago

I can confirm this; I have the same issue in Naemon 1.0.5 with Mod_Gearman. As a workaround, I disabled host checks in module.conf.

sni commented 7 years ago

For now I have added a dirty workaround in the mod-gearman module: https://github.com/sni/mod_gearman/commit/d5b9698d341cd64a359f7d46738ed34bab6bfcc8 But this should be fixed in naemon-core, because the workaround forces us to do an unnecessary host lookup. Also, it works for services, so I see no reason why hosts and services should be handled differently here.

r-lindner commented 5 years ago

The workaround from 2016 (sni/mod_gearman@d5b9698) strangely does not work for me. The number of old host checks rises steadily until naemon is restarted. I hope your fix in naemon-core helps. Versions: naemon-core 1.0.8, libgearman7 0.33-7, mod-gearman-module 3.0.7.