Closed jframeau closed 5 years ago
I can confirm this; I have the same issue in Naemon 1.0.5 with Mod_Gearman. As a workaround, I disabled host checks in module.conf.
For now I have added a dirty workaround to the mod-gearman module: https://github.com/sni/mod_gearman/commit/d5b9698d341cd64a359f7d46738ed34bab6bfcc8 But this should be fixed in naemon-core, because the workaround forces us to do an unnecessary host lookup. Also, it works for services, so I see no reason why hosts and services should be handled differently here.
The workaround from 2016 (sni/mod_gearman@d5b9698) strangely does not work for me. The number of old host checks rises steadily until naemon is restarted. I hope your fix in naemon-core helps. Versions: naemon-core 1.0.8, libgearman7 0.33-7, mod-gearman-module 3.0.7.
Context:
When gearman is asked to distribute host checks, each host check is executed once and never again.
Logs
[1477839880.245919] [016.0] [pid=8264] * Running async check of host 'oxox'...
[1477839880.246028] [016.2] [pid=8264] Adjusting check attempt number for host 'oxox': current attempt=1/3, state=0, state type=1
[1477839880.246033] [016.2] [pid=8264] New check attempt number = 1
[1477839880.246075] [2320.2] [pid=8264] Raw Command Input: $USER1$/check_icmp -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
[1477839880.246082] [2320.2] [pid=8264] Expanded Command Output: $USER1$/check_icmp -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -p 5
[1477839880.249005] [016.0] [pid=8264] Check of host 'oxox' (id=1) was overridden by a module
[1477839881.261262] [016.2] [pid=8264] Processing check result for host 'oxox'
[1477839881.261296] [016.1] [pid=8264] * Handling check result for host 'oxox' from 'Mod-Gearman Worker @ centos1'...
[1477839881.261302] [016.2] [pid=8264] Check Type: Active
[1477839881.261318] [016.2] [pid=8264] Check Options: 0
[1477839881.261322] [016.2] [pid=8264] Scheduled Check?: Yes
[1477839881.261325] [016.2] [pid=8264] Exited OK?: Yes
[1477839881.261328] [016.2] [pid=8264] Exec Time: 0.019
[1477839881.261337] [016.2] [pid=8264] Latency: 0.000
[1477839881.261341] [016.2] [pid=8264] Return Status: 0
[1477839881.261344] [016.2] [pid=8264] Output: OK - 192.168.1.1: rta 2.252ms, lost 0%|rta=2.252ms;3000.000;5000.000;0; pl=0%;80;100;; rtmax=2.467ms;;;; rtmin=2.073ms;;;;
[1477839881.261395] [016.2] [pid=8264] Parsing check output...
[1477839881.261401] [016.2] [pid=8264] Short Output: OK - 192.168.1.1: rta 2.252ms, lost 0%
[1477839881.261404] [016.2] [pid=8264] Long Output: NULL
[1477839881.261407] [016.2] [pid=8264] Perf Data: rta=2.252ms;3000.000;5000.000;0; pl=0%;80;100;; rtmax=2.467ms;;;; rtmin=2.073ms;;;;
[1477839881.261412] [016.2] [pid=8264] Adjusting check attempt number for host 'oxox': current attempt=1/3, state=0, state type=1
[1477839881.261416] [016.2] [pid=8264] New check attempt number = 1
[1477839881.261419] [016.1] [pid=8264] HOST: oxox, ATTEMPT=1/3, CHECK TYPE=ACTIVE, STATE TYPE=HARD, OLD STATE=0, NEW STATE=0
[1477839881.261423] [016.1] [pid=8264] Host was UP.
[1477839881.261426] [016.1] [pid=8264] Host is still UP.
[1477839881.261429] [016.1] [pid=8264] Pre-handle_host_state() Host: oxox, Attempt=1/3, Type=HARD, Final State=0 (UP)
[1477839881.261461] [016.2] [pid=8264] Raw host performance file output: DATATYPE::HOSTPERFDATA TIMET::$TIMET$ HOSTNAME::$HOSTNAME$ HOSTPERFDATA::$HOSTPERFDATA$ HOSTCHECKCOMMAND::$HOSTCHECKCOMMAND$ HOSTSTATE::$HOSTSTATE$ HOSTSTATETYPE::$HOSTSTATETYPE$
[1477839881.261499] [016.2] [pid=8264] Processed host performance data file output: DATATYPE::HOSTPERFDATA TIMET::1477839881 HOSTNAME::oxox HOSTPERFDATA::rta=2.252ms;3000.000;5000.000;0; pl=0%;80;100;; rtmax=2.467ms;;;; rtmin=2.073ms;;;; HOSTCHECKCOMMAND::check-host-alive HOSTSTATE::UP HOSTSTATETYPE::HARD
[1477839881.261525] [016.1] [pid=8264] Post-handle_host_state() Host: oxox, Attempt=1/3, Type=HARD, Final State=0 (UP)
[1477839881.261532] [016.1] [pid=8264] Checking host 'oxox' for flapping...
[1477839881.261536] [016.2] [pid=8264] LFT=5.00, HFT=20.00, CPC=0.00, PSC=0.00%
[1477839881.261543] [016.1] [pid=8264] Host is not flapping (0.00% state change).
[1477839881.261554] [016.1] [pid=8264] \ Async check result for host 'oxox' handled: new state=0
Then, one minute later (check_interval = 1):
[1477839940.247067] [016.0] [pid=8264] \ Running async check of host 'oxox'...
[1477839940.247072] [016.1] [pid=8264] A check of this host is already being executed, so we'll pass for the moment...
This problem affects only hosts; services are fine.
If I disable host check distribution, checks are scheduled correctly.
Looking at the source, I think the problem is the is_executing attribute, which is never reset in the handle_async_host_check_result function, unlike in handle_async_service_check_result.
Have a look at line 402 of https://github.com/naemon/naemon-core/commit/9589e1be07ac7f420eb6d3c374e33a7adced5c49. The statement
if (queued_check_result->check_type == CHECK_TYPE_ACTIVE) temp_host->is_executing = FALSE;
has disappeared (perhaps for a good reason, but there seems to be a side effect).
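To illustrate the suspected mechanism, here is a minimal toy model in C. It is not naemon code: the struct and every toy_* name are hypothetical, and only the is_executing flag mirrors the attribute discussed above. It assumes the scheduler skips a host while the flag is set and that only the result handler can clear it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced stand-in for a host object (not naemon's struct). */
struct toy_host {
    const char *name;
    bool is_executing;   /* set while a check is believed to be in flight */
    int checks_started;  /* how many checks were actually dispatched */
};

/* Toy scheduler step: refuse to start a check while one is marked in flight. */
bool toy_run_async_check(struct toy_host *h)
{
    if (h->is_executing)
        return false;    /* corresponds to "already being executed, so we'll pass" */
    h->is_executing = true;
    h->checks_started++;
    return true;
}

/* Toy result handler: clear_flag models whether the handler resets is_executing. */
void toy_handle_result(struct toy_host *h, bool clear_flag)
{
    if (clear_flag)
        h->is_executing = false;
}

/* Run `rounds` schedule/result cycles; return how many checks really started. */
int toy_simulate(bool clear_flag, int rounds)
{
    struct toy_host h = { "oxox", false, 0 };
    assert(rounds >= 0);
    for (int i = 0; i < rounds; i++) {
        /* A result only arrives for a check that was actually dispatched. */
        if (toy_run_async_check(&h))
            toy_handle_result(&h, clear_flag);
    }
    return h.checks_started;
}
```

In this model, toy_simulate(true, 3) starts 3 checks, while toy_simulate(false, 3) starts only 1: once the flag is left set, every later scheduling attempt is skipped, which matches the behaviour seen in the logs.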
jfr