Closed by infraweavers 6 years ago
I've attached some snippets of the logs. You can see that at 2018-05-02 10:07:41 the worker has executed the check and puts the result on the queue:
[2018-05-02 10:07:41][17105][TRACE] add_job_to_queue(check_results, (null), 2, 1, 1, 1)
However, in the neb log we don't see any mention of localhost58 at any time after 2018-05-02 10:07:37.
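For anyone wanting to repeat this comparison on their own logs, here is a minimal Python sketch that finds the last time a host name is mentioned. It only assumes the `[timestamp][pid][LEVEL] message` layout seen in the worker line above. That the neb log uses the same layout is an assumption, and `parse_line` and `last_mention` are hypothetical helper names, not part of mod_gearman.

```python
import re
from datetime import datetime

# Layout taken from the worker log line quoted above:
# [YYYY-MM-DD HH:MM:SS][pid][LEVEL] message
LINE_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\[(\d+)\]\[(\w+)\]\s*(.*)$"
)


def parse_line(line):
    """Return (timestamp, pid, level, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    stamp, pid, level, message = m.groups()
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), int(pid), level, message


def last_mention(lines, needle):
    """Return the latest timestamp of a parsed line whose message contains needle."""
    latest = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None or needle not in parsed[3]:
            continue
        if latest is None or parsed[0] > latest:
            latest = parsed[0]
    return latest
```

Running `last_mention(open("neb-snippet.log"), "localhost58")` and comparing the result with the time the worker queued the result would show whether the neb module ever saw it.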
worker-snippet.log neb-snippet.log
Edit: I've also added the full neb log for 10:07:40 to 10:07:43 in case you can spot something related to the results queue that I can't see.
We've managed to track down what is happening.
If the dupserver connection fails or has an issue adding an item to the queue, it resets the client and creates a new one (https://github.com/sni/mod_gearman/blob/master/common/gearman_utils.c#L207) using the server_list from the dupserver client. However, create_client explicitly overrides current_client with the value it has created. So basically, after a failure on the dupserver, the dupserver becomes the server that receives all the active checks and the original server never receives results again.
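To make the failure mode concrete, here is a small Python model of the behaviour described above. It is not the mod_gearman C code. `Client`, `reset_dup_client_buggy`, `reset_dup_client_fixed`, `active_server_list` and the host names are hypothetical; only the names `create_client`, `current_client` and `server_list` come from the description. The "fixed" variant shows one possible remedy (save and restore the current client) and is not necessarily what the PR does.

```python
class Client:
    """Stand-in for a gearman client handle; it only remembers its server list."""

    def __init__(self, server_list):
        self.server_list = server_list


current_client = None  # module-level "active" client, as in the description above


def create_client(server_list):
    """Build a client and, as a side effect, make it the current client."""
    global current_client
    client = Client(server_list)
    current_client = client
    return client


def active_server_list():
    """Servers that results would be sent to via the current client."""
    return current_client.server_list


def reset_dup_client_buggy(dup_client):
    """Recreate the dupserver client; current_client is left pointing at it."""
    return create_client(dup_client.server_list)


def reset_dup_client_fixed(dup_client):
    """Recreate the dupserver client, then restore the previous current client."""
    global current_client
    saved = current_client
    new_dup = create_client(dup_client.server_list)
    current_client = saved
    return new_dup
```

In this model, after `reset_dup_client_buggy` runs, `active_server_list()` returns the dupserver's list, which mirrors the original server never seeing results again. With `reset_dup_client_fixed`, it keeps returning the original server's list.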
We'll submit a PR fixing this tomorrow @sni
Hello,
We have a problem similar to #111, but in a well-defined scenario: we're using mod-gearman to create a form of HA between our naemon servers. When we restart our 'slave' omd instance, the host checks on the master behave like #111. In the logs we see the host check execute, but we never see "host job completed" in the neb log for one of the hosts that is hanging. If we go into the GUI and force check the host, the problem goes away. So it looks like is_executing on the host is never getting set to FALSE.
We run OMD-Labs-2.70 using naemon core like:
We have 2 servers (omd1, omd2), configured using keepalived so that only 1 of the 2 has active checks configured at any point in time, like:
on omd1:
on omd2:
We have gearman-worker configured to duplicate the results so that both servers get an up-to-date view and perfdata (and therefore pnp4nagios graphs).
so on omd1:
and on omd2:
We've created 100 hosts which only ping localhost to test this, and basically every time we run omd restart on omd2, it hangs some or all of the checks on omd1.
Now if we set max-jobs=1 in worker.cfg, the problem appears to go away; however, we then aren't getting the best of gearman at all.
We have this reproduced in multiple production instances and on test VMs. If there is any more information to gather or tests to run, I'll be happy to do so, but I'm at a bit of a loss where to go from here to debug it. The results just vanish!