sni / mod_gearman

Distribute Naemon Host/Service Checks & Eventhandler with Gearman Queues. Host/Servicegroups affinity included.
http://www.mod-gearman.org
GNU General Public License v3.0
122 stars 42 forks source link

Checks may get executed duplicated #68

Closed haxtibal closed 9 years ago

haxtibal commented 9 years ago

One enqueued check may get executed duplicated in the worker processes, and therefore two results for one check are sent back to the result threads. This happens each time when max_jobs is reached.

I've written a test t/08-roundtrip.c that allows to reproduce this behavior. It fakes a minimum of nagios core functionallity, loads the neb module (similar to t/05-neb.c), initiates a burst of service checks and collects and verifies the check results it get's back from the workers/gearmand:

$ sudo service gearman-job-server restart
$ ./mod_gearman_worker --config=extras/shared.conf
$ ./08_roundtrip 

got module api version 4[2014-12-07 23:19:01][22269][ERROR] unknown option 'timeout_result'
neb module registered a function for NEBCALLBACK_PROCESS_DATA.
neb module registered a function for NEBCALLBACK_TIMED_EVENT_DATA.
logger: mod_gearman: initialized version 2.0.0b1 (libgearman 0.33)
run nebmodule_init() -> 0
unhandled registration for type 7
neb module registered function for NEBCALLBACK_SERVICE_CHECK_DATA.
unhandled registration for type 4
Sending 500 service checks.
Waiting for all results to arrive...
 received 0 from 500
 received 20 from 500
 received 100 from 500
 received 150 from 500
 received 200 from 500
 received 250 from 500
 received 300 from 500
 received 350 from 500
 received 400 from 500
 received 450 from 500
 received 500 from 500
 received 510 from 500
duplicated results (2) for check 49 detected. Timestamp 1 [1417990752082490683], Timestamp 2 [1417990742362054122]
duplicated results (2) for check 99 detected. Timestamp 1 [1417990752084661053], Timestamp 2 [1417990743171785093]
duplicated results (2) for check 149 detected. Timestamp 1 [1417990752087503005], Timestamp 2 [1417990744191285773]
duplicated results (2) for check 199 detected. Timestamp 1 [1417990752089390908], Timestamp 2 [1417990745205393124]
duplicated results (2) for check 249 detected. Timestamp 1 [1417990752091095125], Timestamp 2 [1417990746208219721]
duplicated results (2) for check 299 detected. Timestamp 1 [1417990752092855355], Timestamp 2 [1417990747198374777]
duplicated results (2) for check 349 detected. Timestamp 1 [1417990752094618000], Timestamp 2 [1417990748190704708]
duplicated results (2) for check 399 detected. Timestamp 1 [1417990752096534158], Timestamp 2 [1417990749215450809]
duplicated results (2) for check 449 detected. Timestamp 1 [1417990752098250960], Timestamp 2 [1417990750177434364]
duplicated results (2) for check 499 detected. Timestamp 1 [1417990752099997709], Timestamp 2 [1417990751177636357]
verifying 510 results done

I think the issue happens because gearman_worker_work is interrupted prematurely, before it can mark the recently executed job as complete at the server. The premature end is caused by calling clean_worker_exit in get_job, whenever max_jobs is reached. There's already a call to gearman_job_send_complete in clean_worker_exit, but that's apparently not enough. If exiting the worker process is moved out of the callback, and after gearman_worker_work, the issue does no longer occur with 08_roundtrip.

I'll send a pull request for both the 08_roundtrip test, and the fix that works for me.

sni commented 9 years ago

very nice, thanks. Let me dig into that :-)

haxtibal commented 9 years ago

You're welcome:) Just noted that 08_roundtrip get's executed during CI. This won't work at the moment, because the test depends on running gearman daemon + workers, and because it loops forever until it finds missing results. I initially worte the test to isolate and reproduce a bug where orphaned services occur occasionally. So I just would run the test the whole night, hoping to catch the bug. But it's not suited for CI this way, of course. If you think it's useful for CI, I could try to adapt it accordingly.

sni commented 9 years ago

yes, all tests are run automatically by travis. But you are right, since the problem is solved, i will just pull in the actual fix. Thanks again, i think this was the last big open issue i know of.

sni commented 9 years ago

thanks again, i just pushed your changes.