Closed haxtibal closed 9 years ago
very nice, thanks. Let me dig into that :-)
You're welcome :) I just noticed that 08_roundtrip gets executed during CI. This won't work at the moment, because the test depends on a running gearman daemon plus workers, and because it loops forever until it finds missing results. I initially wrote the test to isolate and reproduce a bug where orphaned services occur occasionally, so I would just run it the whole night, hoping to catch the bug. It's not suited for CI this way, of course. If you think it's useful for CI, I could try to adapt it accordingly.
Yes, all tests are run automatically by Travis. But you are right: since the problem is solved, I will just pull in the actual fix. Thanks again, I think this was the last big open issue I know of.
thanks again, i just pushed your changes.
A single enqueued check may be executed twice by the worker processes, so two results for one check are sent back to the result threads. This happens every time max_jobs is reached.
I've written a test, t/08-roundtrip.c, that reproduces this behavior. It fakes a minimum of nagios core functionality, loads the neb module (similar to t/05-neb.c), initiates a burst of service checks, then collects and verifies the check results it gets back from the workers/gearmand:
I think the issue happens because gearman_worker_work is interrupted prematurely, before it can mark the recently executed job as complete at the server. The premature end is caused by calling clean_worker_exit in get_job whenever max_jobs is reached. There's already a call to gearman_job_send_complete in clean_worker_exit, but that's apparently not enough. If exiting the worker process is moved out of the callback, to after gearman_worker_work returns, the issue no longer occurs with 08_roundtrip.
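The control-flow difference described above can be illustrated with a small simulation. This is only a sketch under simplifying assumptions, not the actual mod_gearman or libgearman code: `sim_worker`, `job_callback`, `sim_work` and `unacked_jobs` are hypothetical names, `sim_work` merely stands in for the library's work call, and "exiting the process" is modeled as an early return that skips the acknowledgement step. It also ignores the existing gearman_job_send_complete call in clean_worker_exit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a worker process; not a libgearman type. */
typedef struct {
    int  max_jobs;          /* worker stops after this many jobs */
    int  executed;          /* jobs whose check was actually run */
    int  acked;             /* jobs marked complete at the (simulated) server */
    bool exit_in_callback;  /* true models exiting from inside the callback */
    bool stop;              /* set once max_jobs has been reached */
} sim_worker;

/* Job callback: runs the check. Returns false if the process "exited"
 * inside the callback, so the caller never regains control. */
static bool job_callback(sim_worker *w) {
    w->executed++;
    if (w->executed >= w->max_jobs) {
        w->stop = true;
        if (w->exit_in_callback)
            return false;   /* models process exit inside the callback */
    }
    return true;
}

/* Stand-in for the work call: invoke the callback, then acknowledge. */
static void sim_work(sim_worker *w) {
    if (!job_callback(w))
        return;             /* process is gone: the job is never acknowledged */
    w->acked++;
}

/* Returns how many executed jobs were never acknowledged. */
int unacked_jobs(int max_jobs, bool exit_in_callback) {
    assert(max_jobs > 0);
    sim_worker w = { max_jobs, 0, 0, exit_in_callback, false };
    while (!w.stop)
        sim_work(&w);
    /* With exit_in_callback == false, the real worker would exit here,
     * after the work call has returned. */
    return w.executed - w.acked;
}
```

In this model, `unacked_jobs(3, true)` leaves one job unacknowledged each time the limit is hit, while `unacked_jobs(3, false)` leaves none. If the server hands an unacknowledged job to another worker, that would plausibly explain the duplicate result, though this is an inference from the description above rather than something the simulation proves.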
I'll send a pull request for both the 08_roundtrip test, and the fix that works for me.