sni / mod_gearman

Distribute Naemon Host/Service Checks & Eventhandler with Gearman Queues. Host/Servicegroups affinity included.
http://www.mod-gearman.org
GNU General Public License v3.0

Freshness check stuck due to is_being_freshened not being reset when using mod_gearman #36

Closed. buraks78-sn closed this issue 11 years ago

buraks78-sn commented 11 years ago

Icinga does not seem to execute subsequent freshness checks after the initial one is executed. This happens only when mod_gearman is used as broker. I have submitted a bug for Icinga but was pretty much directed here. You can find the full bug report along with the debugging setup and output at https://dev.icinga.org/issues/3489. Could this be a mod_gearman issue?

sni commented 11 years ago

According to the reported issue, you are using an old Icinga version and an old mod-gearman version. Could you reproduce that behaviour with recent releases?

buraks78-sn commented 11 years ago

Unfortunately, I did not. Even if I did, my production instance is still stuck at those versions. Can you point me to the fix/change (if there is actually one) so that I can maybe patch the code?

easterhanu commented 11 years ago

I stumbled into this freshness issue with Nagios 4.x and mod_gearman-1.4_nagios4. This is a bug, or should I say a yet-to-be-implemented feature, in mod_gearman.

As buraks78-sn had already found while debugging, when Nagios executes a host or service freshness check, it sets the is_being_freshened flag for the host or service object to TRUE, but mod_gearman is missing the code to reset the flag to FALSE after the check has been run. Later, when Nagios again looks for freshness checks to be done, it thinks the host or service in question is already in the process of being freshened (forever) and does nothing. The end result is that the freshness check works only once when using mod_gearman.

A quick fix is to add the flag reset to handle_host_check() and handle_svc_check() before the return clause in mod_gearman.c:

    /* unset the freshening flag */
    hst->is_being_freshened=FALSE;
    ...
    svc->is_being_freshened=FALSE;
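To illustrate why the missing reset makes the freshness check fire only once, here is a minimal toy model in C. It is a sketch only: `toy_service`, `toy_check_freshness`, `toy_handle_result` and `toy_simulate` are hypothetical names invented for this example, not Nagios or mod_gearman code, and the only thing borrowed from the real code is the `is_being_freshened` flag.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a Nagios service object; only the flag matters. */
typedef struct {
    bool is_being_freshened;
    int  freshness_checks_run;
} toy_service;

/* Models the core's freshness pass: objects already flagged are skipped. */
bool toy_check_freshness(toy_service *svc) {
    if (svc->is_being_freshened)
        return false;               /* believed to be in progress, nothing is scheduled */
    svc->is_being_freshened = true;
    svc->freshness_checks_run++;
    return true;
}

/* Models a result handler; reset_flag selects behaviour with or without the fix. */
void toy_handle_result(toy_service *svc, bool reset_flag) {
    if (reset_flag)
        svc->is_being_freshened = false;   /* the quick fix */
}

/* Runs `rounds` freshness passes and returns how many checks were actually started. */
int toy_simulate(int rounds, bool reset_flag) {
    toy_service svc = { false, 0 };
    for (int i = 0; i < rounds; i++) {
        if (toy_check_freshness(&svc))
            toy_handle_result(&svc, reset_flag);
    }
    return svc.freshness_checks_run;
}
```

In this model, `toy_simulate(3, false)` starts a single check because the flag stays set after the first pass, while `toy_simulate(3, true)` starts one check per pass.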

Note that handle_async_service_check_result() in Nagios base/checks.c additionally has code for discarding invalid freshness check results in a race condition where a passive check result arrives while the freshness check is running. I'm not sure if this functionality can be implemented in a NEB module.

sni commented 11 years ago

The mentioned race should not happen when using mod-gearman because mod-gearman resets the flag before running the check, not afterwards. So just resetting the flag should be enough. I will push the fix after some tests.

easterhanu commented 11 years ago

I don't claim to understand the check and brokering logic completely, but I don't think resetting the freshness flag affects the race condition. Freshness check is an active check, which often uses a dummy plugin to submit a "fake" result to a passive service, while the "real" results are passive.

Assume a passive "SNMP Trap" service with a 10 min freshness threshold: if no traps have been received in the past 10 minutes, Nagios determines the service to be stale and initiates a dummy freshness check, which resets the service state to "OK - No traps in the last 10 min". mod_gearman will intercept the freshness check event and start processing it. A split second later, the PDU in an Ethernet switch breaks down and the switch sends a trap, causing the "SNMP Trap" service to go to the CRITICAL state. Now the service is actually fresh and CRITICAL is the correct state, so the freshness check should be aborted and the dummy OK result rejected. Is this something Nagios will do even if you use mod-gearman, or something mod-gearman would somehow have to implement?
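To make the scenario concrete, here is a toy C model of one way such a guard could work. It is purely illustrative and is not taken from Nagios or mod_gearman: the `race_service` struct, the `race_*` functions and the timestamp comparison are all assumptions made up for this sketch. The idea is to remember when the freshness check was dispatched and to drop its dummy result if a real result was accepted in the meantime.

```c
#include <stdbool.h>

/* Hypothetical service model; states follow the plugin convention 0 = OK, 2 = CRITICAL. */
typedef struct {
    int  state;
    long last_update;        /* time the last result was accepted */
    bool is_being_freshened;
    long freshen_started;    /* time the freshness check was dispatched */
} race_service;

/* The service was found stale: dispatch the dummy freshness check. */
void race_start_freshness(race_service *svc, long now) {
    svc->is_being_freshened = true;
    svc->freshen_started = now;
}

/* A real passive result (e.g. an SNMP trap) arrives and is always accepted. */
void race_submit_passive(race_service *svc, int state, long now) {
    svc->state = state;
    svc->last_update = now;
}

/* The dummy result comes back. Returns true if it was applied, false if discarded. */
bool race_finish_freshness(race_service *svc, int dummy_state, long now) {
    svc->is_being_freshened = false;          /* always clear the flag */
    if (svc->last_update >= svc->freshen_started)
        return false;                         /* service became fresh meanwhile: reject */
    svc->state = dummy_state;
    svc->last_update = now;
    return true;
}
```

With these assumptions, a CRITICAL trap submitted between `race_start_freshness` and `race_finish_freshness` survives, and the dummy OK result is only applied when nothing arrived in between.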

sni commented 11 years ago

Should be fixed with the latest release.