Naemon+mod-gearman checks stalled

dzuleta commented 7 years ago

We have Naemon installed with the latest OMD package (.rpm) with 1k hosts and 4k services. We thought at first we might have run into a bottleneck, but after some investigation it seems some checks are being stalled, and we still can't figure out why. If anyone could give me some ideas, that would be awesome.

Example Host: CL_HC_EL-BELLOTO_IDF_D1_WiFi10

The check interval for pinging hosts is 1 minute. The last check time was 12:49:40, and it keeps rescheduling the check every minute but never launches it (the current rescheduled time is 13:33, but it will keep moving; it's been 40 minutes now and we have had this issue for days).

This happens to around 20% of the checks. If we reduce the number of checks, the problem disappears.
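The gap described above can be put into numbers. The following is a minimal sketch in C, not anything from Naemon or Mod-Gearman: the function names and the `tolerance` parameter are made up for illustration, and it assumes both timestamps fall on the same day. It counts how many whole check intervals have passed since the last executed check and flags the host once that count exceeds a chosen tolerance.

```c
#include <assert.h>
#include <stdbool.h>

/* Seconds since midnight for a wall-clock time (same day assumed). */
static long hms_to_seconds(int h, int m, int s) {
    return (long)h * 3600 + (long)m * 60 + s;
}

/* Whole check intervals elapsed since the last executed check. */
static long missed_intervals(long last_check_s, long now_s, long interval_s) {
    assert(interval_s > 0);
    assert(now_s >= last_check_s);
    return (now_s - last_check_s) / interval_s;
}

/* Hypothetical rule of thumb: a check that has skipped more than
 * `tolerance` intervals is treated as stalled rather than merely late. */
static bool looks_stalled(long last_check_s, long now_s,
                          long interval_s, long tolerance) {
    return missed_intervals(last_check_s, now_s, interval_s) > tolerance;
}
```

With a last check at 12:49:40, a 60-second interval, and 13:33:00 taken as the current time, `missed_intervals` reports 43 skipped intervals, which is far beyond normal scheduling latency.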

We have the mod-gearman server and worker on the same machine, so we can discard communication troubles.

Checking the mod_gearman_worker.log, we can see that the worker got the job and processed it successfully.

The mod_gearman_neb.log also has the information about the job:

However, it never says "host job completed" as it does for other hosts. naemon.debug doesn't show anything for the affected host or IP (debug_level=4184).

Does anyone have any idea why it is not completed, or why it keeps rescheduling the check without actually running it? Thanks, Regards, Daniel

dzuleta commented 7 years ago

Update #1: We greatly reduced the number of hosts and services to 150/550 and the problem still persists. I think we can rule out performance issues with this. With Nagios 4 we can handle a lot more than this on a similar machine.

In the naemon.log file we are getting "Event was cancelled by iobroker input".

We have:

16 lines of iobroker messages
Around 21 hosts not doing checks after ~30 minutes

Not sure if this is correlated or not. I will try to run the same number of checks directly from Naemon instead of using mod-gearman and check whether the problem disappears.

Regards, Daniel

Could it be related and maybe a Naemon Core problem? Should I open the issue there?

sni commented 7 years ago

I have to investigate. If the issue is gone when only using Naemon, it really might be a problem in the Mod-Gearman NEB module. Did I understand you correctly that the checks are actually executed, and the problem seems to occur when putting the results back into the core?

dzuleta commented 7 years ago

Update #2: We ran a test using only Naemon and it worked fine. We keep getting the same iobroker messages (even more than before), but the checks are being executed correctly, none of them has been left behind, and it has been working as expected for more than 30 minutes.

sni: Thank you very much for your fast reply and interest in helping. You understood correctly. As far as I can see in the mod_gearman_neb.log, it got the request for the host queue and added the job to the worker queue.

Then, in the mod_gearman_worker.log, we can see that the worker got the job, executed it, got a valid response (which is saved in the "data" field), and then sent it to the check_results queue.

But in the end, the server never logged the "host job completed" message that appears for the other hosts which ran the check successfully.

Could you point me in the right direction? Can you think of any other test I can run to debug this? Thank you very much, Regards, Daniel
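One way to narrow down where a result gets lost is to track which hosts have a job submitted but no completion yet. The sketch below is only an illustration of that bookkeeping in C; the `pending_*` names, the fixed table size, and keying by host name are assumptions and are not part of the mod_gearman source.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_PENDING 64   /* hypothetical table size */
#define MAX_NAME    128  /* hypothetical host name limit */

struct pending_jobs {
    char names[MAX_PENDING][MAX_NAME];
    int  used[MAX_PENDING];
};

static void pending_init(struct pending_jobs *p) {
    assert(p != NULL);
    memset(p, 0, sizeof *p);
}

/* Record that a job for `host` was queued. Returns 0 on success
 * (also if it was already pending), -1 if the table is full. */
static int pending_submit(struct pending_jobs *p, const char *host) {
    assert(p != NULL && host != NULL);
    for (int i = 0; i < MAX_PENDING; i++)
        if (p->used[i] && strcmp(p->names[i], host) == 0)
            return 0;
    for (int i = 0; i < MAX_PENDING; i++) {
        if (!p->used[i]) {
            snprintf(p->names[i], MAX_NAME, "%s", host);
            p->used[i] = 1;
            return 0;
        }
    }
    return -1;
}

/* Clear the entry when a result arrives. Returns 1 if an entry
 * was cleared, 0 if no job was pending for `host`. */
static int pending_complete(struct pending_jobs *p, const char *host) {
    assert(p != NULL && host != NULL);
    for (int i = 0; i < MAX_PENDING; i++) {
        if (p->used[i] && strcmp(p->names[i], host) == 0) {
            p->used[i] = 0;
            return 1;
        }
    }
    return 0;
}

/* Jobs that were submitted but never completed. */
static int pending_count(const struct pending_jobs *p) {
    int n = 0;
    for (int i = 0; i < MAX_PENDING; i++)
        n += p->used[i] ? 1 : 0;
    return n;
}
```

Anything still counted by `pending_count` after a few check intervals would be a candidate for a result that left the worker but never reached the core.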

dzuleta commented 7 years ago

Update #3: I have installed gearmand from source. I tried version 0.33 and the latest (1.1.15). Both of them run with Naemon, but the same issue occurs. I have also installed mod-gearman from source and added a few more lines of debugging. I can confirm the issue occurs because mod_gearman_neb is not getting "host job completed" (I am trying to figure out how to track this).

In the image, the first time, when it is working correctly, it shows "host job completed" at 14:35:18. When it fails, at 14:51:01, it doesn't show "host job completed".

@sni could you give me a hint on how to debug get_results in result_thread.c? Is there a unique id or something I can use to identify the host job inside that function? I only get the host name at line 220, but I am not sure whether it is failing before that.

Thank you very much, Regards, Daniel

sni commented 7 years ago

What's seriously wrong are the lines:

[2017-03-09 20:30:57][32070][TRACE] handle_host_check() finished successfully -> 206

This is the line in source:

gm_log( GM_LOG_TRACE, "handle_host_check() finished successfully -> %d\n", NEBERROR_CALLBACKOVERRIDE );

See, it just prints the value of a constant which is defined in naemon headers like:

./src/naemon/neberrors.h:#define NEBERROR_CALLBACKOVERRIDE   206

So not only does your NEB module somehow show a wrong value, the value even changes over time. This might be some memory corruption. Are there any other modules loaded? Also, this line is logged at log level trace, while yours shows up as debug. Something is seriously wrong here.

dzuleta commented 7 years ago

My bad @sni, I apologize for not being careful enough when adding some new lines. I just wanted to add some breakpoints to the code for debugging (trace was too much text for 5k services, so I reduced it to debug only and replicated some GM_LOG_TRACE calls as GM_LOG_DEBUG).

I was printing the buffer1 value instead of NEBERROR_CALLBACKOVERRIDE :(
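To make the mix-up concrete, here is a small standalone sketch in C. It does not use mod_gearman's `gm_log`; the helper names are hypothetical, and the `value` argument merely stands in for whatever integer ends up being passed in place of the constant. Only the format string and the value 206 are taken from the lines quoted earlier in the thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Value quoted in the thread from naemon's neberrors.h. */
#define NEBERROR_CALLBACKOVERRIDE 206

/* Formats the line the way the original source does: the argument
 * is the compile-time constant, so the number can never vary. */
static int format_expected(char *buf, size_t len) {
    assert(buf != NULL);
    return snprintf(buf, len,
                    "handle_host_check() finished successfully -> %d\n",
                    NEBERROR_CALLBACKOVERRIDE);
}

/* Formats the same line with an arbitrary integer, as happens when
 * a copied log call is given some other variable by mistake. */
static int format_with_value(char *buf, size_t len, int value) {
    assert(buf != NULL);
    return snprintf(buf, len,
                    "handle_host_check() finished successfully -> %d\n",
                    value);
}

/* Returns 1 if `line` matches what the unmodified source would print. */
static int line_has_expected_code(const char *line) {
    char expected[128];
    assert(line != NULL);
    format_expected(expected, sizeof expected);
    return strcmp(line, expected) == 0;
}
```

A line whose trailing number differs from 206 therefore points at the edited log call rather than at memory corruption in the module.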

Since I haven't changed any code (I only added some debug lines), I still believe the check somehow never reaches the result_thread or is failing somewhere in between. Do you think I am heading in the right direction?

Regards, Daniel

sni commented 6 years ago

If this is a cluster-setup, the previously mentioned patch might help.

sni / mod_gearman

Naemon+mod-gearman checks stalled #111