naemon / naemon-core

Networks, Applications and Event Monitor
http://www.naemon.io/
GNU General Public License v2.0

Fixed: Naemon stops executing checks and doesn't respawn Core Worker #419 #421

Closed by nook24 8 months ago

nook24 commented 1 year ago

While reviewing #419, I noticed that Naemon is not taking care of its dead worker processes. Whenever a worker process dies, it becomes a <defunct> zombie process. The reason is that Naemon did not call waitpid() when a worker process died. I did not find any explanation of why it was implemented this way. Could waitpid(wp->pid, &i, 0) hang (wait forever) if the worker process was not forked from the currently running Naemon? We could probably replace this with waitpid(wp->pid, &i, WNOHANG), but I'm not 100% sure about this. The best I could find on the topic was: https://github.com/NagiosEnterprises/nagioscore/issues/635
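
For reference, a minimal, self-contained sketch of the non-blocking reap suggested above (illustration only, not naemon-core code): with WNOHANG, waitpid() returns immediately whether or not the child has exited, so the main loop cannot block, and calling it on a pid that is not a child of the current process fails right away with ECHILD instead of hanging.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
    pid_t worker_pid = fork();          /* stand-in for a naemon worker process */
    if (worker_pid < 0) {
        perror("fork");
        return EXIT_FAILURE;
    }
    if (worker_pid == 0)
        _exit(0);                       /* the "worker" dies immediately */

    sleep(1);                           /* give the child time to exit */

    int status = 0;
    pid_t reaped = waitpid(worker_pid, &status, WNOHANG);
    if (reaped == worker_pid)
        printf("worker %ld reaped, no <defunct> zombie left behind\n", (long)reaped);
    else if (reaped == 0)
        printf("worker %ld has not exited yet, try again later\n", (long)worker_pid);
    else
        perror("waitpid");              /* e.g. ECHILD if the pid is not our child */

    return EXIT_SUCCESS;
}
```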

sni commented 1 year ago

I tried, but couldn't get it working: whenever I kill a worker process I do get a message that a new worker was spawned, but a couple of seconds later the main naemon process aborts.

Program received signal SIGSEGV, Segmentation fault.
0x00007f475305bdfd in __strlen_avx2 () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install glib2-2.68.4-5.el9.x86_64 libicu-67.1-9.el9.x86_64 libstdc++-11.3.1-2.1.el9.x86_64 pcre-8.44-3.el9.3.x86_64
(gdb) bt
#0  0x00007f475305bdfd in __strlen_avx2 () from /lib64/libc.so.6
#1  0x00007f4753456b68 in dopr (buffer=0x0, maxlen=0, format=0x7f4753459478 "Processing check result file: '%s'\n", args_in=0x7ffe7f668e10) at lib/snprintf.c:712
#2  0x00007f4753457d14 in smb_vsnprintf (str=0x0, count=0, fmt=0x7f4753459418 "Error: Unknown object check type for checkresult: %d; (host_name: %s; service_description: %s)\n", args=0x7ffe7f668e10) at lib/snprintf.c:1197
#3  0x00007f4753457e21 in vasprintf (ptr=0x7ffe7f668e50, format=0x7f4753459418 "Error: Unknown object check type for checkresult: %d; (host_name: %s; service_description: %s)\n", ap=0x7ffe7f668e58) at lib/snprintf.c:1232
#4  0x00007f47533feae9 in nm_log (data_type=1, fmt=0x7f4753459418 "Error: Unknown object check type for checkresult: %d; (host_name: %s; service_description: %s)\n") at src/naemon/logging.c:165
#5  0x00007f47533dcf04 in process_check_result (cr=0x14cbbd0) at src/naemon/checks.c:345
#6  0x00007f47533e21fb in handle_worker_service_check (wpres=0x7ffe7f668ff0, arg=0x14cbbd0, flags=0) at src/naemon/checks_service.c:404
#7  0x00007f475342317c in run_job_callback (job=0x1230520, wpres=0x7ffe7f668ff0, val=0) at src/naemon/workers.c:167
#8  0x00007f4753424046 in handle_worker_result (sd=7, events=1, arg=0x11a92c0) at src/naemon/workers.c:536
#9  0x00007f47534516d9 in iobroker_poll (iobs=0x11a93c0, timeout=785) at lib/iobroker.c:376
#10 0x00007f47533fcd21 in event_poll_full (iobs=0x11a93c0, timeout_ms=1500) at src/naemon/events.c:331
#11 0x00007f47533fce87 in event_poll () at src/naemon/events.c:384
#12 0x0000000000403916 in event_execution_loop () at src/naemon/naemon.c:119
#13 0x0000000000404ca2 in main (argc=2, argv=0x7ffe7f6694f8) at src/naemon/naemon.c:708

I assume it is because we removed the worker structure while it is still referenced in some jobs.

nook24 commented 1 year ago

I have tried to wrap my head around this, but I'm not sure how valid my results are :) I think the reason for the crash is that wpres is NULL (0x0) when the run_job_callback function gets executed: https://github.com/naemon/naemon-core/blob/d6c4c6eaa86c3031c03457f48e74bcdc68aa9451/src/naemon/workers.c#L162-L170

Without the return, the callback function static void handle_worker_host_check or static void handle_worker_service_check would get called. At that point the check_result struct contains pure garbage, which leads to the segmentation fault. https://github.com/naemon/naemon-core/blob/d6c4c6eaa86c3031c03457f48e74bcdc68aa9451/src/naemon/checks_host.c#L630-L633 The segmentation fault occurs at the end of the handle_worker_host_check callback, when the cr is supposed to be freed.

The part I'm unsure about is that destroy_job also calls run_job_callback and passes NULL as wpres. I'm not sure what this is supposed to clean up: https://github.com/naemon/naemon-core/blob/d6c4c6eaa86c3031c03457f48e74bcdc68aa9451/src/naemon/workers.c#L176-L187
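
To make the point above concrete, here is a rough, self-contained sketch of the early return being discussed. The struct definitions are hypothetical stand-ins; the real job and result types live in src/naemon/workers.c and the worker library, so this is an illustration rather than the actual patch.

```c
/* Hypothetical stand-in types; the real definitions live in naemon-core. */
struct wproc_result;                                  /* opaque worker result */

struct wproc_job {
    void (*callback)(struct wproc_result *, void *, int);
    void *arg;
};

void run_job_callback(struct wproc_job *job, struct wproc_result *wpres, int val)
{
    if (!job || !job->callback)
        return;

    /* The early return discussed above: destroy_job() has no worker result
     * and passes NULL here, so skipping the callback keeps
     * handle_worker_host_check()/handle_worker_service_check() from reading
     * (and later freeing) an uninitialized check_result. */
    if (!wpres)
        return;

    (*job->callback)(wpres, job->arg, val);
    job->callback = NULL;
}
```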

With the current change, Naemon was not crashing anymore. I created this sketchy script for testing

<?php

// List all running naemon worker processes (excluding the grep itself)
exec('ps -eaf | grep "naemon --worker" | grep -v "grep"', $output);
foreach ($output as $line) {
    // ps -eaf columns: UID PID PPID C STIME TTY TIME CMD
    $columns = preg_split('/\s+/', trim($line));
    $pid = $columns[1];
    if (is_numeric($pid)) {
        echo "kill $pid\n";
        exec('kill ' . $pid);
    }
}

watch --interval=5 "php kill.php" allows testing with different kill intervals. In my test Naemon was running for roughly an hour while the script killed the worker processes every 5 seconds. No crashes or memory leaks so far, but this requires further testing.

nook24 commented 11 months ago

Is there anything we can do to get this merged?

sni commented 10 months ago

Is it stable now? Have you run any tests lately?

nook24 commented 10 months ago

Is it stable now? Have you run any tests lately?

I will set up a test system and report back.

nook24 commented 10 months ago

Good morning, a quick update on this.

My test system: I set up a system (Ubuntu 22.04) with the latest stable Naemon release and loaded the Statusengine Event Broker. The system ran for 7 days to collect baseline metrics.

On November 2, I deployed the patched version of Naemon. To make sure that the re-spawning of dead worker processes works as expected, a cronjob kills random Naemon worker processes every 5 minutes. Only the worker processes (--worker) were killed.

Results: I did not notice any difference. The patched Naemon respawned new worker processes as expected. All checks were executed and I did not see any memory leaks.

The only thing I noticed is that the "Active service check latency" metric breaks.

[Screenshot: Naemon stable]

[Screenshot: Naemon PR 421]

[Screenshot: worker-killing cronjob]

[Screenshot: Naemonstats]

nook24 commented 9 months ago

3 weeks later, Naemon is still running fine and re-spawns the randomly killed workers every 5 minutes. [Screenshot]

jacobbaungard commented 9 months ago

I guess this would need a fix for the check latency before it can be merged? Besides that, this is a great feature imo.

sni commented 9 months ago

what exactly breaks the latency here?

jacobbaungard commented 9 months ago

what exactly breaks the latency here?

Not sure why it breaks, I was merely referring to the comments/screenshots by @nook24 above.

nook24 commented 9 months ago

After reading all the event code over and over again, I suspect the latency issue is related to this change: https://github.com/naemon/naemon-core/commit/978bbf3ed4f63e39be180ff6614698e51f4b276d

It is not part of Naemon 1.4.1, which is the version I used for my baseline measurements. @sni mentioned in his commit message that the old latency calculation is wrong. My test system has 21 services (13 active / 8 passive), and an active check latency of 603 ms (as reported by Naemon 1.4.1) sounds way too high when I think about it.

The current master branch / this branch is not reporting 0 values for latency; I was also able to get values from 1 ms up to 55 ms. [Screenshot]

The code describes the latency value as follows:

/* latency is how long the event lagged behind the event queue */

If I understand this correctly, the latency value represents how long the event was stored in the queue before a free worker picked it up for processing. Since my test system is pretty small and does not have much to do, it should not take long for a free worker to handle the event.

However, I'm not sure what this tries to achieve:

https://github.com/naemon/naemon-core/blob/83b25ec4c0f68c51a7f7291d0bfebbef817fdb0f/src/naemon/events.c#L375

sni commented 9 months ago

Latency will always be in a range of 0-1000 ms, because the scheduled checks have 1-second resolution but are randomized at millisecond resolution into the run queue. If that is what is "broken", then it was like that before, and fixing it would be purely cosmetic.
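
A minimal sketch of that explanation, under the assumption that latency is simply the difference between the second-resolution schedule and the millisecond-randomized run-queue slot (illustration only, not naemon-core code):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    srand((unsigned)time(NULL));

    /* Checks are scheduled with 1-second resolution ... */
    time_t scheduled_sec = time(NULL) + 60;

    /* ... but inserted into the run queue at a random millisecond offset. */
    long long queue_slot_ms = (long long)scheduled_sec * 1000 + rand() % 1000;

    /* "Latency" measured against the second-resolution schedule therefore
     * ends up somewhere between 0 and 999 ms even on an idle system. */
    long long latency_ms = queue_slot_ms - (long long)scheduled_sec * 1000;
    printf("scheduled second: %lld, latency: %lld ms\n",
           (long long)scheduled_sec, latency_ms);
    return 0;
}
```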

sni commented 9 months ago

Just one thing, could you remove the "TODO" thing :-) Otherwise I am fine...

nook24 commented 9 months ago

Just one thing, could you remove the "TODO" thing :-) Otherwise I am fine...

Oops, I have removed it.

If that is what is "broken", then it was like that before, and fixing it would be purely cosmetic.

I'm not sure if it is "broken" or not. As long as you are fine with the reported latency value, I'm fine with it as well, I guess.