sni / mod_gearman

Distribute Naemon Host/Service Checks & Eventhandler with Gearman Queues. Host/Servicegroups affinity included.
http://www.mod-gearman.org
GNU General Public License v3.0

Zombie Processes #150

Closed chifac08 closed 2 years ago

chifac08 commented 4 years ago

When a worker finishes or exits unexpectedly, there is a short time window in which Linux reports the child process as a zombie. The window lies between the child's exit and the call to clean_worker_exit.

Let me show you a short demonstration:

su - site
omd set config MOD_GEARMAN on
omd start example
omd status

Wait until all workers are up and running. Now start a simple bash script to monitor the zombies:

COUNTER=0
while true; do
    ps ax | grep -v grep | grep defunct
    if [[ $? -eq 0 ]]; then
        echo $COUNTER
        ((COUNTER++))
    fi
    sleep 10
done

As you can see, a lot of worker processes show up marked as "defunct".

I know that you stop all children when the clean_exit method is called, but that does not help when a single process exits.

I suggest implementing the following method in worker.c:

/* requires <errno.h> and <sys/wait.h> */
void child_exit(int sig)
{
    pid_t child_pid;
    int status;
    int saved_errno = errno;    /* waitpid may overwrite errno of the interrupted code */

    gm_log(GM_LOG_TRACE, "caught signal from child %d\n", sig);

    /* reap every child whose state changed; WNOHANG returns 0 when none is left */
    while ((child_pid = waitpid(-1, &status, WUNTRACED | WCONTINUED | WNOHANG)) > 0)
    {
        if (WIFEXITED(status))
        {
            gm_log(GM_LOG_TRACE, "child %d exited, status=%d\n", child_pid, WEXITSTATUS(status));
        }
        else if (WIFSIGNALED(status))
        {
            gm_log(GM_LOG_TRACE, "child %d killed by signal %d\n", child_pid, WTERMSIG(status));
        }
        else if (WIFSTOPPED(status))
        {
            gm_log(GM_LOG_TRACE, "child %d stopped by signal %d\n", child_pid, WSTOPSIG(status));
        }
    }

    /* ECHILD only means there is no child left to wait for */
    if (child_pid == -1 && errno != ECHILD)
    {
        gm_log(GM_LOG_ERROR, "waitpid failed!\n");
    }

    errno = saved_errno;
}

Of course, we also have to install a signal handler that catches the SIGCHLD signal sent when a child exits.

In the method make_new_child: signal(SIGCHLD, child_exit);

Every zombie lives only a few seconds before it is reaped, so we do not need to worry about resources or running out of process IDs. Still, I would be delighted if you could fix this, because my monitoring software complains about it.

If you need any further information, feel free to contact me!

Thanks!

sni commented 4 years ago

Which OMD version is that? The latest release does not use this worker anymore (only the neb module). The worker has been rewritten from scratch at https://github.com/ConSol/mod-gearman-worker-go/ and that one is also enabled by default in OMD (at least since 3.20).

chifac08 commented 4 years ago

Version: omd-3.2. I am aware that there is a new mod_gearman worker written in Go, but I prefer the C version because the monitoring scripts were written for it. With the method mentioned above I managed to eliminate the behavior, as long as the service is not terminated by a SIGALRM.