openSUSE / wicked

Framework for network configuration
https://en.opensuse.org/Portal:Wicked
GNU General Public License v2.0
101 stars 50 forks source link

[Question] Multiple ni_process_reap: issue in wickedd logs #741

Open MikhailKasimov opened 6 years ago

MikhailKasimov commented 6 years ago

Hello!

Have Leap 42.3 installed on VirtualBox.

Wicked version: 0.6.40.

Multiple ni_process_reap: issue in logs, such like:

-- Reboot --
мар 11 18:02:25 linux-mk500 systemd[1]: Starting wicked network management service daemon...
мар 11 18:02:25 linux-mk500 systemd[1]: Started wicked network management service daemon.
мар 11 18:02:28 linux-mk500 wickedd[870]: ni_process_reap: process 1172 has not exited yet; now doing a blocking waitpid()
мар 11 18:13:27 linux-mk500 systemd[1]: Stopping wicked network management service daemon...
мар 11 18:13:27 linux-mk500 systemd[1]: Stopped wicked network management service daemon.
-- Reboot --
мар 11 18:13:46 linux-mk500 systemd[1]: Starting wicked network management service daemon...
мар 11 18:13:46 linux-mk500 systemd[1]: Started wicked network management service daemon.
мар 11 18:13:49 linux-mk500 wickedd[876]: ni_process_reap: process 1176 has not exited yet; now doing a blocking waitpid()
мар 11 19:15:47 linux-mk500 systemd[1]: Stopping wicked network management service daemon...
мар 11 19:15:47 linux-mk500 systemd[1]: Stopped wicked network management service daemon.
-- Reboot --
мар 11 19:17:01 linux-mk500 systemd[1]: Starting wicked network management service daemon...
мар 11 19:17:01 linux-mk500 systemd[1]: Started wicked network management service daemon.
мар 11 19:17:03 linux-mk500 wickedd[892]: ni_process_reap: process 1188 has not exited yet; now doing a blocking waitpid()

journalctl -e _PID=1188 command gives no result. So, I can't catch what the problem process (1176, 1188, etc) is.

Can you, please, give a hint what ni_process_reap: is and how to solve it, if it is problem indeed?

Thanks! wickedd_0_6_40_ni_process_reap.txt

archiecobbs commented 3 years ago

Still happening 3.5 years later... lots of errors... very verbose...

openSUSE Leap 15.3 wicked-0.6.65-2.1.x86_64

$ tail -n 12 /var/log/warn
Sep 14 14:25:49 demo wickedd[985]: ni_process_reap: process 7495 has not exited yet; now doing a blocking waitpid()
Sep 14 14:28:25 demo wickedd[985]: ni_process_reap: process 5925 has not exited yet; now doing a blocking waitpid()
Sep 14 14:45:08 demo wickedd[985]: ni_process_reap: process 10878 has not exited yet; now doing a blocking waitpid()
Sep 14 14:55:49 demo wickedd[985]: ni_process_reap: process 8612 has not exited yet; now doing a blocking waitpid()
Sep 14 15:09:21 demo wickedd[985]: ni_process_reap: process 8084 has not exited yet; now doing a blocking waitpid()
Sep 14 15:19:17 demo wickedd[985]: ni_process_reap: process 29526 has not exited yet; now doing a blocking waitpid()
Sep 14 15:32:16 demo wickedd[985]: ni_process_reap: process 22557 has not exited yet; now doing a blocking waitpid()
Sep 14 15:34:41 demo wickedd[985]: ni_process_reap: process 19040 has not exited yet; now doing a blocking waitpid()
Sep 14 15:47:04 demo wickedd[985]: ni_process_reap: process 4441 has not exited yet; now doing a blocking waitpid()
Sep 14 15:57:26 demo wickedd[985]: ni_process_reap: process 30766 has not exited yet; now doing a blocking waitpid()

What information do you need from me to debug this??

mtomaschewski commented 3 years ago

It's about the handle_hangup callback called on poll(2) hangup event:

       POLLHUP
              Hang up (only returned in revents; ignored in events).  Note that when reading from
              a channel such as a pipe or a stream socket, this event merely indicates  that  the
              peer  closed its end of the channel.  Subsequent reads from the channel will return
              0 (end of file) only after all outstanding data in the channel has been consumed.

In src/process.c, there is a:

        rv = waitpid(pi->pid, &pi->status, WNOHANG);
        if (rv == 0) {
                /* This is an ugly workaround. Sometimes, we seem to get a hangup on the socket even
                 * though the script (provably) still has its end of the socket pair open for writing. */
                ni_error("%s: process %u has not exited yet; now doing a blocking waitpid()",
                                __func__, pi->pid);
                rv = waitpid(pi->pid, &pi->status, 0);
        }

I've pushed some test code to https://github.com/mtomaschewski/wicked/tree/pipe-test branch which may be helpful to narrow down what is causing it -- it shows the command that were executed and there is a skeleton in testing using a very simple ni_process exec.

# journalctl | grep ni_process_reap
# journalctl | wc -l
114384

I didn't found it in the logs on my machines. As you're getting them, could you please provide the logs with the command?

archiecobbs commented 3 years ago

Here's what I see (so far) after applying the patch and rebooting:

Sep 15 13:07:38 demo wickedd[5818]: ni_process_reap: process 5822 (/etc/wicked/extensions/firewall up) has not exited yet; now doing a blocking waitpid()
Sep 15 13:07:38 demo wickedd[5818]: ni_process_reap: process 5822 (/etc/wicked/extensions/firewall up) reaped in blocking waitpid after 0m0.000s
Sep 15 13:07:47 demo wickedd[5818]: ni_process_reap: process 7890 (/etc/wicked/extensions/netconfig batch) has not exited yet; now doing a blocking waitpid()
Sep 15 13:07:47 demo wickedd[5818]: ni_process_reap: process 7890 (/etc/wicked/extensions/netconfig batch) reaped in blocking waitpid after 0m0.000s
Sep 15 13:30:15 demo wickedd[973]: ni_process_reap: process 4533 (/etc/wicked/extensions/netconfig batch) has not exited yet; now doing a blocking waitpid()
Sep 15 13:30:15 demo wickedd[973]: ni_process_reap: process 4533 (/etc/wicked/extensions/netconfig batch) reaped in blocking waitpid after 0m0.000s
Sep 15 13:32:36 demo wickedd[973]: ni_process_reap: process 30420 (/etc/wicked/extensions/netconfig batch) has not exited yet; now doing a blocking waitpid()
Sep 15 13:32:36 demo wickedd[973]: ni_process_reap: process 30420 (/etc/wicked/extensions/netconfig batch) reaped in blocking waitpid after 0m0.000s

Looking at the code and noting that the wait times above are all less that one millisecond, it appears that the bug in the code is simply a mistaken assumption: Just because a socket that is connected to process X has been closed, that doesn't mean that process X has completed exiting.

So it's possible that the first invocation of wait() will be too soon; if so, instead of complaining, the code should wait for longer for the process to exit (say, up to 100ms?) before then complaining.

mtomaschewski commented 3 years ago

mistaken assumption

Yes. A process can close it's fd's and still do something (e.g. cleanup) before it exits. Further, it can also be, that the process already exited, just the kernel didn't cleaned up the process resources yet and is not ready to report the exit status yet. We use the ni_process functions for many things, to send or to receive some data (or nothing), so it's a quite important place with several use-cases. And "almost always" we need the data + error code at the end.

(When you look into e.g. extensions/firewall and schema/firewall.xml, you'll see that we send data to the scripts, in firewall case it's the <firewall><zone>foobar</zone></firewall> config node that is passed as <arguments><config><zone>foobar</zone></config></arguments>)

instead of complaining, the code should wait for longer for the process to exit (say, up to 100ms?) before then complaining.

Yes, I'd tend keep it as is (no fixed "deadline" or e.g. 100ms what is already "very long time"), but to change the error msg to debug, call blocking waitpid as we did until now and then either log debug message again (so we see in debug how long it took) or when it took 1sec or more, log this as error or warning?

mtomaschewski commented 3 years ago

Better would be to carefully change the code to use a signal handler (use signalfd, which allows to poll them too) and finally call the notify_callback, but this would be much more intrusive than the above and needs extensive tests.

archiecobbs commented 3 years ago

instead of complaining, the code should wait for longer for the process to exit (say, up to 100ms?) before then complaining.

Yes, I'd tend keep it as is (no fixed "deadline" or e.g. 100ms what is already "very long time"), but to change the error msg to debug, call blocking waitpid as we did until now and then either log debug message again (so we see in debug how long it took) or when it took 1sec or more, log this as error or warning?

Yes I think we're saying more or less the same thing. Your approach is slightly more airtight because it guarantees the process has fully exited. But either way, we would define some "error threshold" of 1 second or whatever that would be considered abnormal and trigger a warning, whereas below this value no warning wold be generated.

Thanks.

mtomaschewski commented 3 years ago

Yes I think we're saying more or less the same thing.

Yes, I think so too.

But either way, we would define some "error threshold" of 1 second or whatever that would be considered abnormal and trigger a warning, whereas below this value no warning wold be generated.

Yes, when you prefer we can keep it an error. When we run into cases where it needs >=1s to get the exit status, there is something fishy and it's needed to take a much closer look at all this (rewrite to use signalfd IMO).

Thanks!