Open MikhailKasimov opened 6 years ago
Still happening 3.5 years later... lots of errors... very verbose...
openSUSE Leap 15.3 wicked-0.6.65-2.1.x86_64
$ tail -n 12 /var/log/warn
Sep 14 14:25:49 demo wickedd[985]: ni_process_reap: process 7495 has not exited yet; now doing a blocking waitpid()
Sep 14 14:28:25 demo wickedd[985]: ni_process_reap: process 5925 has not exited yet; now doing a blocking waitpid()
Sep 14 14:45:08 demo wickedd[985]: ni_process_reap: process 10878 has not exited yet; now doing a blocking waitpid()
Sep 14 14:55:49 demo wickedd[985]: ni_process_reap: process 8612 has not exited yet; now doing a blocking waitpid()
Sep 14 15:09:21 demo wickedd[985]: ni_process_reap: process 8084 has not exited yet; now doing a blocking waitpid()
Sep 14 15:19:17 demo wickedd[985]: ni_process_reap: process 29526 has not exited yet; now doing a blocking waitpid()
Sep 14 15:32:16 demo wickedd[985]: ni_process_reap: process 22557 has not exited yet; now doing a blocking waitpid()
Sep 14 15:34:41 demo wickedd[985]: ni_process_reap: process 19040 has not exited yet; now doing a blocking waitpid()
Sep 14 15:47:04 demo wickedd[985]: ni_process_reap: process 4441 has not exited yet; now doing a blocking waitpid()
Sep 14 15:57:26 demo wickedd[985]: ni_process_reap: process 30766 has not exited yet; now doing a blocking waitpid()
What information do you need from me to debug this??
It's about the handle_hangup callback called on poll(2) hangup event:
POLLHUP
Hang up (only returned in revents; ignored in events). Note that when reading from
a channel such as a pipe or a stream socket, this event merely indicates that the
peer closed its end of the channel. Subsequent reads from the channel will return
0 (end of file) only after all outstanding data in the channel has been consumed.
In src/process.c, there is a:
rv = waitpid(pi->pid, &pi->status, WNOHANG);
if (rv == 0) {
/* This is an ugly workaround. Sometimes, we seem to get a hangup on the socket even
* though the script (provably) still has its end of the socket pair open for writing. */
ni_error("%s: process %u has not exited yet; now doing a blocking waitpid()",
__func__, pi->pid);
rv = waitpid(pi->pid, &pi->status, 0);
}
I've pushed some test code to https://github.com/mtomaschewski/wicked/tree/pipe-test branch which may be helpful to narrow down what is causing it -- it shows the command that were executed and there is a skeleton in testing using a very simple ni_process exec.
# journalctl | grep ni_process_reap
# journalctl | wc -l
114384
I didn't found it in the logs on my machines. As you're getting them, could you please provide the logs with the command?
Here's what I see (so far) after applying the patch and rebooting:
Sep 15 13:07:38 demo wickedd[5818]: ni_process_reap: process 5822 (/etc/wicked/extensions/firewall up) has not exited yet; now doing a blocking waitpid()
Sep 15 13:07:38 demo wickedd[5818]: ni_process_reap: process 5822 (/etc/wicked/extensions/firewall up) reaped in blocking waitpid after 0m0.000s
Sep 15 13:07:47 demo wickedd[5818]: ni_process_reap: process 7890 (/etc/wicked/extensions/netconfig batch) has not exited yet; now doing a blocking waitpid()
Sep 15 13:07:47 demo wickedd[5818]: ni_process_reap: process 7890 (/etc/wicked/extensions/netconfig batch) reaped in blocking waitpid after 0m0.000s
Sep 15 13:30:15 demo wickedd[973]: ni_process_reap: process 4533 (/etc/wicked/extensions/netconfig batch) has not exited yet; now doing a blocking waitpid()
Sep 15 13:30:15 demo wickedd[973]: ni_process_reap: process 4533 (/etc/wicked/extensions/netconfig batch) reaped in blocking waitpid after 0m0.000s
Sep 15 13:32:36 demo wickedd[973]: ni_process_reap: process 30420 (/etc/wicked/extensions/netconfig batch) has not exited yet; now doing a blocking waitpid()
Sep 15 13:32:36 demo wickedd[973]: ni_process_reap: process 30420 (/etc/wicked/extensions/netconfig batch) reaped in blocking waitpid after 0m0.000s
Looking at the code and noting that the wait times above are all less that one millisecond, it appears that the bug in the code is simply a mistaken assumption: Just because a socket that is connected to process X has been closed, that doesn't mean that process X has completed exiting.
So it's possible that the first invocation of wait()
will be too soon; if so, instead of complaining, the code should wait for longer for the process to exit (say, up to 100ms?) before then complaining.
mistaken assumption
Yes. A process can close it's fd's and still do something (e.g. cleanup) before it exits. Further, it can also be, that the process already exited, just the kernel didn't cleaned up the process resources yet and is not ready to report the exit status yet. We use the ni_process functions for many things, to send or to receive some data (or nothing), so it's a quite important place with several use-cases. And "almost always" we need the data + error code at the end.
(When you look into e.g. extensions/firewall
and schema/firewall.xml
, you'll see that we send data to the scripts, in firewall case it's the <firewall><zone>foobar</zone></firewall>
config node that is passed as <arguments><config><zone>foobar</zone></config></arguments>
)
instead of complaining, the code should wait for longer for the process to exit (say, up to 100ms?) before then complaining.
Yes, I'd tend keep it as is (no fixed "deadline" or e.g. 100ms what is already "very long time"), but to change the error msg to debug, call blocking waitpid as we did until now and then either log debug message again (so we see in debug how long it took) or when it took 1sec or more, log this as error or warning?
Better would be to carefully change the code to use a signal handler (use signalfd, which allows to poll them too) and finally call the notify_callback, but this would be much more intrusive than the above and needs extensive tests.
instead of complaining, the code should wait for longer for the process to exit (say, up to 100ms?) before then complaining.
Yes, I'd tend keep it as is (no fixed "deadline" or e.g. 100ms what is already "very long time"), but to change the error msg to debug, call blocking waitpid as we did until now and then either log debug message again (so we see in debug how long it took) or when it took 1sec or more, log this as error or warning?
Yes I think we're saying more or less the same thing. Your approach is slightly more airtight because it guarantees the process has fully exited. But either way, we would define some "error threshold" of 1 second or whatever that would be considered abnormal and trigger a warning, whereas below this value no warning wold be generated.
Thanks.
Yes I think we're saying more or less the same thing.
Yes, I think so too.
But either way, we would define some "error threshold" of 1 second or whatever that would be considered abnormal and trigger a warning, whereas below this value no warning wold be generated.
Yes, when you prefer we can keep it an error. When we run into cases where it needs >=1s to get the exit status, there is something fishy and it's needed to take a much closer look at all this (rewrite to use signalfd IMO).
Thanks!
Hello!
Have Leap 42.3 installed on VirtualBox.
Wicked version: 0.6.40.
Multiple ni_process_reap: issue in logs, such like:
journalctl -e _PID=1188 command gives no result. So, I can't catch what the problem process (1176, 1188, etc) is.
Can you, please, give a hint what
ni_process_reap:
is and how to solve it, if it is problem indeed?Thanks! wickedd_0_6_40_ni_process_reap.txt