Open maleyva1 opened 1 week ago
I think that the proposed fix is kinda racy, because if the process is terminated between your check and blocking the signals, then we will not receive SIGCHLD for it and wait until the timeout expires. A better approach would be to check for running() after SIGCHLD is blocked, I think.
Also, the fix seems to be incomplete. It doesn't make the situation worse, but there are still cases when wait() will hang. For example, when we have multiple threads, it is not guaranteed that we will receive SIGCHLD in the correct thread and we will hang because of this. Or, if we wait for more than one process simultaneously from multiple threads. Though, I haven't tried these scenarios myself and got these ideas just by looking at the code.
I think that the proposed fix is kinda racy, because if the process is terminated between your check and blocking the signals, then we will not receive SIGCHLD for it and wait until the timeout expires. A better approach would be to check for running() after SIGCHLD is blocked, I think.
That was my initial approach, but I don't see much difference between having the check after the pthread_sigmask or before. The expectation is that calling waitForExit on a terminated process should not wait. If the process is not yet terminated when waitForExit is invoked, we should wait (up to timeout) for the process' termination.
Invoking running() updates p's internal state if terminated, and allows the code already there to handle processes that have already terminated.
Also, the fix seems to be incomplete. It doesn't make the situation worse, but there are still cases when wait() will hang. For example, when we have multiple threads, it is not guaranteed that we will receive SIGCHLD in the correct thread and we will hang because of this. Or, if we wait for more than one process simultaneously from multiple threads. Though, I haven't tried these scenarios myself and got these ideas just by looking at the code.
The std/osproc library does not use wait, it uses waitpid which specifically requires a PID to wait on. Since PIDs are unique, I don't see how waitpid will not receive a change in the process' state. It is also invoked with WNOHANG which ensures it is non-blocking.
That was my initial approach, but I don't see much difference between having the check after the pthread_sigmask or before. The expectation is that calling waitForExit on a terminated process should not wait. If the process is not yet terminated when waitForExit is invoked, we should wait (up to timeout) for the process' termination.
If the process terminates between your check and blocking the signals, then we will wait for the whole timeout even if the process had already terminated, as in the original bug.
The std/osproc library does not use wait, it uses waitpid which specifically requires a PID to wait on. Since PIDs are unique, I don't see how waitpid will not receive a change in the process' state. It is also invoked with WNOHANG which ensures it is non-blocking.
The code path without timeout uses waitpid() indeed. But, when there is a timeout, then waiting is implemented by blocking SIGCHLD and waiting synchronously until we receive it. In this case, all my concerns above apply, and waiting for more than one process simultaneously will result in a bug.
If the process terminates between your check and blocking the signals, then we will wait for the whole timeout even if the process had already terminated, as in the original bug.
I understand, but your example assumes the process has not terminated before the invocation of waitForExit, which is different from the original bug. More explicitly, consider two processes, one running and one terminated:
var running: Process
var terminated: Process
An invocation waitForExit(terminated) should return immediately. Meanwhile, an invocation waitForExit(running) should wait up to timeout. In your hypothetical, we are dealing with running and not with terminated at the point of invocation. So supposing that running terminates between the check and the sigprocmask (or the pthread_sigmask) within waitForExit is not the same as the original bug.
The code path without timeout uses waitpid() indeed. But, when there is a timeout, then waiting is implemented by blocking SIGCHLD and waiting synchronously until we receive it. In this case, all my concerns above apply, and waiting for more than one process simultaneously will result in a bug.
I understand your concern, but I think it's only valid if we were using wait. Since we are using waitpid, even if a SIGCHLD is broadcast for a different process than the one we are waiting on, the subsequent invocation of waitpid ensures we check that the SIGCHLD we received is for the process we are waiting on. Furthermore, even supposing two or more processes terminated at the same time, subsequent invocations of waitpid ensure that the PID we're waiting on has terminated.
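To make the re-check concrete, here is a minimal C sketch at the POSIX level. It is not the std/osproc code: `wait_for_pid`, `ms_until`, `sleep_ms` and `demo_two_children` are hypothetical names, and it assumes Linux and a single-threaded caller. Any child's SIGCHLD wakes the waiter, which then asks `waitpid` about its own PID and, if that child is still running, goes back to sleep for the remaining time.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Milliseconds left until `deadline` on the monotonic clock (may be <= 0). */
static long ms_until(const struct timespec *deadline)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return (deadline->tv_sec - now.tv_sec) * 1000L
         + (deadline->tv_nsec - now.tv_nsec) / 1000000L;
}

static void sleep_ms(long ms)
{
    struct timespec t = { ms / 1000, (ms % 1000) * 1000000L };
    nanosleep(&t, NULL);
}

/* Hypothetical helper: 1 if `pid` was reaped, 0 on timeout, -1 on error. */
int wait_for_pid(pid_t pid, int timeout_ms, int *status)
{
    assert(timeout_ms >= 0);

    struct timespec deadline;
    clock_gettime(CLOCK_MONOTONIC, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    sigset_t chld, old;
    sigemptyset(&chld);
    sigaddset(&chld, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &chld, &old) == -1)
        return -1;

    pid_t r;
    for (;;) {
        r = waitpid(pid, status, WNOHANG);   /* is *our* child done? */
        if (r != 0)
            break;                           /* reaped, or an error */
        long left = ms_until(&deadline);
        if (left <= 0)
            break;                           /* timed out */
        struct timespec ts = { left / 1000, (left % 1000) * 1000000L };
        (void)sigtimedwait(&chld, NULL, &ts); /* woken by any child's SIGCHLD */
    }

    sigprocmask(SIG_SETMASK, &old, NULL);
    return r == pid ? 1 : (r == 0 ? 0 : -1);
}

/* Usage sketch: an unrelated child exits first, the target exits later. */
int demo_two_children(void)
{
    pid_t other = fork();
    if (other == 0) { sleep_ms(100); _exit(1); }
    pid_t target = fork();
    if (target == 0) { sleep_ms(400); _exit(9); }
    assert(other > 0 && target > 0);

    int status = 0;
    int rc = wait_for_pid(target, 5000, &status);
    waitpid(other, NULL, 0);                 /* reap the unrelated child */
    return (rc == 1 && WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
}
```

The loop only helps when the waiting thread actually receives a SIGCHLD. If another thread consumes the signal first, this sketch still sleeps until its deadline.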
I understand, but your example assumes the process has not terminated before the invocation of waitForExit which is different from the original bug.
Maybe the bug is a bit different, but the problem remains the same: if the process terminates right before sigprocmask is called, then we will have to wait for the full timeout instead of returning immediately after calling sigprocmask.
I understand your concern, but I think it's only valid if we were using wait. Since we are using waitpid, even if a SIGCHLD is broadcast for a different process than the one we are waiting on, the subsequent invocation of waitpid ensures we check that the SIGCHLD we received is for the process we are waiting on.
In case of timed waitForExit(), we don't wait on waitpid(), we use it in non-blocking mode only to ensure that the process has terminated. Instead, we block on sigtimedwait(). If we don't receive SIGCHLD because another thread has received it, then our sigtimedwait() will block until the timeout expires. Thus, for example, if the program terminates 1 second after calling waitForExit() and timeout is 1000 seconds, then we will wait for 1000 seconds instead of 1 because of the missed SIGCHLD. And no, waitpid() doesn't save us from such a long wait, because we block on sigtimedwait() and cannot really unblock earlier because we have missed the signal.
Maybe the bug is a bit different, but the problem remains the same: if the process terminates right before sigprocmask is called, then we will have to wait for the full timeout instead of returning immediately after calling sigprocmask.
I think I understand where you're coming from. This current change is reaping the zombie processes too early. It's not the same issue, but it is inefficient to wait the whole timeout milliseconds if we can easily not do that. By reaping just after blocking the signal, we can prevent a full timeout wait. Moving the check just after sigprocmask should address both issues.
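For reference, a minimal C sketch of that ordering, assuming Linux and a single-threaded caller. `wait_with_timeout` and `demo_already_exited` are hypothetical names, not the std/osproc implementation. SIGCHLD is blocked first and the non-blocking `waitpid` check comes second. A child that exits before the block is caught by the check, and one that exits after it leaves SIGCHLD pending for `sigtimedwait`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: 1 if `pid` was reaped (exit status in *status),
 * 0 if it is still running after the wait, -1 on error.
 * Simplification: it sleeps at most once, so a SIGCHLD from an
 * unrelated child ends the wait early with 0. */
int wait_with_timeout(pid_t pid, int timeout_ms, int *status)
{
    assert(timeout_ms >= 0);

    sigset_t chld, old;
    sigemptyset(&chld);
    sigaddset(&chld, SIGCHLD);

    /* 1. Block SIGCHLD first: an exit after this point stays pending. */
    if (sigprocmask(SIG_BLOCK, &chld, &old) == -1)
        return -1;

    /* 2. Only now check whether the child has already exited. */
    pid_t r = waitpid(pid, status, WNOHANG);
    if (r == 0) {
        /* 3. Still running: sleep until SIGCHLD or the timeout, then re-check. */
        struct timespec ts = { timeout_ms / 1000,
                               (timeout_ms % 1000) * 1000000L };
        (void)sigtimedwait(&chld, NULL, &ts);
        r = waitpid(pid, status, WNOHANG);
    }

    sigprocmask(SIG_SETMASK, &old, NULL);
    return r == pid ? 1 : (r == 0 ? 0 : -1);
}

/* Usage sketch: the child has already exited when the wait starts. */
int demo_already_exited(void)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(7);                            /* child exits immediately */
    assert(pid > 0);

    struct timespec pause = { 0, 200 * 1000000L };
    nanosleep(&pause, NULL);                 /* give the child time to exit */

    int status = 0;
    int rc = wait_with_timeout(pid, 5000, &status);
    return (rc == 1 && WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
}
```

In a multithreaded program `pthread_sigmask` would replace `sigprocmask`, whose behaviour POSIX leaves unspecified there.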
In case of timed waitForExit(), we don't wait on waitpid(), we use it in non-blocking mode only to ensure that the process has terminated. Instead, we block on sigtimedwait(). If we don't receive SIGCHLD because another thread has received it, then our sigtimedwait() will block until the timeout expires. Thus, for example, if the program terminates 1 second after calling waitForExit() and timeout is 1000 seconds, then we will wait for 1000 seconds instead of 1 because of the missed SIGCHLD. And no, waitpid() doesn't save us from such a long wait, because we block on sigtimedwait() and cannot really unblock earlier because we have missed the signal.
I think I may have misunderstood you. When you originally said "hang", I thought you meant it blocks indefinitely. Mea culpa!
I agree that we will wait for the full timeout if the SIGCHLD is delivered to a thread other than ours, but I don't see an optimal way around this on Linux. A naive approach is to busy wait on waitpid() up to timeout, but I'm not sure that's better.
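For comparison, a rough C sketch of that naive approach, written as polling with a short sleep between checks rather than a pure spin. `poll_for_exit` and `demo_poll` are hypothetical names and the interval is an arbitrary assumption. It does not depend on which thread receives SIGCHLD, but it wakes up on every interval and can notice the exit up to one interval late.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: 1 if `pid` was reaped, 0 on timeout, -1 on error.
 * Elapsed time is approximated by counting naps, not read from a clock. */
int poll_for_exit(pid_t pid, int timeout_ms, int interval_ms, int *status)
{
    assert(timeout_ms >= 0 && interval_ms > 0);

    for (int waited = 0; ; waited += interval_ms) {
        pid_t r = waitpid(pid, status, WNOHANG);  /* never blocks */
        if (r == pid)
            return 1;
        if (r == -1)
            return -1;
        if (waited >= timeout_ms)
            return 0;
        struct timespec nap = { interval_ms / 1000,
                                (interval_ms % 1000) * 1000000L };
        nanosleep(&nap, NULL);
    }
}

/* Usage sketch: the child exits after roughly 150 ms with status 3. */
int demo_poll(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        struct timespec work = { 0, 150 * 1000000L };
        nanosleep(&work, NULL);
        _exit(3);
    }
    assert(pid > 0);

    int status = 0;
    int rc = poll_for_exit(pid, 5000, 10, &status);
    return (rc == 1 && WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
}
```

The trade-off is periodic wakeups for the whole wait, which is presumably part of why it may not be better.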
Fixes #5091.
Ensure we don't wait on an exited process on Linux