fixes #5091; Ensure we don't wait on an exited process on Linux

nim-lang / Nim

Nim is a statically typed compiled systems programming language. It combines successful concepts from mature languages like Python, Ada and Modula. Its design focuses on efficiency, expressiveness, and elegance (in that order of priority).

https://nim-lang.org

Other

16.22k stars 1.46k forks source link

fixes #5091; Ensure we don't wait on an exited process on Linux #23743

Open maleyva1 opened 1 week ago

maleyva1 commented 1 week ago

Fixes #5091.

Ensure we don't wait on an exited process on Linux

alex65536 commented 4 days ago

I think that the proposed fix is kinda racy, because if the process is terminated between your check and blocking the signals, then we will not receive SIGCHLD for it and wait until the timeout expires.

A better approach would be to check for running() after SIGCHLD is blocked, I think.

alex65536 commented 4 days ago

Also, the fix seems to be incomplete. It doesn't make the situation worse, but there are still cases when wait() will hang. For example, when we have multiple threads, it is not guaranteed that we will receive SIGCHLD in the correct thread and we will hang because of this. Or, if we wait for more than one process simultaneously from multiple threads. Though, I haven't tried these scenarios myself and got these ideas just by looking at the code.

maleyva1 commented 3 days ago

I think that the proposed fix is kinda racy, because if the process is terminated between your check and blocking the signals, then we will not receive SIGCHLD for it and wait until the timeout expires.

A better approach would be to check for running() after SIGCHLD is blocked, I think.

That was my initial approach, but I don't see much difference between having the check after the pthread_sigmask or before. The expectation is that calling waitForExit on a terminated process should not wait. If the process is not yet terminated when waitForExit is invoked, we should wait (up to timeout) for the process' termination.

Invoking running() updates p's internal state if terminated, and allows the code already there to handle processes that have already terminated.

Also, the fix seems to be incomplete. It doesn't make the situation worse, but there are still cases when wait() will hang. For example, when we have multiple threads, it is not guaranteed that we will receive SIGCHLD in the correct thread and we will hang because of this. Or, if we wait for more than one process simultaneously from multiple threads. Though, I haven't tried these scenarios myself and got these ideas just by looking at the code.

The std/osproc library does not use wait, it uses waitpid which specifically requires a PID to wait on. Since PIDs are unique, I don't see how waitpid will not receive a change in the process' state. It is also invoked with WNOHANG which ensures it is non-blocking.

alex65536 commented 3 days ago

That was my initial approach, but I don't see much difference between having the check after the pthread_sigmask or before. The expectation is that calling waitForExit on a terminated process should not wait. If the process is not yet terminated when waitForExit is invoked, we should wait (up to timeout) for the process' termination.

If the process terminates between your check and blocking the signals, then we will wait for the whole timeout even if the process had already terminated, as in the original bug.

The std/osproc library does not use wait, it uses waitpid which specifically requires a PID to wait on. Since PIDs are unique, I don't see how waitpid will not receive a change in the process' state. It is also invoked with WNOHANG which ensures it is non-blocking.

The code path without timeout uses waitpid() indeed. But, when there is a timeout, then waiting is implemented by blocking SIGCHLD and waiting synchronously until we receive it. In this case, all my concerns above apply, and waiting for more than one process simultaneously will result in a bug.

maleyva1 commented 2 days ago

If the process terminates between your check and blocking the signals, then we will wait for the whole timeout even if the process had already terminated, as in the original bug.

I understand, but your example assumes the process has not terminated before the invocation of waitForExit which is different from the original bug. More explicitly, consider two processes, one running and one terminated:

var running: Process
var terminated: Process

An invocation waitForExit(terminated) should return immediately. Meanwhile, an invocation waitForExit(running) should wait up to timeout. In your hypothetical, we are dealing with running and not with terminated at the point of invocation. So supposing that running terminates between the check and the sigprocmask (or the pthread_sigmask) within waitForExit is not the same as the original bug.

The code path without timeout uses waitpid() indeed. But, when there is a timeout, then waiting is implemented by blocking SIGCHLD and waiting synchronously until we receive it. In this case, all my concerns above apply, and waiting for more than one process simultaneously will result in a bug.

I understand your concern, but I think it's only valid if we were using wait. Since we are using waitpid, even if a SIGCHLD is broadcasted for a different process than the one we are waiting on, the subsequent invocation of waitpid ensures we check that the SIGCHLD we received is for the process we are waiting on. Furthermore, even supposing two or more processes terminated at the same time, subsequent invocations to waitpid ensures that the PID we're waiting on has terminated.

alex65536 commented 11 hours ago

I understand, but your example assumes the process has not terminated before the invocation of waitForExit which is different from the original bug.

Maybe the bug is a bit different, but the problem remains the same: if the process terminates right before sigprocmask is called, then we will have to wait for the full timeout instead of returning immediately after calling sigprocmask.

I understand your concern, but I think it's only valid if we were using wait. Since we are using waitpid, even if a SIGCHLD is broadcasted for a different process than the one we are waiting on, the subsequent invocation of waitpid ensures we check that the SIGCHLD we received is for the process we are waiting on.

In case of timed waitForExit(), we don't wait on waitpid(), we use it in non-blocking mode only to ensure that the process has terminated. Instead, we block on sigtimedwait(). If we don't receive SIGCHLD because another thread has received it, then our sigtimedwait() will block until the timeout expires. Thus, for example, if the program terminates in 1 second after calling waitForExit() and timeout is 1000 seconds, then we will wait for 1000 seconds instead of 1 because of missed SIGCHLD. And no, waitpid() doesn't save us from such a long wait, because we block on sigtimewait() and cannot really unblock earlier because we have missed the signal.

maleyva1 commented 48 minutes ago

Maybe the bug is a bit different, but the problem remains the same: if the process terminates right before sigprocmask is called, then we will have to wait for the full timeout instead of returning immediately after calling sigprocmask.

I think understand where your coming from. This current change is reaping the zombie processes too early. It's not the same issue, but it is inefficient to wait the whole timeout milliseconds if we can easily not do that. By reaping just after blocking the signal, we can prevent a full timeout wait. Moving the check just after sigprocmask should address both issues.

In case of timed waitForExit(), we don't wait on waitpid(), we use it in non-blocking mode only to ensure that the process has terminated. Instead, we block on sigtimedwait(). If we don't receive SIGCHLD because another thread has received it, then our sigtimedwait() will block until the timeout expires. Thus, for example, if the program terminates in 1 second after calling waitForExit() and timeout is 1000 seconds, then we will wait for 1000 seconds instead of 1 because of missed SIGCHLD. And no, waitpid() doesn't save us from such a long wait, because we block on sigtimewait() and cannot really unblock earlier because we have missed the signal.

I think I may have misunderstood you. When you originally said "hang", I thought you meant it blocks indefinitely. Mea culpa!

I agree that we will wait timeout if the SIGCHLD is delivered to another thread than ours, but I don't see an optimal way around this on Linux. A naive approach is to busy wait on waitpid() up to timeout, but I'm not sure that's better.