oconnor663 / duct.rs

a Rust library for running child processes
MIT License

When the right half of a pipe fails to spawn, kill and await the left half? #75

Closed · oconnor663 closed this issue 4 years ago

oconnor663 commented 4 years ago

This is a tricky issue that I've been thinking about for a long time, and none of our options here are really ideal. To set the stage for this question, consider this line:

let handle = cmd!("foo").start()?;

What happens if foo fails to start here? Almost certainly, this needs to return an error. (Otherwise we wouldn't even use the ?.) foo might be a long-running background process, which the caller might never get around to awaiting, and it's really important that an obvious error like misspelling the command name doesn't pass silently. Ok so far.
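This failure mode is easy to see even with plain std::process, which duct wraps. A minimal sketch (the program name here is deliberately made up so the spawn fails):

```rust
use std::process::Command;

fn main() {
    // Spawning a misspelled or missing program fails immediately.
    // This is why start() returns a Result: the caller may never get
    // around to awaiting a long-running background child, so an obvious
    // error like a bad program name must surface right away.
    let result = Command::new("definitely-not-a-real-program-xyz").spawn();
    assert!(result.is_err());
    eprintln!("spawn failed as expected: {}", result.unwrap_err());
}
```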

Now, consider this caller:

let handle = cmd!("foo").pipe(cmd!("bar")).start()?;

If foo fails to start here, the situation is the same as before. We'll probably just return an error. But what if bar fails to start? In that case, foo is already running. If we return an error, whose job is it to await foo? We could try to stash the Handle for foo inside the error value itself, but realistically the vast majority of callers aren't going to write any special code to deal with this situation, and foo is going to become a zombie. It would be much better if duct itself could wait on foo to guarantee that it's done properly. But that raises the additional problem that foo isn't guaranteed to exit. Most programs will exit if they hit a pipe write error, but some programs will suppress that error, and others just might not try to write anything for a while. The best compromise is probably to explicitly kill foo and then wait on it. (The behavior of duct::Handle::kill and std::process::Child::kill is to send SIGKILL on Unix platforms, and that's what we would use.)
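That compromise can be sketched with plain std::process rather than duct's internals. The two-child pipe setup and the `start_pipe` helper below are illustrative, not duct's actual code, but the kill-then-wait cleanup on the error path is the strategy being proposed:

```rust
use std::io;
use std::process::{Child, Command, Stdio};

// Start `left | right`. If the right half fails to spawn, kill the
// already-running left half and reap it before returning the error,
// so the caller never inherits a zombie.
fn start_pipe(left: &str, right: &str) -> io::Result<(Child, Child)> {
    let mut left_child = Command::new(left).stdout(Stdio::piped()).spawn()?;
    let left_stdout = left_child.stdout.take().expect("stdout was piped");
    match Command::new(right).stdin(Stdio::from(left_stdout)).spawn() {
        Ok(right_child) => Ok((left_child, right_child)),
        Err(e) => {
            // The right half failed. Child::kill sends SIGKILL on Unix;
            // then wait() reaps the left half. Errors here are ignored
            // in favor of propagating the original spawn error.
            let _ = left_child.kill();
            let _ = left_child.wait();
            Err(e)
        }
    }
}
```

The key property is that `start_pipe` never blocks indefinitely on the happy path and never returns an error while a stray child is still the caller's problem.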

Now, this is going to be a controversial choice. I'm not happy about it. Killing a program that the user didn't explicitly ask us to kill is unexpected. Some programs try to catch signals like SIGTERM to perform cleanup, and they might misbehave under SIGKILL. But that said, not killing the child might be worse:

Consider the situations where this issue is likely to come up. Why would a child fail to spawn? There are two likely scenarios. One is the "simple mistake" scenario. Maybe the programmer misspelled the program name, or maybe the system is misconfigured and the program isn't installed in the right place. In this case, the important thing is that we return an error, so that the human can intervene and fix things. But the other is the "resource exhaustion" scenario. The system is under heavy load, and calls to spawn are returning errors because the PID space is full. In this case, it's very important that we don't leak zombie children and make the problem worse. If we want to avoid leaving a zombie, our only two choices are 1) wait for the child to exit on its own, or 2) kill the child. Waiting for the child to exit on its own is unacceptable; start isn't supposed to block, and blocking might create a deadlock by preventing the parent from doing some subsequent work that the child is itself blocked on, like closing an input pipe. Creating deadlocks in correct programs is even worse than leaking a zombie. Killing the child is our only other option. (Note that it's still possible for waiting to take a long time even after sending SIGKILL, for example if the child was in the middle of a long-running system call to something like a FUSE filesystem that might be blocked on the network. But when uninterruptible system calls are taking forever, it's not really our fault that we can't make progress, and I don't know of any workaround for that situation.)
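The deadlock hazard above can be made concrete with std::process: a child like `cat` blocks on its stdin until the parent closes the write end of the pipe, so if starting the child also blocked waiting for it to exit, the parent would never reach the close and neither side could make progress. A sketch of the order that works, with the hazard marked in comments:

```rust
use std::io::Write;
use std::process::{Command, Stdio};

fn main() -> std::io::Result<()> {
    // `cat` copies stdin to stdout and exits only on EOF. If we blocked
    // in wait() *before* dropping the write end of the stdin pipe, the
    // child would wait forever for EOF while we waited forever for exit.
    let mut child = Command::new("cat")
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    {
        let mut stdin = child.stdin.take().expect("stdin was piped");
        stdin.write_all(b"hello")?;
    } // dropping stdin here closes the pipe and delivers EOF to the child
    let output = child.wait_with_output()?; // only now is it safe to block
    assert_eq!(output.stdout, b"hello");
    Ok(())
}
```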

Furthermore, in the modern world, every program has to expect that it might receive SIGKILL. The Linux OOM killer, for example, might kill any process it wishes when memory pressure gets bad enough. Or the power might just go out all of a sudden and bring down the entire machine. If a program has high correctness requirements, but it's known to behave incorrectly when it dies abruptly, then it's in a very tricky situation. It requires specific global OS configurations like disabling the OOM killer, and specific hardware like UPS units. A convenience library like duct isn't useful for that sort of software, and it doesn't make sense to design around it. SIGKILL-intolerant programs are much more likely to be that way because they simply have bugs, and it's better to make a buggy program worse than to make a correct program buggy.

(As a thought exercise, what hypothetical OS feature could we invent to save us in this scenario? It would have to be something like "transactional child spawning". Basically we'd need to start multiple child programs into some sort of paused-before-main state, such that all their executables had been found and whatever PID and memory allocations had been made, and then we could unpause them all together without the possibility that unpause might return an error. This would be kind of neat, but I can't think of a system that realistically needs this feature, and I can't imagine any OS really implementing it.)

So anyway, all that said, I'm leaning towards "kill foo and wait" as a strategy to allow start to always return an error immediately. I acknowledge that that could cause problems with some child programs, but I think those problems would be either the child program's "fault" (buggy cleanup, unawareness of the OOM killer) or the OS's "fault" (uninterruptible system calls that block forever) and not the sort of thing that duct should work around. But I'd like to get more feedback on this idea from more experienced system administrators. What would people like a pipe expression to do, when the right half fails to start but the left half doesn't exit?