Closed faern closed 7 years ago
Oh cool! Agreed that they're very similar use cases. Some scattered thoughts along the lines you mentioned:
shared_child
talks about in its README, the same one that motivated the standard lib to put &mut
on those methods. I think you either have to call libc::waitid
or handle SIGCHLD
to work around that race. But it's a pretty wacky race condition, so I could see some projects being ok with the idea that it'll probably never happen. (I think it's only really likely if you've got ~32k processes running at the same time, so that any free PID is likely to get reused immediately. But then you probably have bigger problems with not being able to spawn children at all :p)kill
function can send a kill signal even if the process was waited on a long time ago. In the branch where it doesn't get the mutex, it looks like it doesn't check whether the process is still alive. (In the branch where it calls Child::kill
, that method will check for you though, as long as you keep using Child::wait
to do the waiting.) That's much more likely to go wrong than the race condition above, though I think you can fix it by just setting a flag after wait succeeds, and checking it before you kill. (shared_child
keeps the Exited
state for this reason.)
DuplicateHandle
, and I think the original is going to get closed after wait
succeeds, kind of like a file descriptor would?kill
twice from two different threads at the same time. The second kill might fail to acquire the lock (because the first kill is holding it for a split second), and assume it conflicting with a wait, when in fact it's not. Sending two signals in parallel might be ok, but it seems like it's probably not what you meant to happen. shared_child
decided to use Condvar
instead of try_lock
, to avoid precisely this sort of issue where a simultaneous kill is indistinguishable from a blocking wait.WouldBlock
case and the Poisoned
case when you call try_lock
. The latter should probably result in a panic?stdin
/stdout
/stderr
methods take &mut self
, but I don't think they need to?Actually I'm not sure setting a "this child has already exited flag" will be good enough in your API. I could call new
with a child that's already been waited on, and the CloneableChild
would then be in the wrong state. Triggering the custom kill
path (like by calling kill on two threads at once) would then always try to kill the PID, even if there was normally bookkeeping in place to prevent that.
I just shipped some changes to always use Child::wait
and Child::kill
under the hood, to avoid a similar issue after the child is returned from my into_inner
. (Though I still avoid accepting children from the caller -- instead I spawn
them myself.)
Thanks for the detailed explanation. At first I thought about patching up clonablechild
as well, but I realized that was not the best solution. The best IMO would be to not even have two crates solve the same, very small, problem. So I yanked all versions of clonablechild
and added a deprecation notice in the README.
Just out of curiosity, you never saw clonablechild
before publishing this, or what made you create a new crate instead of improving the existing one?
Sadly I never saw it. There was some discussion about the problem on the try_wait
tracking issue, and I guess I assumed that if anyone on that thread knew about an existing crate they would've mentioned it there.
Will close for now, but let me know if anything changes. I'd be happy to work together to port fixes.
Hi,
Just saw this crate, and notice that it is virtually doing the same as my
clonablechild
crate. Just wanting to check in if there is any difference that I miss?Since it's a very isolated and small problem it should more or less only be one correct solution I think. Thus we could maybe combine forces? If your library does stuff better I will remove mine.
I also put the standard
Child
in aMutex
. But where you seem to use a special procedure to wait I use the built in wait. And where you use the built in kill method I have my own kill implementation for when theMutex
is already taken.I probably have the reused pid race condition you talk about in your README.