ramr / go-reaper

Process (Grim) reaper library for golang - this is useful for cleaning up zombie processes inside docker containers (which do not have an init process running as pid 1).
MIT License

Question: Does a race condition exist in the "Into the woods" implementation? #22

Closed: ajones-miovision closed this issue 1 year ago

ajones-miovision commented 1 year ago

I may be wrong here, but is there a race condition in the "Into the woods" implementation? In the parent we call wait4 on the child pid we forked, as a way to block so that we don't kill pid 1 until our forked process exits. Great. However, when that original child eventually dies, it is possible (depending on context switching within the Go runtime) that the wait4 in reapChild (line 61) handles that exit before the original wait4 in the main thread. In that case I think pid 1 will never exit, because it will be waiting on a pid that has already been reaped by the reaper. Thoughts?

ajones-miovision commented 1 year ago

OK, so I wrote a small C program that tries to replicate the issue, and it seems both waits return, so my assumption that one would get stuck waiting forever is false. Both return, but only one receives the exit status.

ramr commented 1 year ago

Cool - am glad you figured it out.

Yeah, in your example, with the parent's wait[4] code executing after the child exits (note: exited but not yet reaped), the exit of the child process leaves it in a "waitable" state, so a subsequent wait* call returns the child's status information.

The *nix rationale is something akin to: a parent should be free to do some other work ("chores") whilst the child is playing, and come back and clean up afterwards!

Note that if the child process is never wait[ed] on, aka never cleaned up, you'll end up with a zombie process.

And also note that wait [wait{2,4} are wrappers] could return if the child process has changed "state". In your case that is termination, but it could well be a suspend-continuation workflow, aka SIGSTOP, SIGTSTP, SIGCONT. Hence the check for ECHILD if the syscall wasn't interrupted.