Handle SIGCHLD correctly

a1291762 commented 1 year ago

SIGCHLD is not guaranteed to be sent for all children, especially if 2 children die at the same time.

Instead of trying to find the specific child that died, just reap all of the async processes that have finished running.

This removes a hang observed when downloading an image over the network. 5 wget processes are started, and some of them finished close enough together that only a single SIGCHLD was delivered to pb-discover.

Signed-off-by: Lincoln Ramsay lincoln.ramsay@opengear.com

jk-ozlabs commented 1 year ago

Sounds good. Since we're no longer reading the pid from sigchld_pipe, we should probably drain this separately, so the SIGCHLD handler doesn't (eventually) block on a write.

This may warrant changing what we're actually writing to the pipe, since we no longer need the pid...

a1291762 commented 1 year ago

Hmm. Good point about the pipe. I'll restore the read to this patch.

I won't change the write/read of a pid_t since I doubt it makes much difference (compared to say, a single char).

jk-ozlabs commented 1 year ago

Looks good; just one open item though: the read() will consume one write() from the signal handler at a time (and trigger a waitpid() on all current children). We might have multiple write()s happen before we enter sigchld_pipe_event, and so more than one read() to perform.

Overall, this just means we'll re-enter sigchld_pipe_event after the next poll - we could avoid that by totally draining the sigchld_pipe instead of doing the single read.

I'm okay with either option - but if you prefer the drain, I can wait for an update.

a1291762 commented 1 year ago

Hmm... Yeah, I see what you mean. I guess it is rare to enter sigchld_pipe_event after multiple writes anyway? I think we only see it because we have a single, slow, ARM core are we're spawning several downloads in parallel.

To be honest, I'm not sure how much point there is in trying to avoid calling this function when there's nothing to reap. Maybe something "best effort" like this?

    /* Drain the pipe (there may be multiple queued pids) */
    do {
        rc = read(procset->sigchld_pipe[0], &pid, sizeof(pid));
    } while (rc == sizeof(pid));

This has no error checking, but the previous code didn't have error checking either. In the event of a transient error (eg. EAGAIN) there will be another pipe event because the pipe still contains data to read. The primary difference here is that child reaping happens on this iteration rather than the next one. Even on our slow ARM core, I wouldn't expect to see an observable performance delta due to this change but maybe it makes sense from a "correctness" perspective?

jk-ozlabs commented 1 year ago

Yeah, it's all fairly trivial anyway; there's no harm in calling sigchld_pipe_event multiple times, aside from causing a few extra syscalls.

Your looping read will end up blocking though - if we do anything about this, maybe just a larger buffer size, and permit shorter reads?

I'm fine with the current code though - let me know if you'd prefer a merge as-is.

a1291762 commented 1 year ago

Your looping read will end up blocking though

Ha! Well that's why when I built this and threw it on my device, the boot menu didn't populate any entries... :facepalm:

If the read is going to block, I'd suggest just merging as-is. It works, even if it may cause the pipe handler a few more times than it needs to.

jk-ozlabs commented 1 year ago

Sounds good!

Merged to my tree, will push shortly. Thanks for the contribution.

open-power / petitboot

Handle SIGCHLD correctly #95