Closed dtgriscom closed 2 years ago
Here is Laurent Bercot's response to my s6 mailing list query:
Hi Daniel,
I'm actually not the maintainer of s6-overlay: John is. I think the correct place to describe your issue is GitHub where s6-overlay is hosted.
I am aware that there is a race condition problem with zombies in the shutdown sequence of s6-overlay. This is not the first time it occurs (at some point broken kernels were also causing similar troubles, but this is probably not what is happening here).
For instance, I know that the line at https://github.com/just-containers/s6-overlay/blob/master/builder/overlay-rootfs/etc/s6/init/init-stage3#L53 is incorrect: s6-svwait cannot run correctly when the supervision tree has been torn down, which is the case in init-stage3. This is why the s6-svwait programs are waiting until they time out: even though the services they're waiting for are down, they're never triggered because the associated s6-supervise processes, which perform the triggers, are already dead.
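The failure mode Laurent describes can be illustrated with a small Python analogy (this is not s6's actual notification mechanism, which uses fifodirs; the `service_down` event below is a hypothetical stand-in): the waiter blocks on a notification that only the now-dead notifier process would have sent, so the only way out is the timeout.

```python
import threading, time

# Hypothetical stand-in for the "service is down" notification that
# s6-supervise would normally send. At this point in init-stage3 the
# supervisor is already dead, so nothing ever sets the event.
service_down = threading.Event()

start = time.monotonic()
notified = service_down.wait(timeout=0.2)   # like s6-svwait's timeout
elapsed = time.monotonic() - start

print(notified)   # False: the trigger never fired; we waited out the full timeout
```

The service really is down, but the waiter has no way to learn that, exactly like s6-svwait waiting on a trigger from an s6-supervise process that no longer exists.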
Unfortunately, fixing this requires a significant rewrite of the s6-overlay shutdown sequence. I have started working on this, but it has been preempted by another project, and will likely not come out soon. I'm sorry; I would like to provide the correct shutdown sequence you're looking for (and it is entirely possible to achieve with s6), but as is, we have to make do with the current sequence.
A tweak I would try is replacing the whole foreground block at lines 48-55 with the following (no foreground block this time):

```
backtick -D 3000 -n S6_SERVICES_GRACETIME { printcontenv S6_SERVICES_GRACETIME }
importas -u S6_SERVICES_GRACETIME S6_SERVICES_GRACETIME
wait -t ${S6_SERVICES_GRACETIME} { }
```
This makes it so init-stage3 simply waits for all processes to die before continuing, instead of waiting for a trigger that will never come. It is not a long-term solution though, because having for instance a shell on your container will make the "wait" command block until it times out; but it may be helpful for your situation.
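The behavior of that tweak can be sketched in Python (an illustration of the idea, not the execline code itself; `wait_all_children` is a hypothetical helper): reap children until none remain, giving up when a gracetime deadline passes, which is also why a lingering shell would make it block until the timeout.

```python
import os, time

def wait_all_children(gracetime_ms: int) -> bool:
    """Reap every child of this process; return True if all exited
    before the deadline, False if we gave up (like `wait -t` timing out)."""
    deadline = time.monotonic() + gracetime_ms / 1000.0
    while time.monotonic() < deadline:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return True            # no children left at all: done
        if pid == 0:
            time.sleep(0.01)       # children remain, but none exited yet
    return False                   # something (e.g. a shell) never exited

# Demo: two short-lived children are reaped well before the deadline.
for _ in range(2):
    if os.fork() == 0:
        os._exit(0)                # child exits immediately
print(wait_all_children(3000))    # True
```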
Please open a GitHub issue to discuss this.
Heads up: the next version of s6-overlay is almost ready and fixes this problem (among others).
v3.0.0.0 is out (the built tarballs aren't there yet, but the source is available and it's easy to build yourself). It should solve any zombie-related issue. Please reopen an issue if you are still having trouble.
Hello, all. I'm using s6 as the init process manager in a Docker container, via s6-overlay. Everything's working fine, except that when I send a SIGINT to the container, the managed processes exit but become zombies and aren't reaped, forcing the system to time out (twice, actually).
I'm using ubuntu:20.04 as the container base, with s6-overlay amd64 version 2.2.0.3, which I believe includes the latest s6. Everything runs on an Ubuntu 18.04 desktop system. It looks like s6-svscan sends SIGINT or SIGTERM to the processes and then uses s6-svwait to wait for them to exit, but the zombie processes are never reaped.
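For readers unfamiliar with the mechanics: a zombie is a child that has exited but whose parent hasn't yet collected its status with wait(). A minimal Linux-specific demonstration in Python (not s6 code):

```python
import os, time

pid = os.fork()
if pid == 0:
    os._exit(0)          # child exits immediately...

time.sleep(0.2)          # ...but the parent hasn't called wait() yet

# On Linux, the field after the (comm) in /proc/<pid>/stat is the
# process state: 'Z' means zombie.
with open(f"/proc/{pid}/stat") as f:
    stat = f.read()
state = stat.rsplit(")", 1)[1].split()[0]
print(state)             # 'Z': the child lingers until someone reaps it

os.waitpid(pid, 0)       # reaping makes the zombie disappear
```

This is why the shutdown stalls: the children are already dead, but until something reaps them they still count as live processes.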
I found the following reference suggesting the problem might be a kernel issue: https://github.com/just-containers/s6-overlay/issues/135 , although I'm not seeing the high zombie CPU usage described there. I also found https://wiki.gentoo.org/wiki/S6 , which suggests sending a SIGCHLD to s6-svscan to make it re-scan for zombies; I tried that, but it didn't work.
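The reaping-on-SIGCHLD idea from that wiki page can be illustrated in plain Python (again, an analogy rather than s6-svscan's actual code): a SIGCHLD handler that drains every exited child with a non-blocking waitpid loop.

```python
import os, signal, time

reaped = []

def on_sigchld(signum, frame):
    # Reap every child that has exited; WNOHANG keeps us from blocking.
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return              # no children at all
        if pid == 0:
            return              # children remain, but none are zombies
        reaped.append(pid)

signal.signal(signal.SIGCHLD, on_sigchld)

child = os.fork()
if child == 0:
    os._exit(0)                 # child exits, triggering SIGCHLD in the parent

time.sleep(0.2)                 # give the signal time to arrive
print(child in reaped)          # True: the handler reaped the would-be zombie
```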
Here are the processes once everything is started (viewed by "ps axl" after running bash in a separate connection to the container):
And, once I issue a SIGINT to the container, but before any timeout:
And, after the system times out and sends SIGTERM to all the processes:
Notes:

- I set `S6_SERVICES_GRACETIME` and `S6_KILL_GRACETIME` to 10000 for the above tests.
- The timeout occurs in a `foreground` command; perhaps it needs to check for and reap zombies?
- It would be easy to cut the timeouts to, say, 100ms each, but I'd much rather have a correct shutdown sequence, as that's why I switched to s6 and s6-overlay in the first place.
(FYI, I first posted this on the s6 mailing list, and Laurent suggested I post it here. He also gave some good information which I'll add to this issue as a comment.)