s6-linux-init-shutdown: fatal: unable to talk to shutdownd: Operation not permitted

colinmollenhour commented 2 years ago

Create a Dockerfile with one or more longrun services:

FROM ubuntu
ARG S6_OVERLAY_VERSION=3.0.0.2

RUN apt-get update && apt-get install -y nginx xz-utils
RUN echo "daemon off;" >> /etc/nginx/nginx.conf

ADD https://github.com/just-containers/s6-overlay/releases/download/v${S6_OVERLAY_VERSION}/s6-overlay-noarch-${S6_OVERLAY_VERSION}.tar.xz /tmp
RUN tar -C / -Jxpf /tmp/s6-overlay-noarch-${S6_OVERLAY_VERSION}.tar.xz
ADD https://github.com/just-containers/s6-overlay/releases/download/v${S6_OVERLAY_VERSION}/s6-overlay-x86_64-${S6_OVERLAY_VERSION}.tar.xz /tmp
RUN tar -C / -Jxpf /tmp/s6-overlay-x86_64-${S6_OVERLAY_VERSION}.tar.xz

RUN mkdir -p /etc/s6-overlay/s6-rc.d/nginx /etc/cont-init.d/ \
  && echo '#!/command/with-contenv sh\necho "Starting $FOO..."\nexec /usr/sbin/nginx' > /etc/s6-overlay/s6-rc.d/nginx/run \
  && echo "longrun" > /etc/s6-overlay/s6-rc.d/nginx/type \
  && touch /etc/s6-overlay/s6-rc.d/user/contents.d/nginx \
  && echo '#!/command/with-contenv sh\necho "Hello $FOO"' > /etc/cont-init.d/00-hello \
  && chmod +x /etc/cont-init.d/00-hello

ENV FOO=bar
ENTRYPOINT ["/init"]
CMD []
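The `echo '...\n...'` lines above work only because of the shell that `RUN` uses. On Ubuntu, `/bin/sh` is dash, whose `echo` turns `\n` into newlines. Bash's builtin `echo` would write a literal `\n` unless given `-e`. Below is a minimal sketch of the same service setup using `printf`, which behaves the same in any POSIX shell. `ROOT` is a hypothetical staging prefix, added only so the snippet can be tried outside a container; in a Dockerfile `RUN` step you would drop it and write directly under `/`. The `chmod +x` on `run` is an extra precaution.

```shell
#!/bin/sh
# ROOT is a hypothetical staging prefix so this can run outside a container;
# in a Dockerfile RUN step, drop it and write directly under /.
ROOT=${ROOT:-$(mktemp -d)}
SVC="$ROOT/etc/s6-overlay/s6-rc.d/nginx"

mkdir -p "$SVC" "$ROOT/etc/s6-overlay/s6-rc.d/user/contents.d" "$ROOT/etc/cont-init.d"

# printf '%s\n' writes each argument on its own line, in any POSIX shell.
printf '%s\n' '#!/command/with-contenv sh' 'echo "Starting $FOO..."' 'exec /usr/sbin/nginx' > "$SVC/run"
printf '%s\n' longrun > "$SVC/type"

# Add nginx to the "user" bundle so s6-rc starts it when the container starts.
touch "$ROOT/etc/s6-overlay/s6-rc.d/user/contents.d/nginx"

printf '%s\n' '#!/command/with-contenv sh' 'echo "Hello $FOO"' > "$ROOT/etc/cont-init.d/00-hello"

# Mark both scripts executable (for the run file this is an extra precaution).
chmod +x "$SVC/run" "$ROOT/etc/cont-init.d/00-hello"
```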

Build and run:

docker build . -t s6demo
docker run --rm --name s6demo -p 80:80 s6demo

Press Ctrl+C to kill the container. Run "echo $?" to get the return code.

Expected result:

A quick, clean shutdown with return code 0.

Actual result:

Shutdown prints a "fatal" error, takes about 3-4 seconds, and the return code is 111.

^Cs6-rc: info: service legacy-services: stopping
s6-rc: info: service legacy-services successfully stopped
s6-rc: info: service legacy-cont-init: stopping
s6-rc: info: service nginx: stopping
s6-rc: info: service legacy-cont-init successfully stopped
s6-rc: info: service fix-attrs: stopping
s6-rc: info: service fix-attrs successfully stopped
s6-rc: info: service s6rc-oneshot-runner: stopping
s6-rc: info: service nginx successfully stopped
s6-rc: info: service s6rc-oneshot-runner successfully stopped
s6-linux-init-shutdown: fatal: unable to talk to shutdownd: Operation not permitted

Comments

Maybe I'm missing something, but this is not the result I expect, given that my services shut down instantly and cleanly. The extra delay is unnecessary downtime when recreating containers with docker-compose, for example, and the return code signals an error even though there was none.

skarnet commented 2 years ago

The "fatal error" problem existed in 3.0.0.0, but has been fixed in 3.0.0.2. It's working for me with your exact Dockerfile, no fatal error message.
The exit code of 111 instead of 0 when you stop a container via ^C or docker stop is indeed a known issue; it will be fixed in the next version. (Already fixed in the git head, if you want to build s6-overlay from source.)
The 3 second pause before the container exits is normal: it's the grace time between the SIGTERM and the SIGKILL that are sent to all the processes in the container. You can reduce it via the S6_KILL_GRACETIME variable, which is the number of milliseconds that s6-overlay should wait: set it to something like 100 or 200 if you're certain that all your services are well-behaved and don't leave behind children that might take time to die when sent a SIGTERM.
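For example, the grace time can be lowered for a single run with `-e`, or baked into the image with `ENV S6_KILL_GRACETIME=200`. The sketch below wraps the first option in a hypothetical helper, `run_s6demo`, reusing the image and container names from the reproduction above. 200 ms is only an example value and, as noted, is only safe when no service leaves slow-dying children behind.

```shell
#!/bin/sh
# Hypothetical helper: start the s6demo image detached with a shorter
# final-kill grace time. S6_KILL_GRACETIME is in milliseconds (default 3000).
run_s6demo() {
  grace_ms=${1:-200}   # example value, not a recommendation
  docker run -d --name s6demo -p 80:80 \
    -e "S6_KILL_GRACETIME=$grace_ms" \
    s6demo
}

# Usage:
#   run_s6demo 200
#   docker stop s6demo   # expected to return faster than with the 3000 ms default
```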

kriansa commented 2 years ago

@skarnet regarding point 3: doesn't s6 check whether the process has already finished before the grace time is up, and move on to the SIGKILL without waiting out the full delay?

skarnet commented 2 years ago

For supervised processes, yes, but that's not what S6_KILL_GRACETIME is about. (Yes, the name is confusing.) It's about the final kill of all the processes in the container right before the container exits. That includes unsupervised processes, and s6 has no way of knowing whether those finished on SIGTERM or not.
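As described here, the final kill is not a per-process wait. Every remaining process gets a SIGTERM, the whole grace time elapses, and then everything left gets a SIGKILL, because s6 cannot tell whether unsupervised processes died on the SIGTERM. The following is only an illustrative shell sketch of that sequence, applied to an explicit list of PIDs. `final_kill` is a hypothetical name, and this is not s6-overlay's actual code, which signals every process in the container.

```shell
#!/bin/sh
# Illustrative only: mimics the "SIGTERM, wait the full grace time, SIGKILL"
# sequence for an explicit list of PIDs. final_kill is a hypothetical name.
final_kill() {
  grace_ms=$1
  shift
  # Ask politely first; SIGCONT lets stopped processes act on the SIGTERM.
  # Errors for PIDs that are already gone are ignored.
  kill -TERM "$@" 2>/dev/null || :
  kill -CONT "$@" 2>/dev/null || :
  # No per-process tracking, so always wait out the whole grace time.
  # 3000 -> "3.000", 200 -> "0.200" (fractional sleep works with GNU
  # coreutils and BusyBox, but is not required by POSIX).
  sleep "$((grace_ms / 1000)).$(printf '%03d' $((grace_ms % 1000)))"
  # Anything that ignored or outlived the SIGTERM is killed unconditionally.
  kill -KILL "$@" 2>/dev/null || :
}
```

With a value of 3000, `final_kill 3000 PID...` always takes three seconds, even when every process exits on the SIGTERM immediately. That is the same kind of fixed pause seen in the original report.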

colinmollenhour commented 2 years ago

Thanks for the info @skarnet. So, just to clarify: if I have, say, an Apache process that was already handling work when shutdown was issued, and that would take another 15 seconds to complete before Apache shuts down gracefully, is S6_KILL_GRACETIME going to cause it to get killed?

colinmollenhour commented 2 years ago

Testing with S6_KILL_GRACETIME=1, the container seems to exit faster, but the nginx service still reports that it stopped gracefully, so I think that is a safe configuration? I'm still not sure why a kill grace time is needed at all if s6 waits for each service to exit gracefully, but I'm probably missing something.

> The "fatal error" problem existed in 3.0.0.0, but has been fixed in 3.0.0.2. It's working for me with your exact Dockerfile, no fatal error message.

I'm definitely still getting that error message, even with 3.0.0.2-2, using the release files from GitHub. How could that be? Could there be a difference between your local build and the release files?

[+] Building 61.3s (14/14) FINISHED
 => [internal] load build definition from Dockerfile                                                                                                           0.0s
 => => transferring dockerfile: 1.12kB                                                                                                                         0.0s 
 => [internal] load .dockerignore                                                                                                                              0.0s 
 => => transferring context: 2B                                                                                                                                0.0s 
 => [internal] load metadata for docker.io/library/ubuntu:latest                                                                                               0.3s 
 => https://github.com/just-containers/s6-overlay/releases/download/v3.0.0.2-2/s6-overlay-noarch-3.0.0.2-2.tar.xz                                              0.0s
 => CACHED [1/8] FROM docker.io/library/ubuntu@sha256:669e010b58baf5beb2836b253c1fd5768333f0d1dbcb834f7c07a4dc93f474be                                         0.0s 
 => https://github.com/just-containers/s6-overlay/releases/download/v3.0.0.2-2/s6-overlay-x86_64-3.0.0.2-2.tar.xz                                              0.0s 
 => [2/8] RUN apt-get update && apt-get install -y nginx xz-utils                                                                                             58.7s 
 => [3/8] RUN echo "daemon off;" >> /etc/nginx/nginx.conf                                                                                                      0.4s
 => [4/8] ADD https://github.com/just-containers/s6-overlay/releases/download/v3.0.0.2-2/s6-overlay-noarch-3.0.0.2-2.tar.xz /tmp                               0.0s
 => [5/8] RUN tar -C / -Jxpf /tmp/s6-overlay-noarch-3.0.0.2-2.tar.xz                                                                                           0.5s
 => [6/8] ADD https://github.com/just-containers/s6-overlay/releases/download/v3.0.0.2-2/s6-overlay-x86_64-3.0.0.2-2.tar.xz /tmp                               0.0s
 => [7/8] RUN tar -C / -Jxpf /tmp/s6-overlay-x86_64-3.0.0.2-2.tar.xz                                                                                           0.5s
 => [8/8] RUN mkdir -p /etc/s6-overlay/s6-rc.d/nginx /etc/cont-init.d/   && echo '#!/command/with-contenv sh\necho "Starting $FOO..."\nexec /usr/sbin/nginx'   0.5s
 => exporting to image                                                                                                                                         0.4s 
 => => exporting layers                                                                                                                                        0.4s 
 => => writing image sha256:ce4db5a3a719911423f680552326af9ff0bd182fc9114d64cd47e674c97f69aa                                                                   0.0s 
 => => naming to docker.io/library/s6demo                                                                                                                      0.0s 

s6-rc: info: service nginx: starting
s6-rc: info: service s6rc-oneshot-runner: starting
s6-rc: info: service nginx successfully started
Starting bar...
s6-rc: info: service s6rc-oneshot-runner successfully started
s6-rc: info: service fix-attrs: starting
s6-rc: info: service fix-attrs successfully started
s6-rc: info: service legacy-cont-init: starting
cont-init: info: running /etc/cont-init.d/00-hello
Hello bar
cont-init: info: /etc/cont-init.d/00-hello exited 0
s6-rc: info: service legacy-cont-init successfully started
s6-rc: info: service legacy-services: starting
s6-rc: info: service legacy-services successfully started
^Cs6-rc: info: service legacy-services: stopping
s6-rc: info: service legacy-services successfully stopped
s6-rc: info: service legacy-cont-init: stopping
s6-rc: info: service nginx: stopping
s6-rc: info: service legacy-cont-init successfully stopped
s6-rc: info: service fix-attrs: stopping
s6-rc: info: service fix-attrs successfully stopped
s6-rc: info: service s6rc-oneshot-runner: stopping
s6-rc: info: service nginx successfully stopped
s6-rc: info: service s6rc-oneshot-runner successfully stopped
s6-linux-init-shutdown: fatal: unable to talk to shutdownd: Operation not permitted

skarnet commented 2 years ago

S6_KILL_GRACETIME does not affect supervised services, only processes that are still alive at the end of the container's lifetime. If your nginx is supervised, it is safe: it will be stopped cleanly. (There is also a per-service timeout after which it gets a SIGKILL if it's not dead yet. It's 5 seconds by default; you can change it by writing a timeout-kill file in the service directory if needed, and it has no influence on the global death time.) If you only use supervised services that you know don't leave behind unattended, misbehaving children, you can safely set a very small S6_KILL_GRACETIME to minimize the waiting time at the end.
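For a service that needs longer than the default to finish in-flight work, like the 15-second Apache case asked about above, here is a minimal sketch of raising that per-service timeout for the demo's nginx service. It assumes s6-rc carries a `timeout-kill` file (a number of milliseconds) from the source definition directory into the compiled service directory. 20000 is only an example value, and `ROOT` is a hypothetical staging prefix so the snippet can run outside a container (write directly under `/` in a Dockerfile `RUN` step).

```shell
#!/bin/sh
# ROOT is a hypothetical staging prefix; in a Dockerfile RUN step, use / instead.
ROOT=${ROOT:-$(mktemp -d)}
SVC="$ROOT/etc/s6-overlay/s6-rc.d/nginx"
mkdir -p "$SVC"

# Milliseconds s6 waits after stopping the service with SIGTERM before it
# sends SIGKILL. 20000 (20 s) is an example value, not a recommendation.
printf '%s\n' 20000 > "$SVC/timeout-kill"
```

Per the explanation above, this per-service timeout is independent of S6_KILL_GRACETIME, so raising it does not lengthen the final kill at container exit.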

The only way you would get an "unable to talk to shutdownd" error is if your ^C sends a SIGINT to the supervision tree's process group and kills the whole tree, which should definitely not happen if your s6-overlay has been built with s6-linux-init version 1.0.7.1 or later. I will do more testing before the next release to make sure it works, but something is definitely not right. As a workaround in the meantime, if you stop your containers with docker stop instead, you should not get that error.
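As a sketch of that workaround, the helper below stops a named container with `docker stop` and prints its exit code via `docker inspect`. `stop_and_report` is a hypothetical name. The container must be started without `--rm` (for example `docker run -d --name s6demo -p 80:80 s6demo`) so that it still exists when its exit code is read.

```shell
#!/bin/sh
# Hypothetical helper: stop a container with docker stop (SIGTERM by default,
# then SIGKILL after Docker's own timeout), print its exit code, and remove it.
stop_and_report() {
  name=$1
  docker stop "$name" >/dev/null || return 1
  code=$(docker inspect -f '{{.State.ExitCode}}' "$name") || return 1
  docker rm "$name" >/dev/null
  printf '%s\n' "$code"
}

# Usage:
#   docker run -d --name s6demo -p 80:80 s6demo
#   stop_and_report s6demo
```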

skarnet commented 2 years ago

Testing with the latest version, which is on track to become 3.1.0.0 (you can check for yourself if you build s6-overlay from source):

Definitely no "s6-linux-init-shutdown: fatal: unable to talk to shutdownd: Operation not permitted" error.
docker stop makes the container exit 0.
^C makes the container exit 111. I prefer to keep that one as is, because ^C is an interruption, not an official way of stopping a container, especially one without a CMD; even if it's a convenient shortcut for testing, I don't think it's a good idea to elevate ^C to "one-keypress docker stop" status.

colinmollenhour commented 2 years ago

Thanks for the updates, makes sense!

tyranron commented 11 months ago

@skarnet

It seems that with the 3.1.6.1 release I'm hitting this again:

Dockerfile

docker run --rm --pull never \
    -e PURE_PASSWDFILE=/tmp/pureftpd.passwd \
    -e PURE_DBFILE=/pureftpd.pdb \
    -v $(pwd)/tests/resources/pureftpd.passwd:/tmp/pureftpd.passwd:ro \
    instrumentisto/pure-ftpd:1.0.51-r20 test -f /pureftpd.pdb

outputs:

   s6-rc: info: service syslog: starting
   s6-rc: info: service s6rc-oneshot-runner: starting
   s6-rc: info: service syslog successfully started
   s6-rc: info: service s6rc-oneshot-runner successfully started
   Nov 20 12:21:34 7922563421b8 syslog.info syslogd started: BusyBox v1.36.1
   s6-rc: info: service fix-attrs: starting
   s6-rc: info: service create-puredb: starting
   s6-rc: info: service create-puredb successfully started
   s6-rc: info: service fix-attrs successfully started
   s6-rc: info: service legacy-cont-init: starting
   s6-rc: info: service legacy-cont-init successfully started
   s6-rc: info: service legacy-services: starting
   s6-rc: info: service legacy-services successfully started
   s6-rc: info: service legacy-services: stopping
   s6-rc: info: service legacy-services successfully stopped
   s6-rc: info: service legacy-cont-init: stopping
   s6-rc: info: service create-puredb: stopping
   s6-rc: info: service syslog: stopping
   Nov 20 12:21:34 7922563421b8 syslog.info syslogd exiting
   s6-rc: info: service create-puredb successfully stopped
   s6-rc: info: service syslog successfully stopped
   s6-rc: info: service legacy-cont-init successfully stopped
   s6-rc: info: service fix-attrs: stopping
   s6-rc: info: service fix-attrs successfully stopped
   s6-rc: info: service s6rc-oneshot-runner: stopping
   s6-rc: info: service s6rc-oneshot-runner successfully stopped
   s6-linux-init-shutdown: fatal: unable to talk to shutdownd: Operation not permitted
   s6-linux-init-shutdown: fatal: unable to talk to shutdownd: Operation not permitted

On 3.1.6.0, though, everything is totally OK.

skarnet commented 11 months ago

Confirmed. Working on it.

just-containers / s6-overlay