just-containers / s6-overlay

s6 overlay for containers (includes execline, s6-linux-utils & a custom init)

Intermittent error in startup sequence: `s6-svwait: fatal: unable to s6_svstatus_read: No such file or directory` #460

Closed: nocive closed this issue 2 years ago

nocive commented 2 years ago

Since we upgraded to 3.x, we have been observing seemingly random errors during the startup sequence of some of our applications. The error symptoms are as follows:

s6-rc: info: service legacy-cont-init successfully started
s6-rc: info: service legacy-services: starting
services-up: info: copying legacy longrun http-proxy (no readiness notification)
services-up: info: copying legacy longrun php-fpm (no readiness notification)
s6-svwait: fatal: unable to s6_svstatus_read: No such file or directory
services-up: warning: some legacy longruns failed to start
s6-rc: warning: unable to start service legacy-services: command exited 1
/run/s6/basedir/scripts/rc.init: warning: s6-rc failed to properly bring all the services up! Check your logs (in /run/uncaught-logs/current if you have in-container logging) for more information.
prog: fatal: stopping the container.
s6-rc: info: service legacy-cont-init: stopping
[30-Jun-2022 11:08:27] NOTICE: fpm is running, pid 128
[30-Jun-2022 11:08:27] NOTICE: ready to handle connections
s6-rc: info: service legacy-cont-init successfully stopped
s6-rc: info: service fix-attrs: stopping
s6-rc: info: service fix-attrs successfully stopped
s6-rc: info: service s6rc-oneshot-runner: stopping
s6-rc: info: service s6rc-oneshot-runner successfully stopped
[30-Jun-2022 11:08:27] NOTICE: Terminating ...
[30-Jun-2022 11:08:27] NOTICE: exiting, bye-bye!
s6-linux-init-hpr: fatal: unable to talk to shutdownd: Operation not permitted
s6-linux-init-hpr: fatal: unable to talk to shutdownd: Operation not permitted

This is a container that runs a legacy cont-init layer and two services via the legacy services.d layer (nginx + php-fpm); a sketch of that layout follows the list below. Other s6 environment details:

S6_CMD_WAIT_FOR_SERVICES=1
S6_CMD_WAIT_FOR_SERVICES_MAXTIME=30000
S6_BEHAVIOUR_IF_STAGE2_FAILS=2
S6_LOGGING=0
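
For reference, a minimal sketch of the legacy layout being described, using the service names from the log above (the script contents are placeholders; only the directory conventions follow s6-overlay's documented legacy layers):

# Legacy layout: oneshot scripts in /etc/cont-init.d, longruns as "run"
# scripts in /etc/services.d. Script bodies here are illustrative only.
mkdir -p /etc/cont-init.d /etc/services.d/http-proxy /etc/services.d/php-fpm
printf '#!/bin/sh\necho "one-time setup before legacy services start"\n' \
    > /etc/cont-init.d/00-setup.sh
printf '#!/bin/sh\nexec nginx -g "daemon off;"\n' > /etc/services.d/http-proxy/run
printf '#!/bin/sh\nexec php-fpm --nodaemonize\n' > /etc/services.d/php-fpm/run
chmod +x /etc/cont-init.d/00-setup.sh /etc/services.d/*/run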

Unfortunately, I don't have an isolated reproduction scenario to provide; the issue happens intermittently and I haven't found a way to reproduce it consistently. We can only observe it during our CI pipelines, despite using the exact same Docker containers we use everywhere else.

I'm mostly posting this looking for hints or clues as to what might be going on.

skarnet commented 2 years ago

Ah, that's annoying. Unfortunately, with /etc/services.d services there is an unavoidable race condition. We try to mitigate it here with a 5 ms sleep, but in your case it seems that is not enough.

Can you locally patch s6-overlay to change the line to `s6-sleep -m 10` and see if it fixes your issue?
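
If you'd rather not rebuild the image, here is a hedged sketch of such a local patch (the path under /package and the exact current sleep line are assumptions; verify with the grep before trusting the sed):

# Find the installed script containing the hardcoded 5 ms readiness sleep and
# bump it to 10 ms. Path and match string are assumptions about the image.
file=$(grep -rls 's6-sleep -m 5' /package/admin 2>/dev/null | head -n 1)
if [ -n "$file" ]; then
    sed -i 's/s6-sleep -m 5/s6-sleep -m 10/' "$file"
fi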

I suppose in the future I will have to make that sleep duration configurable, so that people can make /etc/services.d work even on very slow machines.

felipecrs commented 2 years ago

I have noticed the same issue in Home Assistant Core since S6-Overlay was updated to version 3, and in a few Home Assistant add-ons as well. It happens intermittently.

An option to make it configurable would be best, @skarnet. Looking forward to it.

/cc @frenck, @pvizeli

nocive commented 2 years ago

@skarnet thanks for the quick reply.

I can definitely try that, although it will take some time to confirm whether it helps or not.

So you're saying that using the new s6-rc layer would not lead to this issue? I found the migration to the new layer not as straightforward as I'd hoped, but I suppose this alone makes it worth a try.

skarnet commented 2 years ago

@nocive Yes, if you convert your services to s6-rc, the race condition goes away. Solving this was one of the first things that went into the s6-rc spec. ^^'

nocive commented 2 years ago

Ok, thanks! I think I'll try that instead; otherwise I might be banished to the shadow realm by my peers if I propose something that monkey-patches an s6 file and changes an arbitrary sleep :sweat_smile:

In all seriousness, it's probably for the best to switch to the new layer anyway, it was just not at the top of the priority list :)

Thanks for all your amazing work! <3

felipecrs commented 2 years ago

BTW, refactoring cont-init, services, and cont-finish to s6-rc isn't feasible in Home Assistant's case. In the new s6-rc format, every service must declare a dependency on its oneshots (see the sketch below). But Home Assistant ships a set of base images that come with a couple of cont-init.d scripts, and those scripts run no matter whether any services depend on them.

Converting them to s6-rc would require everyone who adds their own services on top of the base images to be aware of every oneshot and explicitly mark their services as dependent on them.

That would defeat the intent, since we want those cont-init scripts to always run.

For the record, this is why we didn't convert our legacy scripts to s6-rc while upgrading to v3.
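
For context, a minimal sketch of the v3 s6-rc source format this refers to (the service names are hypothetical; the directory layout follows s6-overlay's v3 documentation):

# A oneshot "my-init" and a longrun "my-service"; the longrun only starts
# after the oneshot because of the explicit entry in its dependencies.d.
mkdir -p /etc/s6-overlay/s6-rc.d/my-init
echo oneshot > /etc/s6-overlay/s6-rc.d/my-init/type
echo '/etc/my-init.sh' > /etc/s6-overlay/s6-rc.d/my-init/up

mkdir -p /etc/s6-overlay/s6-rc.d/my-service/dependencies.d
echo longrun > /etc/s6-overlay/s6-rc.d/my-service/type
printf '#!/command/execlineb -P\nmy-daemon\n' > /etc/s6-overlay/s6-rc.d/my-service/run
touch /etc/s6-overlay/s6-rc.d/my-service/dependencies.d/my-init

# both must also be registered in the "user" bundle to be started at boot
mkdir -p /etc/s6-overlay/s6-rc.d/user/contents.d
touch /etc/s6-overlay/s6-rc.d/user/contents.d/my-init \
      /etc/s6-overlay/s6-rc.d/user/contents.d/my-service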

That said, that's my 2 cents.

Thank you, as always, for S6-Overlay!

skarnet commented 2 years ago
felipecrs commented 2 years ago

Oh wow! Thanks a lot!

skarnet commented 2 years ago

I've added an S6_SERVICES_READYTIME variable to the latest s6-overlay git head.

@nocive I certainly don't want to deter you from converting your services to s6-rc :-) but if you can build s6-overlay from source, could you tell me whether using that variable (with a value greater than 5) solves your issue?
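
For anyone trying this, a usage sketch (the value is in milliseconds, like the 5 ms default it replaces; 50 is just an illustrative choice):

# try a larger readiness sleep without changing the image contents
docker run -e S6_SERVICES_READYTIME=50 my-image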

nocive commented 2 years ago

Oops, sorry, I missed all the activity going on here :-)

Unfortunately, I immediately jumped on the s6-rc bandwagon; everything has been converted to the new format and has been working like a charm ever since. I'm afraid I no longer have an environment where I can reproduce the reported issue :no_mouth:

skarnet commented 2 years ago

It's fine, converting to s6-rc is a good move no matter what. 😄 I'm closing this then.

felipecrs commented 1 year ago

@skarnet in the environment where this issue occurs for me, it's very rare, so it's difficult to test. I'm trying to find a way to replicate the issue so that I can check whether increasing S6_SERVICES_READYTIME actually fixes it.

I have tried several things without success (i.e. I'm not able to replicate the issue reliably). Do you have any idea how this race condition can be exercised?

@nocive

> I'm afraid I no longer have an environment where I can reproduce the reported issue 😶

By any chance, do you have any hint about what made your environment reproduce the issue?

skarnet commented 1 year ago

@felipecrs The only way to exercise the race condition is to be unlucky with the scheduler's decisions. You can probably increase the frequency by overloading the machine, i.e. making it difficult for a given process to get a CPU timeslice.

The race condition comes from the fact that s6-svwait accesses control files that are created by s6-supervise - in other words, it can only work on a service that is already being supervised.

With the s6-rc approach, all the s6-supervise processes are created before service transitions are attempted, so it always works.

With the /etc/services.d approach, the supervisors do not preexist - they are started at the same time as the services. So when you wait for the service to be ready, there's a slim chance, if the scheduler is whimsical enough, that s6-svwait will run before the relevant supervisor has obtained a timeslice and created the necessary control files. Increasing the sleep duration before s6-svwait gives more opportunities to the scheduler to properly get the supervisors up to speed.

The optimal sleep duration depends on your machine's load, really. If you expect CPU time to be at a premium, increase the duration by a significant amount.
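
To make the failure mode concrete, here is a toy bash illustration of a waiter racing a slow supervisor (plain shell, not s6 code; all paths are made up):

# The "supervisor" needs a timeslice before its status file exists; a waiter
# that looks too early fails, while one that sleeps briefly first succeeds.
dir=$(mktemp -d)
( sleep 0.02; mkdir -p "$dir/supervise" && : > "$dir/supervise/status" ) &  # supervisor

# waiter checking immediately, like s6-svwait right after the service is copied
[ -e "$dir/supervise/status" ] || echo "fatal: status file not there yet"

sleep 0.05   # the S6_SERVICES_READYTIME idea: give the supervisor a head start
[ -e "$dir/supervise/status" ] && echo "status file present after waiting"
wait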

felipecrs commented 1 year ago

@skarnet thanks a lot for the very detailed explanation. I'll conduct some tests based on this, and will share the result as well.

felipecrs commented 1 year ago

OK, this is really complicated. I tried to stress my CPU using `stress` with `s-tui`, but it doesn't seem to make any difference, I think.

Sorry, I had to upload the videos somewhere else because they are too big for GitHub.

S6_SERVICES_READYTIME=1

Multiple failures after 1 minute of video. I was able to reproduce the failure several times offline as well.

https://1drv.ms/v/s!AntpCaY1YjnMi6ItGFtMq9wgcVb4Nw?e=RoLwsh

S6_SERVICES_READYTIME=50

No failures in almost 5 minutes of testing. I wasn't able to catch a failure at all using S6_SERVICES_READYTIME=50.

https://1drv.ms/v/s!AntpCaY1YjnMi6ItGFtMq9wgcVb4Nw?e=ZdP36s

Conclusion

I know the testing environment is far from ideal, but I think this is already a very good signal.

felipecrs commented 1 year ago

I wrote a simple script to test it for me:

#!/bin/bash

# For a given S6_SERVICES_READYTIME value ($1), start the container 500 times
# and count how many startups hit the s6-svwait race.
run_test() {
    rm -f "$1"-log.txt
    local attempts=500
    local failures=0
    for i in $(seq 1 "$attempts"); do
        clear
        echo "S6_SERVICES_READYTIME=${1} - Attempt $i of $attempts"
        set -x
        docker rm -f test
        docker run --name test --env=S6_SERVICES_READYTIME="$1" --rm -d ghcr.io/home-assistant/generic-x86-64-homeassistant:2022.11.4
        sleep 1.25s
        set +x
        if docker logs test 2>&1 |
            tee /dev/stderr |
            grep -q "s6-svwait: fatal: unable to s6_svstatus_read: No such file or directory"; then
            echo "attempt ${i} at $(date)" >> "$1"-log.txt
            failures=$((failures + 1))
            # ring the terminal bell on each failure
            echo -e '\a'
        fi
    done
    echo "Failed ${failures} out of ${attempts} attempts for S6_SERVICES_READYTIME=${1}" >> result.txt
}

run_test 0

run_test 5

run_test 50

Here's the result:

Failed 43 out of 500 attempts for S6_SERVICES_READYTIME=0
Failed 48 out of 500 attempts for S6_SERVICES_READYTIME=5
Failed 0 out of 500 attempts for S6_SERVICES_READYTIME=50

Failures for S6_SERVICES_READYTIME=0:

attempt 1 at Thu Nov 24 20:17:48 -03 2022
attempt 7 at Thu Nov 24 20:18:03 -03 2022
attempt 9 at Thu Nov 24 20:18:08 -03 2022
attempt 12 at Thu Nov 24 20:18:16 -03 2022
attempt 13 at Thu Nov 24 20:18:19 -03 2022
attempt 53 at Thu Nov 24 20:20:13 -03 2022
attempt 58 at Thu Nov 24 20:20:28 -03 2022
attempt 80 at Thu Nov 24 20:21:35 -03 2022
attempt 93 at Thu Nov 24 20:22:44 -03 2022
attempt 114 at Thu Nov 24 20:23:51 -03 2022
attempt 115 at Thu Nov 24 20:23:53 -03 2022
attempt 119 at Thu Nov 24 20:24:03 -03 2022
attempt 135 at Thu Nov 24 20:24:56 -03 2022
attempt 156 at Thu Nov 24 20:25:50 -03 2022
attempt 177 at Thu Nov 24 20:26:43 -03 2022
attempt 202 at Thu Nov 24 20:27:47 -03 2022
attempt 205 at Thu Nov 24 20:27:55 -03 2022
attempt 215 at Thu Nov 24 20:28:21 -03 2022
attempt 228 at Thu Nov 24 20:28:53 -03 2022
attempt 231 at Thu Nov 24 20:29:15 -03 2022
attempt 232 at Thu Nov 24 20:29:31 -03 2022
attempt 235 at Thu Nov 24 20:29:38 -03 2022
attempt 240 at Thu Nov 24 20:29:51 -03 2022
attempt 250 at Thu Nov 24 20:30:17 -03 2022
attempt 259 at Thu Nov 24 20:30:40 -03 2022
attempt 282 at Thu Nov 24 20:32:08 -03 2022
attempt 296 at Thu Nov 24 20:32:44 -03 2022
attempt 330 at Thu Nov 24 20:34:11 -03 2022
attempt 349 at Thu Nov 24 20:34:59 -03 2022
attempt 352 at Thu Nov 24 20:35:07 -03 2022
attempt 359 at Thu Nov 24 20:35:28 -03 2022
attempt 398 at Thu Nov 24 20:37:22 -03 2022
attempt 411 at Thu Nov 24 20:37:55 -03 2022
attempt 417 at Thu Nov 24 20:38:11 -03 2022
attempt 425 at Thu Nov 24 20:38:31 -03 2022
attempt 428 at Thu Nov 24 20:38:39 -03 2022
attempt 434 at Thu Nov 24 20:38:54 -03 2022
attempt 441 at Thu Nov 24 20:39:12 -03 2022
attempt 444 at Thu Nov 24 20:39:19 -03 2022
attempt 461 at Thu Nov 24 20:40:02 -03 2022
attempt 463 at Thu Nov 24 20:40:07 -03 2022
attempt 477 at Thu Nov 24 20:40:43 -03 2022
attempt 497 at Thu Nov 24 20:41:35 -03 2022

Failures for S6_SERVICES_READYTIME=5:

attempt 16 at Thu Nov 24 21:13:33 -03 2022
attempt 21 at Thu Nov 24 21:13:46 -03 2022
attempt 31 at Thu Nov 24 21:14:14 -03 2022
attempt 37 at Thu Nov 24 21:14:30 -03 2022
attempt 38 at Thu Nov 24 21:14:32 -03 2022
attempt 54 at Thu Nov 24 21:15:16 -03 2022
attempt 58 at Thu Nov 24 21:15:26 -03 2022
attempt 59 at Thu Nov 24 21:15:29 -03 2022
attempt 60 at Thu Nov 24 21:15:32 -03 2022
attempt 63 at Thu Nov 24 21:15:40 -03 2022
attempt 65 at Thu Nov 24 21:15:45 -03 2022
attempt 85 at Thu Nov 24 21:16:40 -03 2022
attempt 86 at Thu Nov 24 21:16:42 -03 2022
attempt 96 at Thu Nov 24 21:17:09 -03 2022
attempt 99 at Thu Nov 24 21:17:17 -03 2022
attempt 104 at Thu Nov 24 21:17:29 -03 2022
attempt 110 at Thu Nov 24 21:17:44 -03 2022
attempt 112 at Thu Nov 24 21:17:49 -03 2022
attempt 116 at Thu Nov 24 21:17:59 -03 2022
attempt 124 at Thu Nov 24 21:18:20 -03 2022
attempt 132 at Thu Nov 24 21:18:41 -03 2022
attempt 143 at Thu Nov 24 21:19:27 -03 2022
attempt 154 at Thu Nov 24 21:20:12 -03 2022
attempt 163 at Thu Nov 24 21:20:36 -03 2022
attempt 202 at Thu Nov 24 21:22:38 -03 2022
attempt 231 at Thu Nov 24 21:24:01 -03 2022
attempt 239 at Thu Nov 24 21:24:22 -03 2022
attempt 260 at Thu Nov 24 21:25:59 -03 2022
attempt 267 at Thu Nov 24 21:26:18 -03 2022
attempt 271 at Thu Nov 24 21:26:28 -03 2022
attempt 282 at Thu Nov 24 21:27:11 -03 2022
attempt 298 at Thu Nov 24 21:27:56 -03 2022
attempt 299 at Thu Nov 24 21:27:58 -03 2022
attempt 320 at Thu Nov 24 21:28:53 -03 2022
attempt 325 at Thu Nov 24 21:29:06 -03 2022
attempt 327 at Thu Nov 24 21:29:11 -03 2022
attempt 331 at Thu Nov 24 21:29:22 -03 2022
attempt 333 at Thu Nov 24 21:29:27 -03 2022
attempt 339 at Thu Nov 24 21:29:43 -03 2022
attempt 355 at Thu Nov 24 21:30:37 -03 2022
attempt 385 at Thu Nov 24 21:32:10 -03 2022
attempt 409 at Thu Nov 24 21:33:13 -03 2022
attempt 438 at Thu Nov 24 21:34:30 -03 2022
attempt 439 at Thu Nov 24 21:34:32 -03 2022
attempt 445 at Thu Nov 24 21:34:48 -03 2022
attempt 463 at Thu Nov 24 21:35:34 -03 2022
attempt 479 at Thu Nov 24 21:36:34 -03 2022
attempt 497 at Thu Nov 24 21:37:24 -03 2022

I think this is conclusive: I can safely say that increasing S6_SERVICES_READYTIME is effective.

skarnet commented 1 year ago

So apparently 5 ms is doing nothing, which in hindsight isn't that surprising because it's a very small delay. 50 ms feels like a bit much, but is probably still in the unnoticeable range, so if it works for you, it should be safe for everyone. I increased the default sleep time to 50 ms in the latest git.

Thanks for testing!

felipecrs commented 1 year ago

> I increased the default sleep time to 50 ms in the latest git.

Wow, thanks so much! This will bring me a lot of peace of mind.