nocive closed this issue 2 years ago
Ah, that's annoying.
Unfortunately, with `/etc/services.d` services there is an unavoidable race condition. We try to mitigate it here with a 5 ms sleep, but in your case it seems like that is not enough.
Can you locally patch s6-overlay to change that line to `s6-sleep -m 10` and see if it fixes your issue?
I suppose in the future I will have to make that sleep duration configurable, so that people can make `/etc/services.d` work even on very slow machines.
I have noticed the same issue in Home Assistant Core since S6-Overlay was updated to version 3, and in a few Home Assistant add-ons as well. It happens intermittently.
An option to make it configurable would be the best, @skarnet. Looking forward to it.
/cc @frenck, @pvizeli
@skarnet thanks for the quick reply.
I can definitely try that, although it will take some time to confirm whether it helps or not.
But so, you're saying that using the new s6-rc layer would not lead to this issue? I found the migration to the new layer not as straightforward as I'd hoped, but I suppose this alone makes it worth the try.
@nocive Yes, if you convert your services to s6-rc, the race condition goes away. Solving this was one of the first things that went into the s6-rc spec. ^^'
Ok, thanks! I think I'll try that instead; otherwise I might be banished to the shadow realm by my peers if I propose something that monkey-patches an s6 file to change an arbitrary sleep :sweat_smile:
In all seriousness, it's probably for the best to switch to the new layer anyway, it was just not at the top of the priority list :)
Thanks for all your amazing work! <3
BTW, refactoring cont-init, services, and cont-finish to s6-rc isn't feasible in Home Assistant's case. In the new s6-rc format, every service must declare a dependency on its oneshots. But Home Assistant ships a set of base images that come with a couple of `cont-init.d` scripts, and those scripts must run whether or not any service depends on them.
Converting them to s6-rc would require everyone who adds their own services on top of the base images to be aware of every oneshot and explicitly mark their services as dependent on them.
That defeats the intent, since we want those cont-init scripts to always run.
For the record, this was the reason we didn't convert our legacy scripts to s6-rc while upgrading to v3.
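To make the constraint concrete, here is a hedged sketch of what each downstream service would have to declare under s6-rc. It builds the source tree in a temp directory rather than the real `/etc/s6-overlay/s6-rc.d`, and the names `my-init` and `my-service` are made up for illustration:

```shell
#!/bin/sh
# Hypothetical s6-rc source definitions, built in a temp directory.
set -e
S="$(mktemp -d)"

# A oneshot: roughly what a cont-init.d script becomes under s6-rc.
mkdir -p "$S/my-init"
echo oneshot > "$S/my-init/type"
echo '/etc/my-init-script' > "$S/my-init/up"

# A longrun: under s6-rc it must *explicitly* declare the oneshot as a
# dependency -- this per-service declaration is the burden described above.
mkdir -p "$S/my-service/dependencies.d"
echo longrun > "$S/my-service/type"
touch "$S/my-service/dependencies.d/my-init"

ls "$S/my-service/dependencies.d"
```

With the legacy `cont-init.d` layer, none of this bookkeeping exists: the oneshots simply run before any service starts.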
That said, here are my two cents:
Thank you, as always, for S6-Overlay!
- The `user` bundle. If you're a distribution based on s6-overlay, you can probably list them in the `top` bundle instead, which is defined in `/package/admin/s6-overlay/s6-rc/sources/top`. Make sure that these oneshots depend on `base`, so they're always run after fix-attrs and cont-init.
- The `user2` bundle, which is started late, after `user` and `legacy-services`. If you have services that need to wait until everything else is started, that's a place where you can add them. Don't overuse it, because there won't be any `user3` if you need one. :-)
- The `S6_STAGE2_HOOK` environment variable; read what it does at the bottom of the list. The idea is that if a distribution needs to preprocess the source definition directories of services, it can run a preprocessor there. You could use this to automatically add dependencies from all services to all the oneshots your distribution provides, even if users do not declare them.

Oh wow! Thanks a lot!
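The preprocessor idea mentioned above (automatically giving every service a dependency on the distribution's oneshots, in the spirit of `S6_STAGE2_HOOK`) could look roughly like this. This is a hypothetical sketch over a fake source tree in a temp directory; the names `base-init` and `web` are invented:

```shell
#!/bin/sh
# Hypothetical preprocessor sketch: for every longrun in a source
# directory, inject a dependency on a distribution-provided oneshot.
src="$(mktemp -d)"

# Fake tree: one distribution oneshot, one user longrun with no deps.
mkdir -p "$src/base-init" "$src/web"
echo oneshot > "$src/base-init/type"
echo longrun > "$src/web/type"

# The preprocessing step a hook could run before s6-rc-compile.
for svc in "$src"/*/; do
  [ "$(cat "$svc/type")" = longrun ] || continue
  mkdir -p "$svc/dependencies.d"
  touch "$svc/dependencies.d/base-init"
done

ls "$src/web/dependencies.d"
```

This way users adding services on top of the base images would not need to know about each oneshot themselves.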
I've added an `S6_SERVICES_READYTIME` variable to the latest s6-overlay git head.
@nocive I certainly don't want to deter you from converting your services to s6-rc :-) but if you can build s6-overlay from source, could you tell me if using that variable (with a value greater than 5) solves your issue?
Oops, sorry, I missed all the activity going on here :-)
Unfortunately I immediately jumped into the s6-rc bandwagon and everything has been converted to the new format and has been working like a charm since. I'm afraid I no longer have an environment where I can reproduce the reported issue :no_mouth:
It's fine, converting to s6-rc is a good move no matter what. 😄 I'm closing this then.
@skarnet in the environment where this issue occurs for me, it's very rare, so it's difficult to test. I'm trying to find a way to replicate the issue so that I can test whether increasing `S6_SERVICES_READYTIME` actually fixes it.
I have tried several things without success (i.e., I'm not able to replicate the issue reliably). Do you have any idea how this race condition can be exercised?
@nocive
> I'm afraid I no longer have an environment where I can reproduce the reported issue 😶

By any chance, do you have any hint about what made your environment reproduce the issue?
@felipecrs The only way to exercise the race condition is to be unlucky with the scheduler's decisions. You can probably increase the frequency by overloading the machine, i.e. making it difficult for a given process to get a CPU timeslice.
The race condition comes from the fact that `s6-svwait` accesses control files that are created by `s6-supervise`; in other words, it can only work on a service that is already being supervised.
With the s6-rc approach, all the `s6-supervise` processes are created before service transitions are attempted, so it always works.
With the `/etc/services.d` approach, the supervisors do not preexist: they are started at the same time as the services. So when you wait for a service to be ready, there's a slim chance, if the scheduler is whimsical enough, that `s6-svwait` will run before the relevant supervisor has obtained a timeslice and created the necessary control files. Increasing the sleep duration before `s6-svwait` gives the scheduler more opportunities to properly get the supervisors up to speed.
The optimal sleep duration depends on your machine's load, really. If you expect CPU time to be at a premium, increase the duration by a significant amount.
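As an illustration only (no s6 tools involved; all names, file paths, and delays are fake stand-ins), the race can be simulated with a background "supervisor" that needs time before its control file exists:

```shell
#!/bin/sh
# Toy model of the race: a fake supervisor creates its status file only
# after a delay (standing in for scheduler latency), while a fake
# s6-svwait fails if the file is not there yet.
dir="$(mktemp -d)"

fake_supervise() {            # stands in for s6-supervise
  sleep 0.2                   # pretend the scheduler was slow to run us
  mkdir -p "$dir/supervise"
  : > "$dir/supervise/status"
}

fake_svwait() {               # stands in for s6-svwait
  [ -e "$dir/supervise/status" ] && echo "service ready"
}

fake_supervise &              # supervisor and service start concurrently

# Checking immediately usually loses the race...
fake_svwait || echo "race lost: no status file yet"

# ...while sleeping first (the S6_SERVICES_READYTIME idea) wins it.
sleep 0.4
fake_svwait
wait
```

The second check succeeds simply because the sleep gave the "supervisor" enough time to create its control file, which is exactly what increasing the readiness sleep does for real supervisors on a loaded machine.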
@skarnet thanks a lot for the very detailed explanation. I'll conduct some tests based on this, and will share the result as well.
Ok, this is really complicated. I tried to stress my CPU using `stress` alongside `s-tui`, but it does not seem to make any difference, I think.
Sorry I had to upload the videos somewhere else because they are too big for GitHub.
Multiple failures after 1 minute of video. I was able to reproduce the failure several times offline as well.
https://1drv.ms/v/s!AntpCaY1YjnMi6ItGFtMq9wgcVb4Nw?e=RoLwsh
No failures in almost 5 minutes of testing. I wasn't able to catch a failure at all using `S6_SERVICES_READYTIME=50`.
https://1drv.ms/v/s!AntpCaY1YjnMi6ItGFtMq9wgcVb4Nw?e=ZdP36s
I know the testing environment is far from ideal, but I think this is already a very good signal.
I wrote a simple script to test it for me:
```bash
#!/bin/bash

# Run $attempts short-lived containers with S6_SERVICES_READYTIME=$1 and
# count how many hit the s6-svwait race.
function test() {
  rm -f "$1"-log.txt
  local attempts=500
  local failures=0
  for i in $(seq 1 "$attempts"); do
    clear
    echo "S6_SERVICES_READYTIME=${1} - Attempt $i of $attempts"
    set -x
    docker rm -f test
    docker run --name test --env=S6_SERVICES_READYTIME="$1" --rm -d \
      ghcr.io/home-assistant/generic-x86-64-homeassistant:2022.11.4
    sleep 1.25s
    set +x
    if docker logs test 2>&1 |
      tee /dev/stderr |
      grep -q "s6-svwait: fatal: unable to s6_svstatus_read: No such file or directory"; then
      echo "attempt ${i} at $(date)" >>"$1"-log.txt
      failures=$((failures + 1))
      # Ring the terminal bell so failures are noticeable
      echo -e '\a'
    fi
  done
  echo "Failed ${failures} out of ${attempts} attempts for S6_SERVICES_READYTIME=${1}" >>result.txt
}

test 0
test 5
test 50
```
Here's the result:
```
Failed 43 out of 500 attempts for S6_SERVICES_READYTIME=0
Failed 48 out of 500 attempts for S6_SERVICES_READYTIME=5
Failed 0 out of 500 attempts for S6_SERVICES_READYTIME=50
```
Failures for `S6_SERVICES_READYTIME=0`:
attempt 1 at Thu Nov 24 20:17:48 -03 2022
attempt 7 at Thu Nov 24 20:18:03 -03 2022
attempt 9 at Thu Nov 24 20:18:08 -03 2022
attempt 12 at Thu Nov 24 20:18:16 -03 2022
attempt 13 at Thu Nov 24 20:18:19 -03 2022
attempt 53 at Thu Nov 24 20:20:13 -03 2022
attempt 58 at Thu Nov 24 20:20:28 -03 2022
attempt 80 at Thu Nov 24 20:21:35 -03 2022
attempt 93 at Thu Nov 24 20:22:44 -03 2022
attempt 114 at Thu Nov 24 20:23:51 -03 2022
attempt 115 at Thu Nov 24 20:23:53 -03 2022
attempt 119 at Thu Nov 24 20:24:03 -03 2022
attempt 135 at Thu Nov 24 20:24:56 -03 2022
attempt 156 at Thu Nov 24 20:25:50 -03 2022
attempt 177 at Thu Nov 24 20:26:43 -03 2022
attempt 202 at Thu Nov 24 20:27:47 -03 2022
attempt 205 at Thu Nov 24 20:27:55 -03 2022
attempt 215 at Thu Nov 24 20:28:21 -03 2022
attempt 228 at Thu Nov 24 20:28:53 -03 2022
attempt 231 at Thu Nov 24 20:29:15 -03 2022
attempt 232 at Thu Nov 24 20:29:31 -03 2022
attempt 235 at Thu Nov 24 20:29:38 -03 2022
attempt 240 at Thu Nov 24 20:29:51 -03 2022
attempt 250 at Thu Nov 24 20:30:17 -03 2022
attempt 259 at Thu Nov 24 20:30:40 -03 2022
attempt 282 at Thu Nov 24 20:32:08 -03 2022
attempt 296 at Thu Nov 24 20:32:44 -03 2022
attempt 330 at Thu Nov 24 20:34:11 -03 2022
attempt 349 at Thu Nov 24 20:34:59 -03 2022
attempt 352 at Thu Nov 24 20:35:07 -03 2022
attempt 359 at Thu Nov 24 20:35:28 -03 2022
attempt 398 at Thu Nov 24 20:37:22 -03 2022
attempt 411 at Thu Nov 24 20:37:55 -03 2022
attempt 417 at Thu Nov 24 20:38:11 -03 2022
attempt 425 at Thu Nov 24 20:38:31 -03 2022
attempt 428 at Thu Nov 24 20:38:39 -03 2022
attempt 434 at Thu Nov 24 20:38:54 -03 2022
attempt 441 at Thu Nov 24 20:39:12 -03 2022
attempt 444 at Thu Nov 24 20:39:19 -03 2022
attempt 461 at Thu Nov 24 20:40:02 -03 2022
attempt 463 at Thu Nov 24 20:40:07 -03 2022
attempt 477 at Thu Nov 24 20:40:43 -03 2022
attempt 497 at Thu Nov 24 20:41:35 -03 2022
Failures for `S6_SERVICES_READYTIME=5`:
attempt 16 at Thu Nov 24 21:13:33 -03 2022
attempt 21 at Thu Nov 24 21:13:46 -03 2022
attempt 31 at Thu Nov 24 21:14:14 -03 2022
attempt 37 at Thu Nov 24 21:14:30 -03 2022
attempt 38 at Thu Nov 24 21:14:32 -03 2022
attempt 54 at Thu Nov 24 21:15:16 -03 2022
attempt 58 at Thu Nov 24 21:15:26 -03 2022
attempt 59 at Thu Nov 24 21:15:29 -03 2022
attempt 60 at Thu Nov 24 21:15:32 -03 2022
attempt 63 at Thu Nov 24 21:15:40 -03 2022
attempt 65 at Thu Nov 24 21:15:45 -03 2022
attempt 85 at Thu Nov 24 21:16:40 -03 2022
attempt 86 at Thu Nov 24 21:16:42 -03 2022
attempt 96 at Thu Nov 24 21:17:09 -03 2022
attempt 99 at Thu Nov 24 21:17:17 -03 2022
attempt 104 at Thu Nov 24 21:17:29 -03 2022
attempt 110 at Thu Nov 24 21:17:44 -03 2022
attempt 112 at Thu Nov 24 21:17:49 -03 2022
attempt 116 at Thu Nov 24 21:17:59 -03 2022
attempt 124 at Thu Nov 24 21:18:20 -03 2022
attempt 132 at Thu Nov 24 21:18:41 -03 2022
attempt 143 at Thu Nov 24 21:19:27 -03 2022
attempt 154 at Thu Nov 24 21:20:12 -03 2022
attempt 163 at Thu Nov 24 21:20:36 -03 2022
attempt 202 at Thu Nov 24 21:22:38 -03 2022
attempt 231 at Thu Nov 24 21:24:01 -03 2022
attempt 239 at Thu Nov 24 21:24:22 -03 2022
attempt 260 at Thu Nov 24 21:25:59 -03 2022
attempt 267 at Thu Nov 24 21:26:18 -03 2022
attempt 271 at Thu Nov 24 21:26:28 -03 2022
attempt 282 at Thu Nov 24 21:27:11 -03 2022
attempt 298 at Thu Nov 24 21:27:56 -03 2022
attempt 299 at Thu Nov 24 21:27:58 -03 2022
attempt 320 at Thu Nov 24 21:28:53 -03 2022
attempt 325 at Thu Nov 24 21:29:06 -03 2022
attempt 327 at Thu Nov 24 21:29:11 -03 2022
attempt 331 at Thu Nov 24 21:29:22 -03 2022
attempt 333 at Thu Nov 24 21:29:27 -03 2022
attempt 339 at Thu Nov 24 21:29:43 -03 2022
attempt 355 at Thu Nov 24 21:30:37 -03 2022
attempt 385 at Thu Nov 24 21:32:10 -03 2022
attempt 409 at Thu Nov 24 21:33:13 -03 2022
attempt 438 at Thu Nov 24 21:34:30 -03 2022
attempt 439 at Thu Nov 24 21:34:32 -03 2022
attempt 445 at Thu Nov 24 21:34:48 -03 2022
attempt 463 at Thu Nov 24 21:35:34 -03 2022
attempt 479 at Thu Nov 24 21:36:34 -03 2022
attempt 497 at Thu Nov 24 21:37:24 -03 2022
I think this is conclusive; I can safely say that increasing `S6_SERVICES_READYTIME` is effective.
So apparently 5 ms is doing nothing, which in hindsight isn't that surprising because it's a very small delay. 50 ms feels like a bit much, but is probably still in the unnoticeable range, so if it works for you, it should be safe for everyone. I increased the default sleep time to 50 ms in the latest git.
Thanks for testing!
> I increased the default sleep time to 50 ms in the latest git.
Wow, thanks so much! This will bring me a lot of peace of mind.
Since we upgraded to 3.x, we have started observing seemingly random errors during the startup sequence of some of our applications. The error symptoms are as follows:
This is a container that runs a legacy cont-init layer and two services via the legacy services.d layer (nginx + php-fpm). Other s6 environment details:
Unfortunately I don't have an isolated reproduction scenario to provide; the issue happens intermittently for us as well, and I haven't found a way to consistently reproduce it. We can only observe it during our CI pipelines, despite using the exact same Docker containers that we use everywhere else.
I'm mostly posting this looking for hints or clues as to what might be going on.
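For readers unfamiliar with the legacy layers mentioned above, here is a rough sketch of that layout, built in a temp directory for illustration. The run-script contents and the `00-init` name are invented, not the reporter's actual scripts:

```shell
#!/bin/sh
# Sketch of a legacy-style s6-overlay layout: cont-init.d oneshots plus
# two services.d services (nginx + php-fpm, as in the report above).
root="$(mktemp -d)"

# cont-init.d: oneshot scripts that run before any service starts.
mkdir -p "$root/etc/cont-init.d"
printf '#!/command/with-contenv sh\necho init\n' > "$root/etc/cont-init.d/00-init"

# services.d: one directory per supervised service, each with a run script.
for svc in nginx php-fpm; do
  mkdir -p "$root/etc/services.d/$svc"
  printf '#!/command/with-contenv sh\nexec %s\n' "$svc" \
    > "$root/etc/services.d/$svc/run"
done

find "$root/etc" -type f | sort
```

In this layout the supervisors for `nginx` and `php-fpm` are spawned at the same time as the services themselves, which is what exposes the `s6-svwait` race discussed in this thread.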