kubernetes-sigs / kind

Kubernetes IN Docker - local clusters for testing Kubernetes
https://kind.sigs.k8s.io/
Apache License 2.0

Race condition causes `"Reached target .*Multi-User System.*"` error #3758

Open DerekTBrown opened 5 days ago

DerekTBrown commented 5 days ago

I am unable to create a kind cluster using kind create cluster. I get the following error:


ERROR: failed to create cluster: could not find a log line that matches "Reached target .*Multi-User System.*|detected cgroup v1"

Stack Trace:
sigs.k8s.io/kind/pkg/errors.Errorf
    sigs.k8s.io/kind/pkg/errors/errors.go:41
sigs.k8s.io/kind/pkg/cluster/internal/providers/common.WaitUntilLogRegexpMatches
    sigs.k8s.io/kind/pkg/cluster/internal/providers/common/cgroups.go:84
sigs.k8s.io/kind/pkg/cluster/internal/providers/docker.createContainerWithWaitUntilSystemdReachesMultiUserSystem
    sigs.k8s.io/kind/pkg/cluster/internal/providers/docker/provision.go:417
sigs.k8s.io/kind/pkg/cluster/internal/providers/docker.planCreation.func3
    sigs.k8s.io/kind/pkg/cluster/internal/providers/docker/provision.go:123
sigs.k8s.io/kind/pkg/errors.UntilErrorConcurrent.func1
    sigs.k8s.io/kind/pkg/errors/concurrent.go:30
runtime.goexit
    runtime/asm_amd64.s:1695
creating kind cluster: exit status 1

When I run kind export logs, I plainly see this line in journal.log:

Oct 15 00:54:00 <my-cluster-name> systemd[1]: Reached target multi-user.target - Multi-User System.
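
For what it's worth, that line does match the pattern from the error message. A quick standalone Go check (illustrative only; the timestamp and redacted hostname are omitted, since the regexp is unanchored anyway):

package main

import (
	"fmt"
	"regexp"
)

func main() {
	// the exact pattern from the kind error message (unanchored alternation)
	re := regexp.MustCompile("Reached target .*Multi-User System.*|detected cgroup v1")
	// the systemd line from journal.log, prefix shortened
	line := "systemd[1]: Reached target multi-user.target - Multi-User System."
	fmt.Println(re.MatchString(line)) // prints: true
}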

I don't see anything else in the log bundle to indicate an error:

kind_logs.tar.gz


Appendix: possible root causes from similar issues that I ruled out

DerekTBrown commented 5 days ago

Cluster creation fails after just 3s, though the logs eventually show the correct string:

$ time kind create cluster --retain
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.31.0) 🖼
 ✗ Preparing nodes 📦
ERROR: failed to create cluster: could not find a log line that matches "Reached target .*Multi-User System.*|detected cgroup v1"

real    0m3.388s
user    0m0.188s
sys 0m0.188s

$ docker logs -f kind-control-plane | grep -E "Reached target .*Multi-User System.*|detected cgroup v1"
[  OK  ] Reached target multi-user.target - Multi-User System.

DerekTBrown commented 5 days ago

Something funny is going on here. The error stack indicates that we are hitting this line, which looks like it should only be reached when the 30s log timeout is hit.
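
For intuition, here is a sketch of the general shape such a waiter could have (illustrative only, not kind's actual code; names are mine). Besides the timeout, there is a second path to the same error: the docker logs stream ending before a match. That path returns immediately, which would explain failing at 3s instead of 30s:

package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"regexp"
	"time"
)

// waitForLogLine streams `docker logs -f <container>` and returns nil once a
// line matches re. Sketch only; kind's real code lives in cgroups.go.
func waitForLogLine(container string, re *regexp.Regexp, timeout time.Duration) error {
	cmd := exec.Command("docker", "logs", "-f", container)
	out, err := cmd.StdoutPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return err
	}
	defer cmd.Process.Kill()

	matched := make(chan bool, 1)
	go func() {
		scanner := bufio.NewScanner(out)
		for scanner.Scan() {
			if re.MatchString(scanner.Text()) {
				matched <- true
				return
			}
		}
		// stream ended (EOF, or `docker logs` itself failed) before a match
		matched <- false
	}()

	select {
	case ok := <-matched:
		if !ok {
			// early-exit path: reached without the timeout ever firing
			return fmt.Errorf("could not find a log line that matches %q", re.String())
		}
		return nil
	case <-time.After(timeout):
		return fmt.Errorf("timed out waiting for a log line matching %q", re.String())
	}
}

func main() {
	re := regexp.MustCompile("Reached target .*Multi-User System.*|detected cgroup v1")
	fmt.Println(waitForLogLine("kind-control-plane", re, 30*time.Second))
}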

DerekTBrown commented 5 days ago

Adding to the exceptional weirdness: this fails 100% of the time when invoked directly, but only ~25% of the time when invoked under strace:

$ strace -f -o trace_output_3.txt kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.31.0) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Have a nice day! 👋

DerekTBrown commented 5 days ago

To validate my hypothesis, I wrote a wrapper script for docker:

#!/bin/bash
# Wrapper placed ahead of the real docker on PATH: delay any `docker logs`
# invocation by one second, to test the run-vs-logs race hypothesis.

if [[ "$1" == "logs" ]]; then
  sleep 1s
fi

exec /usr/local/bin/docker "$@"

With the wrapper in place, cluster creation magically succeeds.
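
The suspected sequence can also be exercised outside kind. A minimal standalone sketch (assumptions: Docker is installed and the busybox image is pullable; this mimics the run-then-immediately-logs pattern and is not kind's code):

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// start a detached container that prints a known line, analogous to
	// kind's `docker run -d` of the node image
	out, err := exec.Command("docker", "run", "-d", "--rm", "busybox",
		"sh", "-c", "echo ready && sleep 3").Output()
	if err != nil {
		panic(err)
	}
	id := strings.TrimSpace(string(out))

	// immediately ask for logs with no delay, i.e. the window that the
	// wrapper's `sleep 1s` papers over
	logs, err := exec.Command("docker", "logs", id).CombinedOutput()
	fmt.Printf("err=%v logs=%q\n", err, logs)
}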

aojea commented 5 days ago

The attached log has 8 worker nodes. Can you paste the exact configuration and steps that you are running, and a kind create cluster without any additional config files too? are you using cgroupsv1?

DerekTBrown commented 5 days ago

a kind create cluster without any additional config files too?

This reproduces with just kind create cluster as well. Will try to get a log export when I have a second.

DerekTBrown commented 5 days ago

are you using cgroupsv1?

No, this system has cgroupsv2.

stmcginnis commented 5 days ago

This is missing a lot of important details about your environment that may be useful for troubleshooting what is happening. Can you update the description with the requested output from the issue template?

Environment:

Also, since you can get this to repro with just a basic kind create cluster, please clean things up (docker system prune -a might not be a bad idea too) and get the logs from that simpler scenario to cut down on the noise. kind create cluster --retain and kind export logs.

DerekTBrown commented 5 days ago

Can you update the description with the requested output from the issue template?

FYI: I think these are missing from the "Question" template:

Screenshot 2024-10-15 at 7 55 36 AM

They are present in the "Bug" template, but I wasn't sure this was a bug at the time of filing.

DerekTBrown commented 5 days ago

kind version (use kind version):

kind v0.24.0 go1.22.6 linux/amd64

Runtime info (use docker info, podman info or nerdctl info):

Client: Docker Engine - Community
 Version:    27.3.1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.17.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.29.7
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 8
  Running: 8
  Paused: 0
  Stopped: 0
 Images: 4
 Server Version: 27.3.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
 runc version: v1.1.14-0-g2c9f560
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.8.0-1015-aws
 Operating System: Ubuntu 22.04.5 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 8
 Total Memory: 30.82GiB
 Name: ip-10-107-37-51
 ID: 28b51eae-72ea-4480-8302-b9ba9b4dc122
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 110
  Goroutines: 188
  System Time: 2024-10-15T15:16:17.221998998Z
  EventsListeners: 0
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled

OS (e.g. from /etc/os-release):

PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

Kubernetes version (use kubectl version):

Client Version: v1.31.1
Kustomize Version: v5.4.2

Any proxies or other special environment settings?: Not that I am aware of. Standard Ubuntu OS.

stmcginnis commented 5 days ago

Thanks for the extra details! I didn't realize it wasn't on the questions template either. :]

DerekTBrown commented 5 days ago

Logs from a standalone kind create cluster --retain:

$ ./kind create cluster --retain
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.31.0) 🖼
 ✗ Preparing nodes 📦
ERROR: failed to create cluster: could not find a log line that matches "Reached target .*Multi-User System.*|detected cgroup v1"

kind-logs.tar.gz

DerekTBrown commented 5 days ago

Is it possible there is a race condition between docker run and docker logs here, such that docker logs fails?

https://github.com/kubernetes-sigs/kind/blob/main/pkg/cluster/internal/providers/docker/provision.go#L410-L417

BenTheElder commented 5 days ago

Is it possible there is a race condition between docker run and docker logs here, such that docker logs fails?

Switched link to permalink so we get the exact current code here: https://github.com/kubernetes-sigs/kind/blob/10e058c534cf6b8546be70f1eea2e414a08ff770/pkg/cluster/internal/providers/docker/provision.go#L410-L417

Run() waits for process completion. So the container is running, unless there's a bug in docker. (docker run returns the container ID after starting it, not to be confused with Run() on the process).
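
To spell out the distinction in code (a sketch, assuming Docker and busybox as in the earlier repro; busybox stands in for the node image):

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// `docker run -d` prints the container ID and exits once the container has
	// been created and started; Output() here, like Run(), waits only for the
	// docker CLI process to finish, not for the container itself
	out, err := exec.Command("docker", "run", "-d", "--rm", "busybox", "sleep", "10").Output()
	if err != nil {
		panic(err)
	}
	// per `docker run` semantics, the container should already be running here
	fmt.Println("started container:", strings.TrimSpace(string(out)))
}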

BenTheElder commented 5 days ago

Maybe docker run started returning the container ID earlier, before the process is actually started?

BenTheElder commented 5 days ago

Something funny is going on here. The error stack indicates that we are hitting this line, which looks like it should only be reached when the 30s log timeout is hit.

Sounds like docker logs is failing entirely. Is the container getting restarted once, or do we just have a race now (maybe https://github.com/kubernetes-sigs/kind/issues/3758#issuecomment-2414603108)?

If the latter, then we may have to add some check that the container is running before fetching logs ... but that has been assumed to hold after docker run for a long time now (docker run is supposed to be like docker container create && docker container start; after start the container should be running and it should be possible to start streaming logs ...)
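
A sketch of what such a check could look like (docker inspect and its -f template flag are real; the function name, polling interval, and retry count are mine, and whether kind should do this at all is exactly the open question):

package main

import (
	"fmt"
	"os/exec"
	"strings"
	"time"
)

// ensureRunning polls `docker inspect` until the container reports
// State.Running=true, or gives up after roughly two seconds.
func ensureRunning(container string) error {
	for i := 0; i < 10; i++ {
		out, err := exec.Command(
			"docker", "inspect", "-f", "{{.State.Running}}", container,
		).Output()
		if err == nil && strings.TrimSpace(string(out)) == "true" {
			return nil
		}
		time.Sleep(200 * time.Millisecond)
	}
	return fmt.Errorf("container %s is not running", container)
}

func main() {
	if err := ensureRunning("kind-control-plane"); err != nil {
		panic(err)
	}
	fmt.Println("ok to start streaming logs")
}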