Open DerekTBrown opened 5 days ago
Cluster creation fails after just 3s, though the logs eventually show the correct string:
$ time kind create cluster --retain
Creating cluster "kind" ...
✓ Ensuring node image (kindest/node:v1.31.0) 🖼
✗ Preparing nodes 📦
ERROR: failed to create cluster: could not find a log line that matches "Reached target .*Multi-User System.*|detected cgroup v1"
real 0m3.388s
user 0m0.188s
sys 0m0.188s
$ docker logs -f kind-control-plane | grep -E "Reached target .*Multi-User System.*|detected cgroup v1"
[ OK ] Reached target multi-user.target - Multi-User System.
Something funny is going on here. The error stack indicates that we are hitting this line, which seems like it is only expected to be invoked when the 30s log timeout is hit.
Adding to the exceptional weirdness: this fails 100% of the time when invoked directly, but only ~25% of the time when invoked within strace:
$ strace -f -o trace_output_3.txt kind create cluster
Creating cluster "kind" ...
✓ Ensuring node image (kindest/node:v1.31.0) 🖼
✓ Preparing nodes 📦
✓ Writing configuration 📜
✓ Starting control-plane 🕹️
✓ Installing CNI 🔌
✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:
kubectl cluster-info --context kind-kind
Have a nice day! 👋
To validate my hypothesis, I wrote a wrapper script for docker:
#!/bin/bash
# Delay `docker logs` slightly before delegating to the real binary.
if [[ "$1" == "logs" ]]; then
  sleep 1s
fi
/usr/local/bin/docker "$@"
With this wrapper in place, cluster creation magically succeeds.
The attached log has 8 worker nodes. Can you paste the exact configuration and steps that you are running? Does a plain kind create cluster without any additional config files fail too?
Are you using cgroupsv1?
Does a plain kind create cluster without any additional config files fail too?
This reproduces with just kind create cluster as well. Will try to get a log export when I have a second.
are you using cgroupsv1?
No, this system has cgroupsv2.
This is missing a lot of important details about your environment that may be useful in troubleshooting what is happening. Can you update the description with the requested output from the issue template?
Environment:
- kind version (use kind version)
- Runtime info (use docker info, podman info or nerdctl info)
- OS (e.g. from /etc/os-release)
- Kubernetes version (use kubectl version)
Also, since you can get this to repro with just a basic kind create cluster, please clean things up (docker system prune -a might not be a bad idea too) and get the logs from that simpler scenario to cut down on the noise: kind create cluster --retain and kind export logs.
Can you update the description with the requested output from the issue template?
FYI: I think these are missing from the "Question" template:
They are present in the "Bug" template, but I wasn't sure this was a bug at the time of filing.
kind version: (use kind version):
kind v0.24.0 go1.22.6 linux/amd64
Runtime info: (use docker info, podman info or nerdctl info):
Client: Docker Engine - Community
Version: 27.3.1
Context: default
Debug Mode: false
Plugins:
buildx: Docker Buildx (Docker Inc.)
Version: v0.17.1
Path: /usr/libexec/docker/cli-plugins/docker-buildx
compose: Docker Compose (Docker Inc.)
Version: v2.29.7
Path: /usr/libexec/docker/cli-plugins/docker-compose
Server:
Containers: 8
Running: 8
Paused: 0
Stopped: 0
Images: 4
Server Version: 27.3.1
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: runc io.containerd.runc.v2
Default Runtime: runc
Init Binary: docker-init
containerd version: 7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
runc version: v1.1.14-0-g2c9f560
init version: de40ad0
Security Options:
apparmor
seccomp
Profile: builtin
cgroupns
Kernel Version: 6.8.0-1015-aws
Operating System: Ubuntu 22.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 30.82GiB
Name: ip-10-107-37-51
ID: 28b51eae-72ea-4480-8302-b9ba9b4dc122
Docker Root Dir: /var/lib/docker
Debug Mode: true
File Descriptors: 110
Goroutines: 188
System Time: 2024-10-15T15:16:17.221998998Z
EventsListeners: 0
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
OS (e.g. from /etc/os-release):
PRETTY_NAME="Ubuntu 22.04.5 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.5 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
Kubernetes version: (use kubectl version)
Client Version: v1.31.1
Kustomize Version: v5.4.2
Any proxies or other special environment settings?: Not that I am aware of. Standard Ubuntu OS.
Thanks for the extra details! I didn't realize it wasn't on the questions template either. :]
Logs from a standalone kind create cluster --retain:
$ ./kind create cluster --retain
Creating cluster "kind" ...
✓ Ensuring node image (kindest/node:v1.31.0) 🖼
✗ Preparing nodes 📦
ERROR: failed to create cluster: could not find a log line that matches "Reached target .*Multi-User System.*|detected cgroup v1"
Is it possible there is a race condition between docker run and docker logs here, such that docker logs fails?
Is it possible there is a race condition between docker run and docker logs here, such that docker logs fails?
Switched link to permalink so we get the exact current code here: https://github.com/kubernetes-sigs/kind/blob/10e058c534cf6b8546be70f1eea2e414a08ff770/pkg/cluster/internal/providers/docker/provision.go#L410-L417
Run() waits for process completion. So the container is running, unless there's a bug in docker. (docker run returns the container ID after starting it, not to be confused with Run() on the process.)
Maybe docker run started returning the container ID earlier, before the process is actually started?
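One way to probe that hypothesis would be to ask the runtime directly whether the container's process has started, e.g. via docker inspect -f '{{.State.Running}}'. A hedged Go sketch (probeRunning is a hypothetical helper, stubbed with echo here so it runs without docker):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// probeRunning runs a command whose output should be "true" when the
// container's process is actually running. With real docker this would
// be: docker inspect -f '{{.State.Running}}' <container-name>
func probeRunning(name string, args ...string) (bool, error) {
	out, err := exec.Command(name, args...).Output()
	if err != nil {
		return false, err
	}
	return strings.TrimSpace(string(out)) == "true", nil
}

func main() {
	// stubbed with `echo true` so the sketch is runnable anywhere
	running, err := probeRunning("echo", "true")
	fmt.Println(running, err)
}
```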
Something funny is going on here. The error stack indicates that we are hitting this line, which seems like it is only expected to be invoked when the 30s log timeout is hit.
Sounds like docker logs is failing entirely. Is the container getting restarted once, or do we just have a race now (maybe https://github.com/kubernetes-sigs/kind/issues/3758#issuecomment-2414603108 ?)
If the latter, then we may have to add some check that the container is running before fetching logs ... but that's been assumed to be true from docker run for a long time now (which is supposed to be like docker container create && docker container start; after start it should be running and it should be possible to start streaming logs ...).
I am unable to create a Kind cluster using kind create. I get an error. When I run kind export logs, I plainly see this line in the journal.log. I don't see anything else in the log bundle to indicate an error:
kind_logs.tar.gz
Appendix: things I eliminated as possible root causes from similar kind issues:
- max_user_instances or max_user_watches limits. I have increased these to extremely large limits.