kubernetes-sigs / kind

Kubernetes IN Docker - local clusters for testing Kubernetes
https://kind.sigs.k8s.io/
Apache License 2.0

Kind cluster creation fails while Waiting for a healthy kubelet during init #3760

Open fenic-fawkes opened 1 month ago

fenic-fawkes commented 1 month ago

What happened:

kind create cluster fails at "Waiting for a healthy kubelet". Full command: kind create cluster --retain --config kind-config.txt --wait 5m

What you expected to happen:

cluster should be created

Anything else we need to know?:

kind logs

Environment:

Server:
 Containers: 17
  Running: 17
  Paused: 0
  Stopped: 0
 Images: 5
 Server Version: 26.1.3
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8b3b7ca2e5ce38e8f31a34f35b2b68ceb8470d89
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
 Kernel Version: 4.18.0-553.22.1.el8_10.x86_64
 Operating System: Red Hat Enterprise Linux 8.10 (Ootpa)
 OSType: linux
 Architecture: x86_64
 CPUs: 40
 Total Memory: 251.3GiB
 Name: engdev4
 ID: 9e0e4231-b7e3-432b-bc97-6a263007a3b0
 Docker Root Dir: /data/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
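
For anyone triaging a similar report, the fields that matter most in the output above are the cgroup driver, the cgroup version, and the kernel version. Below is a minimal Python sketch that pulls them out of docker info text. The function name parse_docker_info and the default key list are illustrative choices, not part of kind or Docker, and the sketch assumes the usual one "Key: value" pair per line layout.

```python
def parse_docker_info(text, keys=("Cgroup Driver", "Cgroup Version", "Kernel Version")):
    """Return the requested "Key: value" fields found in `docker info` text.

    Hypothetical helper: assumes one "Key: value" pair per line.
    """
    found = {}
    for line in text.splitlines():
        # Split on the first colon only, so values containing ':' survive.
        key, sep, value = line.strip().partition(":")
        if sep and key in keys:
            found[key] = value.strip()
    return found


if __name__ == "__main__":
    # Sample lines copied from the environment report in this issue.
    sample = (
        " Cgroup Driver: cgroupfs\n"
        " Cgroup Version: 1\n"
        " Kernel Version: 4.18.0-553.22.1.el8_10.x86_64\n"
    )
    for key, value in parse_docker_info(sample).items():
        print(f"{key}: {value}")
```

Here the report shows cgroup version 1 with the cgroupfs driver on a 4.18 kernel, which is the combination the replies below focus on.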

stmcginnis commented 1 month ago

Is it possible to upgrade your system? 4.18 is a pretty old kernel version, and cgroupv1 support has been slowly going away. I haven't been able to look yet for a specific failure, but I have a feeling those could be two contributing factors to this.
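
A quick way to confirm both factors on a host is sketched below in Python. It relies on the fact that a cgroup v2 (unified) mount exposes a cgroup.controllers file at the top of /sys/fs/cgroup, while a v1 or hybrid layout does not. The function names and the root parameter are illustrative assumptions, not part of kind.

```python
import os
import platform
import re


def cgroup_version(root="/sys/fs/cgroup"):
    """Return 2 if `root` looks like a cgroup v2 unified hierarchy, else 1."""
    # On cgroup v2 the file cgroup.controllers sits directly under the mount.
    # On v1 (including hybrid setups) it is absent at this level.
    return 2 if os.path.exists(os.path.join(root, "cgroup.controllers")) else 1


def kernel_major_minor(release=None):
    """Parse a release string such as '4.18.0-553.22.1.el8_10.x86_64' into (4, 18).

    Returns None if the string does not start with 'major.minor'.
    """
    release = platform.release() if release is None else release
    match = re.match(r"(\d+)\.(\d+)", release)
    return (int(match.group(1)), int(match.group(2))) if match else None


if __name__ == "__main__":
    print("cgroup version:", cgroup_version())
    print("kernel (major, minor):", kernel_major_minor())
```

On the host described in this issue, a check like this would be expected to agree with docker info: cgroup version 1 and a 4.18 kernel.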

fenic-fawkes commented 1 month ago

Unfortunately no, I'm stuck with this setup for the most part.

stmcginnis commented 1 month ago

Hmm, it does look like it is cgroup related:

err="failed to initialize top level QOS containers: error validating root container [kubelet kubepods] : cgroup [\"kubelet\" \"kubepods\"] has some missing paths: /sys/fs/cgroup/systemd/kubelet.slice/kubelet-kubepods.slice"

fenic-fawkes commented 1 month ago

what's different about k8s 1.23 vs 1.24? because 1.23.17 works while 1.24.17 does not

BenTheElder commented 1 month ago

what's different about k8s 1.23 vs 1.24? because 1.23.17 works while 1.24.17 does not

It could be something like the runc version in kubelet; it's hard to say without a lot of digging.

... Both of those versions are old enough to be out of support upstream in Kubernetes, so kind's support for them is best-effort (we cannot backport anything to them since we're not a fork, which really limits our options, and it's a lot to support).

https://kubernetes.io/releases/


Regarding RHEL 8 and 4.18 ... please see https://github.com/kubernetes-sigs/kind/issues/3558

Can you use a VM with a newer OS/kernel if you can't alter the host?

Realistically, the things we depend on, like Kubernetes, containerd, and runc, are focused on cgroups v2 and more current distros for testing, etc. We don't have the resources ourselves to spend a lot of time out-supporting those projects.

https://kubernetes.io/blog/2024/08/14/kubernetes-1-31-moving-cgroup-v1-support-maintenance-mode/

stmcginnis commented 2 weeks ago

Looks like this is related to an older distro. Is there anything more to do here from the project side of things, or can we close this issue?