kubernetes-sigs / kind

Kubernetes IN Docker - local clusters for testing Kubernetes
https://kind.sigs.k8s.io/
Apache License 2.0

Fail to create cluster locally with `linux-lqx` as kernel on Garuda (~Arch) Linux #2852

Closed: caniko closed this issue 2 years ago

caniko commented 2 years ago

Why is this needed: ~cgroupv2 is supported by k8s, so it should also be supported by kind.~

Revised issue: I am having an issue setting up a kind cluster.

stmcginnis commented 2 years ago

@caniko can you provide a little more detail or context? Kind should work fine with cgroupv2.

There were some older releases that did not. Can you provide the version (`kind version`) you're using?

BenTheElder commented 2 years ago

cgroupv2 is supported, but Kubernetes does not support it before v1.19

https://kind.sigs.k8s.io/docs/user/known-issues/#failure-to-create-cluster-with-cgroups-v2

KIND has had cgroupv2 support since v0.10.0 / January 2021, not long after Kubernetes, with some fixes since then: https://github.com/kubernetes-sigs/kind/releases/tag/v0.10.0

We test cgroupsv2 on every PR.

caniko commented 2 years ago

Oh, the website mentions cgroup v2 issues; I see now that I misinterpreted it.

kind version: kind v0.14.0 go1.18.2 linux/amd64

I am on Arch Linux and behind a proxy. Docker and Kubernetes run fine. I successfully ran minikube in kvm2 mode. Could the proxy be blocking the health check URL?

Stack trace:

✨ [🔴] × make kc
kind create cluster     && make td
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.24.0) đŸ–ŧ
 ✓ Preparing nodes đŸ“Ļ
 ✓ Writing configuration 📜
 ✗ Starting control-plane 🕹ī¸
ERROR: failed to create cluster: failed to init node with kubeadm: command "docker exec --privileged kind-control-plane kubeadm init --skip-phases=preflight --config=/kind/kubeadm.conf --skip-token-print --v=6" failed with error: exit status 1
Command Output: I0802 15:13:00.477226     142 initconfiguration.go:255] loading configuration from "/kind/kubeadm.conf"
W0802 15:13:00.477755     142 initconfiguration.go:332] [config] WARNING: Ignored YAML document with GroupVersionKind kubeadm.k8s.io/v1beta3, Kind=JoinConfiguration
[init] Using Kubernetes version: v1.24.0
[certs] Using certificateDir folder "/etc/kubernetes/pki"
I0802 15:13:00.482701     142 certs.go:112] creating a new certificate authority for ca
[certs] Generating "ca" certificate and key
I0802 15:13:00.592328     142 certs.go:522] validating certificate period for ca certificate
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [kind-control-plane kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local localhost] and IPs [10.96.0.1 172.18.0.2 127.0.0.1]
[certs] Generating "apiserver-kubelet-client" certificate and key
I0802 15:13:00.894601     142 certs.go:112] creating a new certificate authority for front-proxy-ca
[certs] Generating "front-proxy-ca" certificate and key
I0802 15:13:01.034046     142 certs.go:522] validating certificate period for front-proxy-ca certificate
[certs] Generating "front-proxy-client" certificate and key
I0802 15:13:01.081085     142 certs.go:112] creating a new certificate authority for etcd-ca
[certs] Generating "etcd/ca" certificate and key
I0802 15:13:01.134989     142 certs.go:522] validating certificate period for etcd/ca certificate
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [kind-control-plane localhost] and IPs [172.18.0.2 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [kind-control-plane localhost] and IPs [172.18.0.2 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
I0802 15:13:01.665251     142 certs.go:78] creating new public/private key files for signing service account users
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
I0802 15:13:01.871422     142 kubeconfig.go:103] creating kubeconfig file for admin.conf
[kubeconfig] Writing "admin.conf" kubeconfig file
I0802 15:13:02.242946     142 kubeconfig.go:103] creating kubeconfig file for kubelet.conf
[kubeconfig] Writing "kubelet.conf" kubeconfig file
I0802 15:13:02.475033     142 kubeconfig.go:103] creating kubeconfig file for controller-manager.conf
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
I0802 15:13:02.570362     142 kubeconfig.go:103] creating kubeconfig file for scheduler.conf
[kubeconfig] Writing "scheduler.conf" kubeconfig file
I0802 15:13:02.640364     142 kubelet.go:65] Stopping the kubelet
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
I0802 15:13:02.738776     142 manifests.go:99] [control-plane] getting StaticPodSpecs
I0802 15:13:02.739201     142 certs.go:522] validating certificate period for CA certificate
I0802 15:13:02.739245     142 manifests.go:125] [control-plane] adding volume "ca-certs" for component "kube-apiserver"
I0802 15:13:02.739250     142 manifests.go:125] [control-plane] adding volume "etc-ca-certificates" for component "kube-apiserver"
I0802 15:13:02.739254     142 manifests.go:125] [control-plane] adding volume "k8s-certs" for component "kube-apiserver"
I0802 15:13:02.739258     142 manifests.go:125] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-apiserver"
I0802 15:13:02.739261     142 manifests.go:125] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-apiserver"
I0802 15:13:02.740793     142 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-apiserver" to "/etc/kubernetes/manifests/kube-apiserver.yaml"
I0802 15:13:02.740801     142 manifests.go:99] [control-plane] getting StaticPodSpecs
[control-plane] Creating static Pod manifest for "kube-controller-manager"
I0802 15:13:02.740905     142 manifests.go:125] [control-plane] adding volume "ca-certs" for component "kube-controller-manager"
I0802 15:13:02.740911     142 manifests.go:125] [control-plane] adding volume "etc-ca-certificates" for component "kube-controller-manager"
I0802 15:13:02.740915     142 manifests.go:125] [control-plane] adding volume "flexvolume-dir" for component "kube-controller-manager"
I0802 15:13:02.740919     142 manifests.go:125] [control-plane] adding volume "k8s-certs" for component "kube-controller-manager"
I0802 15:13:02.740923     142 manifests.go:125] [control-plane] adding volume "kubeconfig" for component "kube-controller-manager"
I0802 15:13:02.740927     142 manifests.go:125] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-controller-manager"
I0802 15:13:02.740931     142 manifests.go:125] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-controller-manager"
I0802 15:13:02.741438     142 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-controller-manager" to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
I0802 15:13:02.741445     142 manifests.go:99] [control-plane] getting StaticPodSpecs
[control-plane] Creating static Pod manifest for "kube-scheduler"
I0802 15:13:02.741550     142 manifests.go:125] [control-plane] adding volume "kubeconfig" for component "kube-scheduler"
I0802 15:13:02.741873     142 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-scheduler" to "/etc/kubernetes/manifests/kube-scheduler.yaml"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
I0802 15:13:02.742778     142 local.go:65] [etcd] wrote Static Pod manifest for a local etcd member to "/etc/kubernetes/manifests/etcd.yaml"
I0802 15:13:02.742793     142 waitcontrolplane.go:83] [wait-control-plane] Waiting for the API server to be healthy
I0802 15:13:02.743155     142 loader.go:372] Config loaded from file:  /etc/kubernetes/admin.conf
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
I0802 15:13:02.743871     142 round_trippers.go:553] GET https://kind-control-plane:6443/healthz?timeout=10s  in 0 milliseconds
I0802 15:13:03.244407     142 round_trippers.go:553] GET https://kind-control-plane:6443/healthz?timeout=10s  in 0 

...

I0802 15:17:00.744949     142 round_trippers.go:553] GET https://kind-control-plane:6443/healthz?timeout=10s  in 0 milliseconds
I0802 15:17:01.244899     142 round_trippers.go:553] GET https://kind-control-plane:6443/healthz?timeout=10s  in 0 milliseconds
I0802 15:17:01.744759     142 round_trippers.go:553] GET https://kind-control-plane:6443/healthz?timeout=10s  in 0 milliseconds
I0802 15:17:02.244476     142 round_trippers.go:553] GET https://kind-control-plane:6443/healthz?timeout=10s  in 0 milliseconds
I0802 15:17:02.745332     142 round_trippers.go:553] GET https://kind-control-plane:6443/healthz?timeout=10s  in 0 milliseconds
I0802 15:17:02.745615     142 round_trippers.go:553] GET https://kind-control-plane:6443/healthz?timeout=10s  in 0 milliseconds

Unfortunately, an error has occurred:
        timed out waiting for the condition

This error is likely caused by:
        - The kubelet is not running
        - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
        - 'systemctl status kubelet'
        - 'journalctl -xeu kubelet'

Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
        - 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
        Once you have found the failing container, you can inspect its logs with:
        - 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs CONTAINERID'
couldn't initialize a Kubernetes cluster
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init.runWaitControlPlanePhase
        cmd/kubeadm/app/cmd/phases/init/waitcontrolplane.go:108
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:234
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:421
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:207
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdInit.func1
        cmd/kubeadm/app/cmd/init.go:153
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute
        vendor/github.com/spf13/cobra/command.go:856
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC
        vendor/github.com/spf13/cobra/command.go:974
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute
        vendor/github.com/spf13/cobra/command.go:902
k8s.io/kubernetes/cmd/kubeadm/app.Run
        cmd/kubeadm/app/kubeadm.go:50
main.main
        cmd/kubeadm/kubeadm.go:25
runtime.main
        /usr/local/go/src/runtime/proc.go:250
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1571
error execution phase wait-control-plane
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:235
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:421
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        cmd/kubeadm/app/cmd/phases/workflow/runner.go:207
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdInit.func1
        cmd/kubeadm/app/cmd/init.go:153
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute
        vendor/github.com/spf13/cobra/command.go:856
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC
        vendor/github.com/spf13/cobra/command.go:974
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute
        vendor/github.com/spf13/cobra/command.go:902
k8s.io/kubernetes/cmd/kubeadm/app.Run
        cmd/kubeadm/app/kubeadm.go:50
main.main
        cmd/kubeadm/kubeadm.go:25
runtime.main
        /usr/local/go/src/runtime/proc.go:250
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1571
make: *** [Makefile:6: kc] Error 1
BenTheElder commented 2 years ago

Can you share the logs from `kind create cluster --retain; kind export logs; kind delete cluster`?

BenTheElder commented 2 years ago

It's possible it's the proxy, but we should be setting a sufficient no_proxy; we can check what the component logs show.
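
As a rough check (assuming kind picks up the standard proxy variables from the calling shell and the default node container name), you can compare what the host exports with what actually landed inside the node:

```bash
# Proxy variables in the host shell
env | grep -i _proxy

# Proxy variables inside the node container
docker exec kind-control-plane env | grep -i _proxy
```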

caniko commented 2 years ago

The stack trace is too long. Should I have saved it to a file?

Here is the tail:

I0802 15:50:20.389803     142 round_trippers.go:553] GET https://kind-control-plane:6443/healthz?timeout=10s  in 0 milliseconds
I0802 15:50:20.390123     142 round_trippers.go:553] GET https://kind-control-plane:6443/healthz?timeout=10s  in 0 milliseconds

Unfortunately, an error has occurred:
timed out waiting for the condition

This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'

Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
- 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs CONTAINERID'
couldn't initialize a Kubernetes cluster
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init.runWaitControlPlanePhase
cmd/kubeadm/app/cmd/phases/init/waitcontrolplane.go:108
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
cmd/kubeadm/app/cmd/phases/workflow/runner.go:234
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
cmd/kubeadm/app/cmd/phases/workflow/runner.go:421
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
cmd/kubeadm/app/cmd/phases/workflow/runner.go:207
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdInit.func1
cmd/kubeadm/app/cmd/init.go:153
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute
vendor/github.com/spf13/cobra/command.go:856
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC
vendor/github.com/spf13/cobra/command.go:974
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute
vendor/github.com/spf13/cobra/command.go:902
k8s.io/kubernetes/cmd/kubeadm/app.Run
cmd/kubeadm/app/kubeadm.go:50
main.main
cmd/kubeadm/kubeadm.go:25
runtime.main
/usr/local/go/src/runtime/proc.go:250
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1571
error execution phase wait-control-plane
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
cmd/kubeadm/app/cmd/phases/workflow/runner.go:235
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
cmd/kubeadm/app/cmd/phases/workflow/runner.go:421
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
cmd/kubeadm/app/cmd/phases/workflow/runner.go:207
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdInit.func1
cmd/kubeadm/app/cmd/init.go:153
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute
vendor/github.com/spf13/cobra/command.go:856
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC
vendor/github.com/spf13/cobra/command.go:974
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute
vendor/github.com/spf13/cobra/command.go:902
k8s.io/kubernetes/cmd/kubeadm/app.Run
cmd/kubeadm/app/kubeadm.go:50
main.main
cmd/kubeadm/kubeadm.go:25
runtime.main
/usr/local/go/src/runtime/proc.go:250
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1571
Exporting logs for cluster "kind" to:
/tmp/2561045446
Deleting cluster "kind" ...
BenTheElder commented 2 years ago

> The stack trace is too long. Should I have saved it to a file?

The stack trace does not contain all the node logs, just the kind / kubeadm output, but the commands in https://github.com/kubernetes-sigs/kind/issues/2852#issuecomment-1202835966 would export many more log files to inspect.

BenTheElder commented 2 years ago

The error / trace tells us the kubelet didn't become healthy, but to find out why, we need to dig deeper into the system.
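
A rough sketch of what digging deeper can look like (assuming the default control-plane container name; the node runs systemd, so its journals are readable via docker exec):

```bash
# Tail the kubelet and containerd journals inside the node container
docker exec kind-control-plane journalctl -u kubelet --no-pager | tail -n 50
docker exec kind-control-plane journalctl -u containerd --no-pager | tail -n 50

# List the Kubernetes containers the node managed to start
# (same crictl invocation that the kubeadm error message suggests)
docker exec kind-control-plane crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a
```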

caniko commented 2 years ago

There were no log files. Could you be more explicit about what I should provide? I made sure to run the commands you provided.

stmcginnis commented 2 years ago

Maybe you should break the commands out and run them one at a time.

- `kind create cluster --retain` first, to attempt to create the cluster but not delete it on failure.
- `kind export logs` - this should give you all the log files that should (hopefully) provide the clues needed to figure out what is happening.
- `kind delete cluster` can be run once the logs are successfully captured.
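
If it helps, the same sequence as a copy-pasteable sketch; `kind export logs` also accepts an explicit output directory (the ./kind-logs path here is just an example):

```bash
kind create cluster --retain   # keep the node container around even if creation fails
kind export logs ./kind-logs   # or omit the path to get a /tmp/<random> directory
kind delete cluster            # clean up once the logs are captured
```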

caniko commented 2 years ago

I just noticed one line where it said the logs were exported to /tmp/<random number>. Got everything!

Root of logs: kind-version.txt docker-info.txt

Control-plane logs: serial.log kubernetes-version.txt kubelet.log images.log containerd.log alternatives.log journal.log

BenTheElder commented 2 years ago

Could you upload a zip or tarball with the full directory? There are some other useful files, and GitHub will accept a zip or tarball.
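
For example, a one-liner to bundle the directory that `kind export logs` printed (using the /tmp/2561045446 path from the output above):

```bash
tar -czf kind-logs.tar.gz -C /tmp 2561045446
```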

BenTheElder commented 2 years ago

Two things jump out from what we have so far:

And then there's this in the kubelet logs

Aug 02 15:46:20 kind-control-plane kubelet[212]: E0802 15:46:20.638146 212 kubelet.go:1378] "Failed to start ContainerManager" err="failed to initialize top level QOS containers: root container [kubelet kubepods] doesn't exist"

Aug 02 15:46:21 kind-control-plane kubelet[305]: E0802 15:46:21.939139 305 cgroup_manager_linux.go:473] cgroup manager.Set failed: openat2 /sys/fs/cgroup/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-besteffort.slice/cpu.weight: no such file or directory

(similar logs in containerd)

Does not look proxy related.
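
One quick, non-definitive check suggested by those errors: on a cgroup v2 host the controllers the kernel offers are listed in cgroup.controllers, and the kubelet generally needs at least cpu, memory and pids to be present and delegated into the node (default container name assumed):

```bash
# Controllers the host kernel exposes
cat /sys/fs/cgroup/cgroup.controllers

# Controllers delegated into the kind node
docker exec kind-control-plane cat /sys/fs/cgroup/cgroup.controllers
```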

caniko commented 2 years ago

I did not know that the kernel was an important component for Kubernetes! I am using linux-lqx-bmq, which is downstream from zen-kernel. Switching back to zen fixed the issue.

Garuda is Arch with some desktop environment sugar. I made the switch from zen to lqx myself, so it is my fault.

BenTheElder commented 2 years ago

It's possible the custom kernel lacks some cgroups support (CPU weight?); IIRC, previous problems with Arch kernels were that sort of thing.

Since the kind "nodes" run on the shared host kernel, that kernel must meet Kubernetes's requirements, e.g. for supported cgroup controllers.
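
As a rough way to compare kernels on that front (this only works if the kernel exposes its build config via /proc/config.gz, i.e. CONFIG_IKCONFIG_PROC, which many distro kernels enable):

```bash
# Grep the running kernel's config for cgroup / group-scheduling options
zgrep -E 'CONFIG_CGROUPS|CONFIG_CGROUP_SCHED|CONFIG_FAIR_GROUP_SCHED|CONFIG_CFS_BANDWIDTH' /proc/config.gz
```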

BenTheElder commented 2 years ago

This is a known limitation in the kernel: https://github.com/damentz/liquorix-package/issues/101#issuecomment-1208165488

vinibali commented 1 year ago

Is cgroups v2 a hard requirement? Could we create the kind cluster with cgroups v1?

BenTheElder commented 1 year ago

It isn't yet; some parts of the ecosystem may eventually drop cgroups v1 support.

Cgroups of either v1 or v2 are.
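
For reference, a quick (kind-agnostic) way to see which hierarchy a host is running:

```bash
stat -fc %T /sys/fs/cgroup
# "cgroup2fs" means a unified cgroup v2 hierarchy; "tmpfs" means the legacy cgroup v1 layout
```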

vinibali commented 1 year ago

Does that mean systems which support only cgroups v1 can use kind?

aojea commented 1 year ago

> Does that mean systems which support only cgroups v1 can use kind?

Yes, but the system has to support all the cgroup "groups" (controllers) required by the kubelet; some hardware, like old Raspberry Pi models, doesn't support them all by default, for example.
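
As a concrete illustration of that (the paths and flags below are the commonly documented ones for Raspberry Pi OS and may differ on other images), the memory cgroup there has to be enabled on the kernel command line before kubelet-based tooling will work:

```bash
# See what the booted kernel was started with
cat /proc/cmdline

# Append the cgroup flags to the single line in /boot/cmdline.txt, then reboot
sudo sed -i '$ s/$/ cgroup_enable=memory cgroup_memory=1/' /boot/cmdline.txt
sudo reboot
```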

damentz commented 10 months ago

This issue is now resolved with Liquorix by implementing the missing stubs that kind and many other container orchestration platforms blindly depend on.

Relevant links: