kubernetes-sigs / kind

Kubernetes IN Docker - local clusters for testing Kubernetes
https://kind.sigs.k8s.io/
Apache License 2.0

cluster fails to start if multiple control plane nodes are added. #3680

Open terryjix opened 4 months ago

terryjix commented 4 months ago

What happened: the cluster fails to create if I add multiple control plane nodes to the cluster

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
- role: worker
- role: worker

Error logs

{"level":"warn","ts":"2024-07-11T08:50:49.476205Z","logger":"etcd-client","caller":"v3@v3.5.10/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00062ee00/172.18.0.5:2379","attempt":0,"error":"rpc error: code = FailedPrecondition desc = etcdserver: can only promote a learner member which is in sync with leader"}
I0711 08:50:49.476269     249 etcd.go:550] [etcd] Promoting the learner 86e5aab36dbb6fb7 failed: etcdserver: can only promote a learner member which is in sync with leader
etcdserver: can only promote a learner member which is in sync with leader
error creating local etcd static pod manifest file
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join.runEtcdPhase
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join/controlplanejoin.go:156
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:259
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:446
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:232
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdJoin.func1
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/join.go:183
github.com/spf13/cobra.(*Command).execute
        github.com/spf13/cobra@v1.7.0/command.go:940
github.com/spf13/cobra.(*Command).ExecuteC
        github.com/spf13/cobra@v1.7.0/command.go:1068
github.com/spf13/cobra.(*Command).Execute
        github.com/spf13/cobra@v1.7.0/command.go:992
k8s.io/kubernetes/cmd/kubeadm/app.Run
        k8s.io/kubernetes/cmd/kubeadm/app/kubeadm.go:52
main.main
        k8s.io/kubernetes/cmd/kubeadm/kubeadm.go:25
runtime.main
        runtime/proc.go:271
runtime.goexit
        runtime/asm_amd64.s:1695
error execution phase control-plane-join/etcd
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:260
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:446
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:232
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdJoin.func1
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/join.go:183
github.com/spf13/cobra.(*Command).execute
        github.com/spf13/cobra@v1.7.0/command.go:940
github.com/spf13/cobra.(*Command).ExecuteC
        github.com/spf13/cobra@v1.7.0/command.go:1068
github.com/spf13/cobra.(*Command).Execute
        github.com/spf13/cobra@v1.7.0/command.go:992
k8s.io/kubernetes/cmd/kubeadm/app.Run
        k8s.io/kubernetes/cmd/kubeadm/app/kubeadm.go:52
main.main
        k8s.io/kubernetes/cmd/kubeadm/kubeadm.go:25
runtime.main
        runtime/proc.go:271
runtime.goexit
        runtime/asm_amd64.s:1695

What you expected to happen: kind supports creating a Kubernetes cluster with multiple control plane nodes.

How to reproduce it (as minimally and precisely as possible): use the following configuration to launch a cluster

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
- role: worker
- role: worker
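
Assuming the config above is saved as kind-config.yaml (the filename is only an example), the cluster is created with:

kind create cluster --config kind-config.yaml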

Anything else we need to know?:

Environment:

Server:
 Containers: 21
  Running: 0
  Paused: 0
  Stopped: 21
 Images: 78
 Server Version: 25.0.3
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 64b8a811b07ba6288238eefc14d898ee0b5b99ba
 runc version: 4bccb38cc9cf198d52bebf2b3a90cd14e7af8c06
 init version: de40ad0
 Security Options:
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.1.94-99.176.amzn2023.x86_64
 Operating System: Amazon Linux 2023.5.20240701
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 7.629GiB
 Name: ip-172-31-18-230.eu-west-1.compute.internal
 ID: c3b0373c-7367-45d1-8e7b-12a0ff695616
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  binglj.people.aws.dev:443
  127.0.0.0/8
 Live Restore Enabled: false

neolit123 commented 4 months ago

I0711 08:50:49.476269 249 etcd.go:550] [etcd] Promoting the learner 86e5aab36dbb6fb7 failed: etcdserver: can only promote a learner member which is in sync with leader etcdserver: can only promote a learner member which is in sync with leader error creating local etcd static pod manifest file

@pacoxu didn't we wait for sync to happen before promote?

terryjix commented 4 months ago

I added the following arguments to the configuration file

  kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    featureGates:
      EtcdLearnerMode: false
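
For context, here is a sketch of how that patch slots into the full multi-node config from above (disabling the EtcdLearnerMode feature gate is only an experiment here, not a recommended setting):

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
- role: worker
- role: worker
kubeadmConfigPatches:
- |
  kind: ClusterConfiguration
  featureGates:
    EtcdLearnerMode: false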

the kubelet fails to start with another error message

Jul 11 09:46:25 k8s-playground-worker kubelet[379]: I0711 09:46:25.281034     379 factory.go:221] Registration of the systemd container factory successfully
Jul 11 09:46:25 k8s-playground-worker kubelet[379]: I0711 09:46:25.281297     379 factory.go:219] Registration of the crio container factory failed: Get "http://%2Fvar%2Frun%2Fcrio%2Fcrio.sock/info": dial unix /var/run/crio/crio.sock: connect: no such file or directory
Jul 11 09:46:25 k8s-playground-worker kubelet[379]: I0711 09:46:25.290080     379 factory.go:221] Registration of the containerd container factory successfully
Jul 11 09:46:25 k8s-playground-worker kubelet[379]: E0711 09:46:25.290422     379 manager.go:294] Registration of the raw container factory failed: inotify_init: too many open files
Jul 11 09:46:25 k8s-playground-worker kubelet[379]: E0711 09:46:25.290542     379 kubelet.go:1530] "Failed to start cAdvisor" err="inotify_init: too many open files"

neolit123 commented 4 months ago

"Failed to start cAdvisor" err="inotify_init: too many open files"

maybe an ulimit problem: https://github.com/kubernetes-sigs/kind/issues/2744#issuecomment-1127808069

terryjix commented 4 months ago

No ulimit issue if I only add one control-plane node to the cluster. Trying to find a way to update the sysctl configuration.

BenTheElder commented 4 months ago

inotify: https://kind.sigs.k8s.io/docs/user/known-issues/#pod-errors-due-to-too-many-open-files
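
Per that known-issues page, the inotify limits can be raised on the host roughly like this (values are the ones suggested in the doc; persist them via /etc/sysctl.conf or a sysctl.d drop-in if needed):

sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512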

BenTheElder commented 4 months ago

This is a lot of nodes, do you need them? For what purpose?

Most development should prefer single-node clusters. Each node consumes resources from the host, and unlike a "real" cluster, adding more nodes does not actually add more resources (it only appears to). You are almost certainly hitting resource limits on the host (see the known-issues doc re: inotify above, though this may not be the only limit you're hitting).