kubesphere / kubekey

Installs Kubernetes/K3s only, or both Kubernetes/K3s and KubeSphere, plus related cloud-native add-ons; supports all-in-one, multi-node, and HA deployments 🔥 ⎈ 🐳
https://kubesphere.io
Apache License 2.0

HA install mode fails on versions of K8s 1.29 and above #2375

Open lbrigman124 opened 2 weeks ago

lbrigman124 commented 2 weeks ago

What version of KubeKey has the issue?

v3.1.5

What is your OS environment?

Rocky 9.3

KubeKey config file

apiVersion: kubekey.kubesphere.io/v1alpha2
kind: Cluster
metadata:
  name: sample
spec:
  hosts:
  - {name: node1, address: 10.109.182.10, internalAddress: 10.109.182.10, user: mdc, privateKeyPath: ~/.ssh/id_rsa }
  - {name: node2, address: 10.109.182.11, internalAddress: 10.109.182.11, user: mdc, privateKeyPath: ~/.ssh/id_rsa}
  - {name: node3, address: 10.109.182.12, internalAddress: 10.109.182.12, user: mdc, privateKeyPath: ~/.ssh/id_rsa}
  roleGroups:
    etcd:
    - node1
    - node2
    - node3
    control-plane:
    - node1
    - node2
    - node3
    worker:
    - node1
    - node2
    - node3
  controlPlaneEndpoint:
    ## Internal loadbalancer for apiservers
    internalLoadbalancer: kube-vip
    domain: lbgsm9.lab.c-cor.com
    address: "10.109.180.9"
    port: 6443
  kubernetes:
    version: v1.29.7
    clusterName: cluster.local
    autoRenewCerts: true
    containerManager: containerd
  etcd:
    type: kubekey
  network:
    plugin: calico
    kubePodsCIDR: 10.233.64.0/18
    kubeServiceCIDR: 10.233.0.0/18
    ## multus support. https://github.com/k8snetworkplumbingwg/multus-cni
    multusCNI:
      enabled: false
  registry:
    privateRegistry: ""
    namespaceOverride: ""
    registryMirrors: []
    insecureRegistries: []
  addons: []

A clear and concise description of what happened.

Running `./kk create cluster -f config.yaml` fails to create a cluster when the config file is set up for kube-vip (HA) mode. The kubelet initializes, but before setup can proceed far enough to complete, it hits an error on the kubelet. KubeKey goes through its retries but fails each time. It is probably not an issue with the kubelet config: by this point the program is retrying and cannot connect to etcd. The etcd service on node1 is in a crash loop and won't recover; node2 and node3 have no etcd issues, only node1.
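For reference, the triage path looks roughly like the sketch below. This is not KubeKey tooling: the systemd unit names and the containerd socket path are assumptions based on a default install, and each command is guarded with `|| true` so the list is safe to paste on any node:

```shell
# Triage sketch: walk down the stack from kubelet to etcd on node1.
# Assumed: systemd units named "kubelet" and "etcd", default containerd socket.
triage_node() {
  # The kubelet starts and exits cleanly, so check its logs first...
  systemctl status kubelet --no-pager || true
  journalctl -u kubelet --no-pager -n 50 || true

  # ...then the layer it depends on: etcd, which is crash-looping here.
  systemctl status etcd --no-pager || true
  journalctl -u etcd --no-pager -n 50 || true

  # If etcd is down, the apiserver container should be failing as well.
  crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a 2>/dev/null \
    | grep kube-apiserver || true

  echo "triage commands finished"
}
triage_node
```

That is how the kubeadm timeout was traced past the kubelet (which exits cleanly) to the etcd crash loop on node1.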

Relevant log output

kubekey log in debug:
13:49:07 PDT [InitKubernetesModule] Init cluster using kubeadm
13:53:44 PDT command: [node1]
sudo -E /bin/bash -c "/usr/local/bin/kubeadm init --config=/etc/kubernetes/kubeadm-config.yaml --ignore-preflight-errors=FileExisting-crictl,ImagePull"
13:53:44 PDT stdout: [node1]
W0826 13:49:06.282921   15482 utils.go:69] The recommended value for "clusterDNS" in "KubeletConfiguration" is: [10.233.0.10]; the provided value is: [169.254.25.10]
[init] Using Kubernetes version: v1.29.7
[preflight] Running pre-flight checks
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "ca" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local lbgsm9.lab.c-cor.com localhost node1 node1.cluster.local node2 node2.cluster.local node3 node3.cluster.local] and IPs [10.233.0.1 10.109.182.10 127.0.0.1 10.109.180.9 10.109.182.11 10.109.182.12]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "front-proxy-ca" certificate and key
[certs] Generating "front-proxy-client" certificate and key
[certs] External etcd mode: Skipping etcd/ca certificate authority generation
[certs] External etcd mode: Skipping etcd/server certificate generation
[certs] External etcd mode: Skipping etcd/peer certificate generation
[certs] External etcd mode: Skipping etcd/healthcheck-client certificate generation
[certs] External etcd mode: Skipping apiserver-etcd-client certificate generation
[certs] Generating "sa" key and public key
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "super-admin.conf" kubeconfig file
[kubeconfig] Writing "kubelet.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.

Unfortunately, an error has occurred:
        timed out waiting for the condition

This error is likely caused by:
        - The kubelet is not running
        - The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
        - 'systemctl status kubelet'
        - 'journalctl -xeu kubelet'

Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI.
Here is one example how you may list all running Kubernetes containers by using crictl:
        - 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a | grep kube | grep -v pause'
        Once you have found the failing container, you can inspect its logs with:
        - 'crictl --runtime-endpoint unix:///run/containerd/containerd.sock logs CONTAINERID'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
To see the stack trace of this error execute with --v=5 or higher
13:53:44 PDT stderr: [node1]
Failed to exec command: sudo -E /bin/bash -c "/usr/local/bin/kubeadm init --config=/etc/kubernetes/kubeadm-config.yaml --ignore-preflight-errors=FileExisting-crictl,ImagePull"
W0826 13:49:06.282921   15482 utils.go:69] The recommended value for "clusterDNS" in "KubeletConfiguration" is: [10.233.0.10]; the provided value is: [169.254.25.10]

etcd systemd status:
[root@node1 ~]# systemctl status etcd
● etcd.service - etcd
     Loaded: loaded (/etc/systemd/system/etcd.service; enabled; preset: disabled)
     Active: activating (auto-restart) (Result: exit-code) since Mon 2024-08-26 14:01:49 PDT; 4s ago
    Process: 17726 ExecStart=/usr/local/bin/etcd (code=exited, status=2)
   Main PID: 17726 (code=exited, status=2)
        CPU: 54ms

journalctl -u etcd --no-pager
Aug 26 14:06:15 node1 etcd[18183]: {"level":"warn","ts":"2024-08-26T14:06:15.753167-0700","caller":"rafthttp/http.go:413","msg":"failed to find remote peer in cluster","local-member-id":"5e0742d986bf1110","remote-peer-id-stream-handler":"5e0742d986bf1110","remote-peer-id-from":"5fd1c4ab001a31ce","cluster-id":"600f1da756da65c0"}
Aug 26 14:06:15 node1 etcd[18183]: {"level":"warn","ts":"2024-08-26T14:06:15.75359-0700","caller":"rafthttp/http.go:413","msg":"failed to find remote peer in cluster","local-member-id":"5e0742d986bf1110","remote-peer-id-stream-handler":"5e0742d986bf1110","remote-peer-id-from":"51a69158fcbf5a1e","cluster-id":"600f1da756da65c0"}
Aug 26 14:06:15 node1 etcd[18183]: {"level":"warn","ts":"2024-08-26T14:06:15.756037-0700","caller":"rafthttp/http.go:413","msg":"failed to find remote peer in cluster","local-member-id":"5e0742d986bf1110","remote-peer-id-stream-handler":"5e0742d986bf1110","remote-peer-id-from":"5fd1c4ab001a31ce","cluster-id":"600f1da756da65c0"}
Aug 26 14:06:15 node1 etcd[18183]: {"level":"info","ts":"2024-08-26T14:06:15.813841-0700","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"5e0742d986bf1110 [term: 0] received a MsgHeartbeat message with higher term from 5fd1c4ab001a31ce [term: 3]"}
Aug 26 14:06:15 node1 etcd[18183]: {"level":"info","ts":"2024-08-26T14:06:15.813877-0700","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"5e0742d986bf1110 became follower at term 3"}
Aug 26 14:06:15 node1 etcd[18183]: {"level":"panic","ts":"2024-08-26T14:06:15.813888-0700","logger":"raft","caller":"etcdserver/zap_raft.go:101","msg":"tocommit(989) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?","stacktrace":"go.etcd.io/etcd/server/v3/etcdserver.(*zapRaftLogger).Panicf\n\tgo.etcd.io/etcd/server/v3/etcdserver/zap_raft.go:101\ngo.etcd.io/etcd/raft/v3.(*raftLog).commitTo\n\tgo.etcd.io/etcd/raft/v3@v3.5.13/log.go:237\ngo.etcd.io/etcd/raft/v3.(*raft).handleHeartbeat\n\tgo.etcd.io/etcd/raft/v3@v3.5.13/raft.go:1508\ngo.etcd.io/etcd/raft/v3.stepFollower\n\tgo.etcd.io/etcd/raft/v3@v3.5.13/raft.go:1434\ngo.etcd.io/etcd/raft/v3.(*raft).Step\n\tgo.etcd.io/etcd/raft/v3@v3.5.13/raft.go:975\ngo.etcd.io/etcd/raft/v3.(*node).run\n\tgo.etcd.io/etcd/raft/v3@v3.5.13/node.go:356"}
Aug 26 14:06:15 node1 etcd[18183]: panic: tocommit(989) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
Aug 26 14:06:15 node1 etcd[18183]: goroutine 148 [running]:
Aug 26 14:06:15 node1 etcd[18183]: go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc000364180, {0x0, 0x0, 0x0})
Aug 26 14:06:15 node1 etcd[18183]:         go.uber.org/zap@v1.17.0/zapcore/entry.go:234 +0x494
Aug 26 14:06:15 node1 etcd[18183]: go.uber.org/zap.(*SugaredLogger).log(0xc0000a8868, 0x4, {0x110e97b?, 0x40e35b?}, {0xc0003b7080?, 0xee9980?, 0xc000168d00?}, {0x0, 0x0, 0x0})
Aug 26 14:06:15 node1 etcd[18183]:         go.uber.org/zap@v1.17.0/sugar.go:227 +0xec
Aug 26 14:06:15 node1 etcd[18183]: go.uber.org/zap.(*SugaredLogger).Panicf(...)
Aug 26 14:06:15 node1 etcd[18183]:         go.uber.org/zap@v1.17.0/sugar.go:159
Aug 26 14:06:15 node1 etcd[18183]: go.etcd.io/etcd/server/v3/etcdserver.(*zapRaftLogger).Panicf(0x3dd?, {0x110e97b?, 0xc00028b840?}, {0xc0003b7080?, 0x55fb4c?, 0x47c4d8?})
Aug 26 14:06:15 node1 etcd[18183]:         go.etcd.io/etcd/server/v3/etcdserver/zap_raft.go:101 +0x45
Aug 26 14:06:15 node1 etcd[18183]: go.etcd.io/etcd/raft/v3.(*raftLog).commitTo(0xc0003a8770, 0x3dd)
Aug 26 14:06:15 node1 etcd[18183]:         go.etcd.io/etcd/raft/v3@v3.5.13/log.go:237 +0xf3
Aug 26 14:06:15 node1 etcd[18183]: go.etcd.io/etcd/raft/v3.(*raft).handleHeartbeat(_, {0x8, 0x5e0742d986bf1110, 0x5fd1c4ab001a31ce, 0x3, 0x0, 0x0, {0x0, 0x0, 0x0}, ...})
Aug 26 14:06:15 node1 etcd[18183]:         go.etcd.io/etcd/raft/v3@v3.5.13/raft.go:1508 +0x39
Aug 26 14:06:15 node1 etcd[18183]: go.etcd.io/etcd/raft/v3.stepFollower(_, {0x8, 0x5e0742d986bf1110, 0x5fd1c4ab001a31ce, 0x3, 0x0, 0x0, {0x0, 0x0, 0x0}, ...})
Aug 26 14:06:15 node1 etcd[18183]:         go.etcd.io/etcd/raft/v3@v3.5.13/raft.go:1434 +0x3b8
Aug 26 14:06:15 node1 etcd[18183]: go.etcd.io/etcd/raft/v3.(*raft).Step(_, {0x8, 0x5e0742d986bf1110, 0x5fd1c4ab001a31ce, 0x3, 0x0, 0x0, {0x0, 0x0, 0x0}, ...})
Aug 26 14:06:15 node1 etcd[18183]:         go.etcd.io/etcd/raft/v3@v3.5.13/raft.go:975 +0x12f5
Aug 26 14:06:15 node1 etcd[18183]: go.etcd.io/etcd/raft/v3.(*node).run(0xc000337740)
Aug 26 14:06:15 node1 etcd[18183]:         go.etcd.io/etcd/raft/v3@v3.5.13/node.go:356 +0x925
Aug 26 14:06:15 node1 etcd[18183]: created by go.etcd.io/etcd/raft/v3.RestartNode in goroutine 1
Aug 26 14:06:15 node1 etcd[18183]:         go.etcd.io/etcd/raft/v3@v3.5.13/node.go:244 +0x24f
Aug 26 14:06:15 node1 systemd[1]: etcd.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Aug 26 14:06:15 node1 systemd[1]: etcd.service: Failed with result 'exit-code'.
Aug 26 14:06:15 node1 systemd[1]: Failed to start etcd.
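The raft panic above ("tocommit(989) is out of range [lastIndex(0)]") indicates node1's etcd came up with an empty raft log while node2/node3 still track it as a member with history at term 3. The generic recovery for such a member is to remove and re-add it. The sketch below only prints the steps rather than executing them, since member removal is destructive; the endpoints, ports, and data dir are assumptions based on the config above:

```shell
# Recovery sketch for a member whose raft log is behind the cluster.
# Assumed values (verify against the etcd environment file on a real install):
HEALTHY_EP="https://10.109.182.11:2379"  # node2's client endpoint
PEER_URL="https://10.109.182.10:2380"    # node1's peer URL
DATA_DIR="/var/lib/etcd"                 # etcd data dir

# Build the plan as text instead of running it: these steps are destructive.
# (TLS flags omitted for brevity; a TLS cluster needs --cacert/--cert/--key.)
recovery_plan=$(cat <<EOF
# 1. On node1: stop the crash-looping member
systemctl stop etcd

# 2. From node2: drop node1 from the member list, then re-add it
etcdctl --endpoints=${HEALTHY_EP} member list
etcdctl --endpoints=${HEALTHY_EP} member remove <node1-member-id>
etcdctl --endpoints=${HEALTHY_EP} member add node1 --peer-urls=${PEER_URL}

# 3. On node1: move the stale data dir aside, set
#    ETCD_INITIAL_CLUSTER_STATE=existing, then restart
mv ${DATA_DIR} ${DATA_DIR}.bak
systemctl start etcd
EOF
)
printf '%s\n' "$recovery_plan"
```

This would only be a workaround for the symptom; the open question for the issue is why KubeKey leaves node1's etcd in this state on 1.29+ but not on 1.28.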

Kubelet service status:
[root@node1 log]# systemctl status kubelet
○ kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; preset: disabled)
    Drop-In: /etc/systemd/system/kubelet.service.d
             └─10-kubeadm.conf
     Active: inactive (dead) since Mon 2024-08-26 14:02:55 PDT; 34min ago
   Duration: 4min 21.135s
       Docs: http://kubernetes.io/docs/
    Process: 17271 ExecStart=/usr/local/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBE>
   Main PID: 17271 (code=exited, status=0/SUCCESS)
        CPU: 2.749s

Aug 26 14:02:52 node1 kubelet[17271]: E0826 14:02:52.667322   17271 reflector.go:147] k8s.io/client-go@v0.0.0/tools/cache/ref>
Aug 26 14:02:52 node1 kubelet[17271]: E0826 14:02:52.666962   17271 event.go:355] "Unable to write event (may retry after sle>
Aug 26 14:02:53 node1 kubelet[17271]: I0826 14:02:53.524474   17271 kubelet_node_status.go:73] "Attempting to register node" >
Aug 26 14:02:55 node1 kubelet[17271]: E0826 14:02:55.028079   17271 eviction_manager.go:282] "Eviction manager: failed to get>
Aug 26 14:02:55 node1 kubelet[17271]: E0826 14:02:55.739839   17271 controller.go:145] "Failed to ensure lease exists, will r>
Aug 26 14:02:55 node1 kubelet[17271]: E0826 14:02:55.739830   17271 kubelet_node_status.go:96] "Unable to register node with >
Aug 26 14:02:55 node1 systemd[1]: Stopping kubelet: The Kubernetes Node Agent...
Aug 26 14:02:55 node1 systemd[1]: kubelet.service: Deactivated successfully.
Aug 26 14:02:55 node1 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Aug 26 14:02:55 node1 systemd[1]: kubelet.service: Consumed 2.749s CPU time.

Additional information

Other versions of KubeKey do the same thing; I have seen this back to version 3.1.1. The etcd version is 3.5.13, which is the same version KubeKey uses to install Kubernetes 1.28.8 successfully. The nodes are in time sync.

lbrigman124 commented 2 weeks ago

These machines are VMs, and I can recreate them from scratch to try different things. The failure occurs with any Kubernetes version past 1.28: 1.29.7 (as listed in the config file above), 1.30.4, and 1.31.0 all fail in the same way.