siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Unable to bootstrap the cluster #6875

Closed Ton618-dz closed 1 year ago

Ton618-dz commented 1 year ago

Bug Report

Description

kube-controller-manager is not running, probably due to a certificate issue. I tried Talos versions 1.3.5 and 1.3.2. As a result, no node is listed in the cluster, because the kubelet has no approved certificate to use: kube-controller-manager should auto-approve the kubelet client certificate (kubernetes.io/kube-apiserver-client-kubelet).

Logs

Kubelet:

` talosctl logs kubelet --talosconfig $TALOSCONFIG -n 192.168.0.15

192.168.0.15: {"ts":1677009327252.7014,"caller":"cache/reflector.go:140","msg":"vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch v1.RuntimeClass: failed to list v1.RuntimeClass: Unauthorized\n"}
192.168.0.15: {"ts":1677009327948.931,"caller":"lease/controller.go:146","msg":"failed to ensure lease exists, will retry in 7s, error: Unauthorized\n"}
192.168.0.15: {"ts":1677009329529.494,"caller":"kubelet/kubelet_node_status.go:70","msg":"Attempting to register node","v":0,"node":{"name":"talos-dy3-h80"}}
192.168.0.15: {"ts":1677009329531.4927,"caller":"kubelet/kubelet_node_status.go:92","msg":"Unable to register node with API server","node":{"name":"talos-dy3-h80"},"err":"Unauthorized"}
192.168.0.15: {"ts":1677009330101.581,"caller":"certificate/transport.go:112","msg":"No valid client certificate is found but the server is not responsive. A restart may be necessary to retrieve new initial credentials.","lastCertificateAvailabilityTime":1677009330101.6028,"shutdownThreshold":"5m0s"}
192.168.0.15: {"ts":1677009330275.5315,"caller":"eviction/eviction_manager.go:261","msg":"Eviction manager: failed to get summary stats","err":"failed to get node info: node \"talos-dy3-h80\" not found"}
192.168.0.15: {"ts":1677009334587.0415,"caller":"cache/reflector.go:424","msg":"vendor/k8s.io/client-go/informers/factory.go:150: failed to list v1.Node: Unauthorized\n","v":0}
192.168.0.15: {"ts":1677009334587.218,"caller":"cache/reflector.go:140","msg":"vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch v1.Node: failed to list v1.Node: Unauthorized\n"}
192.168.0.15: {"ts":1677009334950.5044,"caller":"lease/controller.go:146","msg":"failed to ensure lease exists, will retry in 7s, error: Unauthorized\n"}
192.168.0.15: {"ts":1677009336535.4844,"caller":"kubelet/kubelet_node_status.go:70","msg":"Attempting to register node","v":0,"node":{"name":"talos-dy3-h80"}}
192.168.0.15: {"ts":1677009336537.971,"caller":"kubelet/kubelet_node_status.go:92","msg":"Unable to register node with API server","node":{"name":"talos-dy3-h80"},"err":"Unauthorized"}
192.168.0.15: {"ts":1677009340102.3455,"caller":"certificate/transport.go:112","msg":"No valid client certificate is found but the server is not responsive. A restart may be necessary to retrieve new initial credentials.","lastCertificateAvailabilityTime":1677009340102.3594,"shutdownThreshold":"5m0s"}
192.168.0.15: {"ts":1677009340239.699,"caller":"cache/reflector.go:424","msg":"vendor/k8s.io/client-go/informers/factory.go:150: failed to list v1.CSIDriver: Unauthorized\n","v":0}
192.168.0.15: {"ts":1677009340239.7598,"caller":"cache/reflector.go:140","msg":"vendor/k8s.io/client-go/informers/factory.go:150: Failed to watch v1.CSIDriver: failed to list v1.CSIDriver: Unauthorized\n"}
192.168.0.15: {"ts":1677009340276.291,"caller":"eviction/eviction_manager.go:261","msg":"Eviction manager: failed to get summary stats","err":"failed to get node info: node \"talos-dy3-h80\" not found"}
192.168.0.15: {"ts":1677009341952.3374,"caller":"lease/controller.go:146","msg":"failed to ensure lease exists, will retry in 7s, error: Unauthorized\n"}
192.168.0.15: {"ts":1677009343539.3875,"caller":"kubelet/kubelet_node_status.go:70","msg":"Attempting to register node","v":0,"node":{"name":"talos-dy3-h80"}}
192.168.0.15: {"ts":1677009343541.0176,"caller":"kubelet/kubelet_node_status.go:92","msg":"Unable to register node with API server","node":{"name":"talos-dy3-h80"},"err":"Unauthorized"}
192.168.0.15: {"ts":1677009348955.2104,"caller":"lease/controller.go:146","msg":"failed to ensure lease exists, will retry in 7s, error: Unauthorized\n"}
192.168.0.15: {"ts":1677009350102.534,"caller":"certificate/transport.go:112","msg":"No valid client certificate is found but the server is not responsive. A restart may be necessary to retrieve new initial credentials.","lastCertificateAvailabilityTime":1677009350102.5476,"shutdownThreshold":"5m0s"}
192.168.0.15: {"ts":1677009350276.8328,"caller":"eviction/eviction_manager.go:261","msg":"Eviction manager: failed to get summary stats","err":"failed to get node info: node \"talos-dy3-h80\" not found"}
192.168.0.15: {"ts":1677009350543.6958,"caller":"kubelet/kubelet_node_status.go:70","msg":"Attempting to register node","v":0,"node":{"name":"talos-dy3-h80"}}
192.168.0.15: {"ts":1677009350545.9856,"caller":"kubelet/kubelet_node_status.go:92","msg":"Unable to register node with API server","node":{"name":"talos-dy3-h80"},"err":"Unauthorized"}
`

controller-runtime:

` talosctl logs controller-runtime --talosconfig $TALOSCONFIG

192.168.0.15: 2023-02-21T19:57:21.304Z ERROR kubernetes endpoint watch error {"component": "controller-runtime", "controller": "k8s.EndpointController", "error": "failed to list *v1.Endpoints: Get \"https://k8slab.openmind.net:6443/api/v1/namespaces/default/endpoints?fieldSelector=metadata.name%3Dkubernetes&limit=500&resourceVersion=0\": dial tcp 64.190.63.111:6443: i/o timeout"}
192.168.0.15: 2023-02-21T19:57:22.558Z DEBUG controller starting {"component": "controller-runtime", "controller": "k8s.NodeLabelsApplyController"}
192.168.0.15: 2023-02-21T19:57:23.273Z DEBUG NTP response {"component": "controller-runtime", "controller": "time.SyncController", "clock_offset": "46.029077ms", "rtt": "5.613666ms", "leap": 0, "stratum": 2, "precision": "59ns", "root_delay": "77.545166ms", "root_dispersion": "35.644531ms", "root_distance": "77.223947ms"}
192.168.0.15: 2023-02-21T19:57:23.273Z DEBUG sample stats {"component": "controller-runtime", "controller": "time.SyncController", "jitter": "17.397355ms", "poll_interval": "1m4s", "spike": false}
192.168.0.15: 2023-02-21T19:57:23.273Z DEBUG adjusting time (slew) by 46.029077ms via 216.6.2.70, state TIME_OK, status STA_PLL | STA_NANO {"component": "controller-runtime", "controller": "time.SyncController"}
192.168.0.15: 2023-02-21T19:57:23.273Z DEBUG adjtime state {"component": "controller-runtime", "controller": "time.SyncController", "constant": 3, "offset": "46.029076ms", "freq_offset": 5891721, "freq_offset_ppm": 89}
192.168.0.15: 2023-02-21T19:57:24.260Z DEBUG updated controlplane endpoints {"component": "controller-runtime", "controller": "k8s.EndpointController", "endpoints": ["192.168.0.15"]}
192.168.0.15: 2023-02-21T19:57:24.273Z ERROR controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}
192.168.0.15: 2023-02-21T19:57:24.295Z DEBUG restarting controller in 1.038807285s {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController"}
192.168.0.15: 2023-02-21T19:57:24.346Z ERROR controller failed {"component": "controller-runtime", "controller": "k8s.NodeLabelsApplyController", "error": "1 error(s) occurred:\n\terror getting node: nodes \"talos-dy3-h80\" not found"}
192.168.0.15: 2023-02-21T19:57:24.364Z DEBUG restarting controller in 21.616111842s {"component": "controller-runtime", "controller": "k8s.NodeLabelsApplyController"}
192.168.0.15: 2023-02-21T19:57:25.079Z DEBUG controller starting {"component": "controller-runtime", "controller": "k8s.ManifestApplyController"}
192.168.0.15: 2023-02-21T19:57:25.097Z DEBUG waiting for mutex {"component": "controller-runtime", "controller": "k8s.ManifestApplyController", "key": "talos:v1:manifestApplyMutex"}
192.168.0.15: 2023-02-21T19:57:25.105Z DEBUG mutex acquired {"component": "controller-runtime", "controller": "k8s.ManifestApplyController", "key": "talos:v1:manifestApplyMutex"}
192.168.0.15: 2023-02-21T19:57:25.334Z DEBUG controller starting {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController"}
192.168.0.15: 2023-02-21T19:57:40.339Z ERROR controller failed {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController", "error": "error refreshing pod status: error fetching pod status: an error on the server (\"Authorization error (user=apiserver-kubelet-client, verb=get, resource=nodes, subresource=proxy)\") has prevented the request from succeeding"}
192.168.0.15: 2023-02-21T19:57:40.363Z DEBUG restarting controller in 755.600581ms {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController"}
192.168.0.15: 2023-02-21T19:57:41.118Z DEBUG controller starting {"component": "controller-runtime", "controller": "k8s.KubeletStaticPodController"}
192.168.0.15: 2023-02-21T19:57:45.981Z DEBUG controller starting {"component": "controller-runtime", "controller": "k8s.NodeLabelsApplyController"}
192.168.0.15: 2023-02-21T19:57:45.991Z ERROR controller failed {"component": "controller-runtime", "controller": "k8s.NodeLabelsApplyController", "error": "1 error(s) occurred:\n\terror getting node: nodes \"talos-dy3-h80\" not found"}
192.168.0.15: 2023-02-21T19:57:46.001Z DEBUG restarting controller in 20.587502495s {"component": "controller-runtime", "controller": "k8s.NodeLabelsApplyController"}
`

Environment

` alpine:~/k8s$ talosctl version --talosconfig $TALOSCONFIG -n 192.168.0.15
Client:
    Tag:         v1.3.5
    SHA:         03edf8c1
    Built:
    Go version:  go1.19.6
    OS/Arch:     linux/amd64
Server:
    NODE:        192.168.0.15
    Tag:         v1.3.5
    SHA:         03edf8c1
    Built:
    Go version:  go1.19.6
    OS/Arch:     linux/amd64
    Enabled:     RBAC
`

Script used:

` K8s_HOME_USER=$HOME/k8s
k8s_ClusterName=k8s-home
K8s_CONFIG=$K8s_HOME_USER/k8s-config/$k8s_ClusterName
k8s_tools=$K8s_HOME_USER/talos_tools

# Generated config files, node addresses, and versions
CONTROLPLANCONFIG=$K8s_CONFIG/controlplane.yaml
WORKERCONFIG=$K8s_CONFIG/worker.yaml
KUBECONFIG=$K8s_CONFIG/Kubeconfig_$k8s_ClusterName
CONTROL_PLANE_IP=192.168.0.15
END_POINT_IP=k8slab.openmind.net
WORKER_IP=192.168.0.16
Talos_version=1.3.5
K8s_version=1.26.1

export TALOSCONFIG=$K8s_CONFIG/talosconfig

mkdir -p $k8s_tools
cd $k8s_tools

# Download and install talosctl
curl https://github.com/siderolabs/talos/releases/download/v$Talos_version/talosctl-linux-amd64 -L -o talosctl
sudo cp talosctl /usr/local/bin
sudo chmod +x /usr/local/bin/talosctl

# Download the Talos ISO
curl https://github.com/siderolabs/talos/releases/download/v$Talos_version/talos-amd64.iso -L -o talos-amd64-v$Talos_version.iso

# Start from a clean config directory
cd $K8s_HOME_USER
rm -rf $K8s_CONFIG

# Generate machine configs, point talosctl at the cluster, apply, and bootstrap
talosctl gen config $k8s_ClusterName https://$END_POINT_IP:6443 --output-dir $K8s_CONFIG --kubernetes-version $K8s_version --install-disk "/dev/vda" --additional-sans $CONTROL_PLANE_IP

talosctl config endpoint --talosconfig $TALOSCONFIG $END_POINT_IP

talosctl config node --talosconfig $TALOSCONFIG $CONTROL_PLANE_IP

talosctl apply-config --insecure --nodes $CONTROL_PLANE_IP --file $CONTROLPLANCONFIG

talosctl --talosconfig $TALOSCONFIG bootstrap -n $CONTROL_PLANE_IP
`

CSR certificates:

alpine:~/k8s$ kubectl get csr --kubeconfig $KUBECONFIG
NAME        AGE     SIGNERNAME                                    REQUESTOR                 REQUESTEDDURATION   CONDITION
csr-4cxxq   80m     kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:zmgozn   <none>              Pending
csr-7fmth   64m     kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:zmgozn   <none>              Pending
csr-8z85f   18m     kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:zmgozn   <none>              Pending
csr-bxsjl   135m    kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:zmgozn   <none>              Pending
csr-c56xp   104m    kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:zmgozn   <none>              Pending
csr-kjh52   119m    kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:zmgozn   <none>              Pending
csr-mbtvq   150m    kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:zmgozn   <none>              Pending
csr-p8s96   49m     kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:zmgozn   <none>              Pending
csr-wz75z   33m     kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:zmgozn   <none>              Pending
csr-z5gr2   88m     kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:zmgozn   <none>              Pending
csr-zfqs8   9m21s   kubernetes.io/kube-apiserver-client-kubelet   system:bootstrap:zmgozn   <none>              Pending
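The pending requests in a listing like this can be pulled out with a small filter. This is only a sketch, assuming the default column layout of `kubectl get csr` shown here (NAME first, CONDITION last); the function name `pending_csrs` is hypothetical and not part of kubectl or talosctl.

```shell
# Hypothetical helper: reads `kubectl get csr` output on stdin and prints
# the names of requests whose last column (CONDITION) is exactly "Pending".
pending_csrs() {
  # NR > 1 skips the header row; $NF is the last whitespace-separated field.
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Example use against a live cluster (needs a working kubeconfig):
#   kubectl get csr --kubeconfig "$KUBECONFIG" | pending_csrs
```

Each printed name can then be inspected with `kubectl describe csr <name>`.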

Static Pods:

alpine:~/k8s$ talosctl get staticpods --talosconfig $TALOSCONFIG
NODE           NAMESPACE   TYPE        ID                        VERSION
192.168.0.15   k8s         StaticPod   kube-apiserver            1
192.168.0.15   k8s         StaticPod   kube-controller-manager   1
192.168.0.15   k8s         StaticPod   kube-scheduler            1

Services:

alpine:~/k8s$ talosctl get services --talosconfig $TALOSCONFIG
NODE           NAMESPACE   TYPE      ID           VERSION   RUNNING   HEALTHY   HEALTH UNKNOWN
192.168.0.15   runtime     Service   apid         2         true      true      false
192.168.0.15   runtime     Service   containerd   3         true      true      false
192.168.0.15   runtime     Service   cri          2         true      true      false
192.168.0.15   runtime     Service   etcd         2         true      true      false
192.168.0.15   runtime     Service   kubelet      2         true      true      false
192.168.0.15   runtime     Service   machined     2         true      true      false
192.168.0.15   runtime     Service   trustd       2         true      true      false
192.168.0.15   runtime     Service   udevd        2         true      true      false

Containers:

alpine:~/k8s$ talosctl containers -k --talosconfig $TALOSCONFIG
NODE           NAMESPACE   ID                                                           IMAGE                                     PID    STATUS
192.168.0.15   k8s.io      kube-system/kube-apiserver-talos-dy3-h80                     registry.k8s.io/pause:3.6                 2440   SANDBOX_READY
192.168.0.15   k8s.io      └─ kube-system/kube-apiserver-talos-dy3-h80:kube-apiserver   registry.k8s.io/kube-apiserver:v1.26.1    2506   CONTAINER_RUNNING
192.168.0.15   k8s.io      kube-system/kube-scheduler-talos-dy3-h80                     registry.k8s.io/pause:3.6                 2433   SANDBOX_READY
192.168.0.15   k8s.io      └─ kube-system/kube-scheduler-talos-dy3-h80:kube-scheduler   registry.k8s.io/kube-scheduler:v1.26.1    0      CONTAINER_EXITED
192.168.0.15   k8s.io      └─ kube-system/kube-scheduler-talos-dy3-h80:kube-scheduler   registry.k8s.io/kube-scheduler:v1.26.1    2591   CONTAINER_RUNNING

smira commented 1 year ago

It's very hard to read, as it's not formatted properly.

Can you run `talosctl support -n <IP1>,<IP2>,..` (for all nodes in the cluster) and attach the support.zip to the issue?

Ton618-dz commented 1 year ago

Sorry for the formatting. I tried to paste the commands and output using the code block option, but for some reason it didn't work.

Please find the support.zip file attached.

support.zip

Regards,

smira commented 1 year ago
user: warning: [2023-02-23T00:16:50.343757878Z]: [talos] WARNING: memory size is less than recommended
user: warning: [2023-02-23T00:16:50.346321878Z]: [talos] WARNING: Talos may not work properly
user: warning: [2023-02-23T00:16:50.348920878Z]: [talos] WARNING: minimum memory size is 1898 MiB
user: warning: [2023-02-23T00:16:50.351350878Z]: [talos] WARNING: recommended memory size is 3946 MiB
user: warning: [2023-02-23T00:16:50.353776878Z]: [talos] WARNING: current total memory size is 904 MiB

You don't have enough memory to run a Kubernetes controlplane node: the node reports 904 MiB in total, which is below the 1898 MiB minimum.
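The warning lines above carry the numbers needed to confirm this. Here is a minimal sketch that compares them, assuming the kernel log text is piped in on stdin (on a live node it could come from `talosctl dmesg -n <IP>`); the function name `check_talos_memory` is hypothetical, and the parsing relies on the exact warning wording shown in this thread.

```shell
# Hypothetical helper: reads Talos kernel log text on stdin and compares the
# reported total memory against the reported minimum (both in MiB).
check_talos_memory() {
  awk '
    # In both warnings the number is the second-to-last field ("... is 1898 MiB").
    /minimum memory size is/       { min = $(NF-1) }
    /current total memory size is/ { cur = $(NF-1) }
    END {
      if (min == "" || cur == "") { print "no memory warning found"; exit 0 }
      if (cur + 0 < min + 0) {
        printf "insufficient: %d MiB < %d MiB minimum\n", cur, min
        exit 1
      }
      printf "ok: %d MiB >= %d MiB minimum\n", cur, min
    }'
}

# Example use against a live node:
#   talosctl dmesg -n 192.168.0.15 | check_talos_memory
```

The function returns a non-zero status when the node is below the minimum, so it can gate a bootstrap script before `talosctl bootstrap` is attempted.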