rancher / rke2

https://docs.rke2.io/
Apache License 2.0
1.54k stars 266 forks source link

rke2-server failing to start #7011

Closed cdmx1 closed 4 days ago

cdmx1 commented 5 days ago

Environmental Info: Rancher 1.29.9 on Amazon Linux 2, using RKE2. RKE2 Version: rke2 version v1.30.4+rke2r1 (9517eea519b780e154dd791c555c698e84a0e5cd)

Node(s) CPU architecture, OS, and Version: Linux ip-30-0-1xx-1xx.ec2.internal 5.10.xxx-xxx.879.amzn2.x86_64 #1 SMP Tue Sep 24 01:40:52 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration: 1 master node / 1 worker node

Describe the bug: Rancher agent enrolment gets stuck during the setup process with messages like:

Reconciling
Waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler

Steps To Reproduce: Installed Rancher agent using the command provided in the Rancher UI for agent enrollment. curl -fL https://rancher.example.in/system-agent-install.sh | sudo sh -s - --server https://rancher.example.in --label 'cattle.io/os=linux' --token xxxxxxxxxxxx --etcd --controlplane --worker Agent enrollment gets stuck with the mentioned errors.

Expected behavior: Rancher agent should successfully enroll into the RKE2 cluster, and the pods should start normally without networking errors.

Actual behavior: Rancher agent enrollment gets stuck with the message: Reconciling Waiting for probes: kube-apiserver, kube-controller-manager, kube-scheduler

Error Log Details:

ERRO[0003] Sending HTTP 503 response to 127.0.0.1:49552: runtime core not ready
INFO[0003] Running kube-proxy ...
ERRO[0008] Kubelet exited: exit status 1
ERRO[0013] Kubelet exited: exit status 1
ERRO[0023] Kubelet exited: exit status 1
INFO[0020] Pod for etcd not synced (pod sandbox not found), retrying
{"level":"warn", "msg":"retrying of unary invoker failed", "error":"tls: failed to verify certificate: x509: certificate signed by unknown authority"}
INFO[0030] Failed to test data store connection: context deadline exceeded
ERRO[0033] Failed to check local etcd status for learner management: context deadline exceeded
brandond commented 4 days ago
ERRO[0008] Kubelet exited: exit status 1
ERRO[0013] Kubelet exited: exit status 1
ERRO[0023] Kubelet exited: exit status 1

The kubelet is crashing. This is usually caused by use of incorrect kubelet-args in the cluster configuration, but it could be due to other things. Check the kubelet log at /var/lib/rancher/rke2/agent/logs/kubelet.log to see what the errors are.