szhengli opened this issue 11 months ago
On the node that you registered, please check the logs of rancher-system-agent (i.e. journalctl -u rancher-system-agent) to see if it is having errors applying your plan. From the logs you posted, at 13:19:36 the agent received a plan secret and should be starting to attempt to apply the plan.
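In case it helps, a minimal sketch of how to watch the agent while the plan should be applying; the config path below is the system agent's default install location, so treat it as an assumption if your install differs:

# follow the agent logs live while the plan should be applying
journalctl -u rancher-system-agent -f

# confirm which Rancher URL the agent was registered against
cat /etc/rancher/agent/config.yaml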
Here is the output of the command you mentioned:
[root@m0921 ~]# journalctl -u rancher-system-agent
Dec 07 17:28:26 m0921 systemd[1]: Started Rancher System Agent.
Dec 07 17:28:26 m0921 rancher-system-agent[1887]: time="2023-12-07T17:28:26+08:00" level=info msg="Rancher System Agent version v0.3.3 (9e827a5) is starting"
Dec 07 17:28:26 m0921 rancher-system-agent[1887]: time="2023-12-07T17:28:26+08:00" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Dec 07 17:28:26 m0921 rancher-system-agent[1887]: time="2023-12-07T17:28:26+08:00" level=info msg="Starting remote watch of plans"
Dec 07 17:28:26 m0921 rancher-system-agent[1887]: E1207 17:28:26.922398 1887 memcache.go:206] couldn't get resource list for management.cattle.io/v3:
Dec 07 17:28:26 m0921 rancher-system-agent[1887]: time="2023-12-07T17:28:26+08:00" level=info msg="Starting /v1, Kind=Secret controller"
It just stops there; somehow, it never applies the plan.
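Since the last thing before the agent goes quiet is that memcache.go discovery error against management.cattle.io/v3, a basic connectivity check from the node to the Rancher server may be worth doing. A sketch, using Rancher's standard /ping and /cacerts endpoints with a placeholder URL:

# should print "pong" if the node can reach Rancher
curl -sk https://<your-rancher-url>/ping

# inspect the CA certificate Rancher serves (useful for private-CA setups)
curl -sk https://<your-rancher-url>/cacerts | openssl x509 -noout -subject -dates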
I am also experiencing the exact same issue, on Rancher 2.7.6. I'm trying to create a new RKE2 custom "dev" cluster on machines running Ubuntu 22.04 LTS. Any clue as to what is causing this?
Here are the logs from the rancher-system-agent on the machine I'm trying to provision:
Feb 09 16:26:00 kube-dev-ctrl-01 rancher-system-agent[5946]: time="2024-02-09T16:26:00Z" level=info msg="Rancher System Agent version v0.3.3 (9e827a5) is starting"
Feb 09 16:26:00 kube-dev-ctrl-01 rancher-system-agent[5946]: time="2024-02-09T16:26:00Z" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Feb 09 16:26:00 kube-dev-ctrl-01 rancher-system-agent[5946]: time="2024-02-09T16:26:00Z" level=info msg="Starting remote watch of plans"
Feb 09 16:26:01 kube-dev-ctrl-01 rancher-system-agent[5946]: E0209 16:26:01.523825 5946 memcache.go:206] couldn't get resource list for management.cattle.io/v3:
Feb 09 16:26:01 kube-dev-ctrl-01 rancher-system-agent[5946]: time="2024-02-09T16:26:01Z" level=info msg="Starting /v1, Kind=Secret controller"
And here are the logs from one of the Rancher pods.
2024/02/09 16:26:03 [INFO] [planner] rkecluster fleet-default/dev: waiting for at least one control plane, etcd, and worker node to be registered
2024/02/09 16:26:03 [ERROR] [rkebootstrap] fleet-default/custom-81bdf9ef4174: error getting machine by owner reference no matching controller owner ref
2024/02/09 16:26:03 [ERROR] error syncing 'fleet-default/custom-81bdf9ef4174': handler rke-bootstrap: no matching controller owner ref, requeuing
2024/02/09 16:26:03 [ERROR] [rkebootstrap] fleet-default/custom-81bdf9ef4174: error getting machine by owner reference no matching controller owner ref
2024/02/09 16:26:03 [ERROR] error syncing 'fleet-default/custom-81bdf9ef4174': handler rke-bootstrap: no matching controller owner ref, requeuing
2024/02/09 16:26:03 [ERROR] [rkebootstrap] fleet-default/custom-81bdf9ef4174: error getting machine by owner reference no matching controller owner ref
2024/02/09 16:26:03 [ERROR] error syncing 'fleet-default/custom-81bdf9ef4174': handler rke-bootstrap: no matching controller owner ref, requeuing
2024/02/09 16:26:04 [ERROR] [rkebootstrap] fleet-default/custom-81bdf9ef4174: error getting machine by owner reference no matching controller owner ref
2024/02/09 16:26:04 [ERROR] error syncing 'fleet-default/custom-81bdf9ef4174': handler rke-bootstrap: no matching controller owner ref, requeuing
2024/02/09 16:26:04 [ERROR] [rkebootstrap] fleet-default/custom-81bdf9ef4174: error getting machine by owner reference no matching controller owner ref
2024/02/09 16:26:04 [ERROR] error syncing 'fleet-default/custom-81bdf9ef4174': handler rke-bootstrap: no matching controller owner ref, requeuing
2024/02/09 16:26:04 [INFO] [planner] rkecluster fleet-default/dev: waiting for at least one control plane, etcd, and worker node to be registered
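For what it's worth, the "no matching controller owner ref" handler error suggests the RKEBootstrap object has no controller ownerReference pointing at a CAPI Machine. A sketch of how to compare the two from the Rancher (local) cluster; the object name is copied from the log lines above, and the fully-qualified CRD names are an assumption based on what Rancher v2.7+ normally registers:

# show the owner references on the stuck bootstrap object
kubectl -n fleet-default get rkebootstraps.rke.cattle.io custom-81bdf9ef4174 -o jsonpath='{.metadata.ownerReferences}'

# list the CAPI Machine objects that should own it
kubectl -n fleet-default get machines.cluster.x-k8s.io

If no Machine shows up for the registered node, the problem is upstream of the node itself.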
I've upgraded to Rancher 2.8.2, but I'm still having this issue. Anyone else having this?
Anybody? I've also built a brand new Rancher cluster, and I am still having the exact same issue. As a test, I built another cluster using Rancher v2.6.14 and it does not have this issue.
I'm having the same issue on Rancher v2.8.2. The bootstrap node is stuck in "Waiting for Node Ref". I tried removing the clusters, using other fresh VMs for the bootstrap node, and other CNIs. Basically, I'm not able to provision any new custom cluster in Rancher, and I see nothing suspicious in the logs.
The strange thing is that I was able to create a cluster a few days ago. Suddenly it stopped working, without any changes to Rancher itself. The control plane on the cluster itself is running and I can interact with it using kubectl:
root@localhost:~# /var/lib/rancher/rke2/bin/kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
cattle-fleet-system fleet-agent-7f9ccfb8b-b5vd8 1/1 Running 0 26m
cattle-system cattle-cluster-agent-664dc5fb7-n7vxw 1/1 Running 0 6m39s
cattle-system helm-operation-b8gh2 0/2 Completed 0 26m
cattle-system rancher-webhook-7dc6679459-8k9bm 1/1 Running 0 26m
cattle-system system-upgrade-controller-78cfb99bb7-cbwlp 1/1 Running 0 26m
kube-system cilium-9jz5d 1/1 Running 0 27m
kube-system cilium-operator-86d6785dd8-ftgwc 0/1 Pending 0 27m
kube-system cilium-operator-86d6785dd8-gxcrq 1/1 Running 0 27m
kube-system cloud-controller-manager-localhost 1/1 Running 0 28m
kube-system etcd-localhost 1/1 Running 0 28m
kube-system helm-install-rke2-cilium-nx6tj 0/1 Completed 0 28m
kube-system helm-install-rke2-coredns-q4ppd 0/1 Completed 0 28m
kube-system helm-install-rke2-snapshot-controller-crd-rxcz2 0/1 Completed 0 28m
kube-system helm-install-rke2-snapshot-controller-f5tcf 0/1 Completed 1 28m
kube-system helm-install-rke2-snapshot-validation-webhook-bnswn 0/1 Completed 0 28m
kube-system kube-apiserver-localhost 1/1 Running 0 28m
kube-system kube-controller-manager-localhost 1/1 Running 0 28m
kube-system kube-proxy-localhost 1/1 Running 0 28m
kube-system kube-scheduler-localhost 1/1 Running 0 28m
kube-system rke2-coredns-rke2-coredns-94745d-vl5pv 1/1 Running 0 27m
kube-system rke2-coredns-rke2-coredns-autoscaler-d8587b89c-gt4vw 1/1 Running 0 27m
kube-system rke2-snapshot-controller-7d6476d7cb-mjbbw 1/1 Running 0 27m
kube-system rke2-snapshot-validation-webhook-5649fbd66c-s6l7q 1/1 Running 0 27m
Last logs from rancher-system-agent:
Mar 27 08:16:32 localhost rancher-system-agent[1084]: time="2024-03-27T08:16:32Z" level=info msg="[Applyinator] No image provided, creating empty working directory /var/lib/rancher/agent/work/20240327-081632/a9a590004a40fe89f88d887599130d4b67381fd410b6ce980978a5199d509dfc_0"
Mar 27 08:16:32 localhost rancher-system-agent[1084]: time="2024-03-27T08:16:32Z" level=info msg="[Applyinator] Running command: sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null]"
Mar 27 08:16:32 localhost rancher-system-agent[1084]: time="2024-03-27T08:16:32Z" level=info msg="[a9a590004a40fe89f88d887599130d4b67381fd410b6ce980978a5199d509dfc_0:stdout]: Name Location Size Created"
Mar 27 08:16:32 localhost rancher-system-agent[1084]: time="2024-03-27T08:16:32Z" level=info msg="[Applyinator] Command sh [-c rke2 etcd-snapshot list --etcd-s3=false 2>/dev/null] finished with err: <nil> and exit code: 0"
Mar 27 08:16:32 localhost rancher-system-agent[1084]: time="2024-03-27T08:16:32Z" level=error msg="error loading CA cert for probe (kube-controller-manager) /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt: open /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt: no such file or directory"
Mar 27 08:16:32 localhost rancher-system-agent[1084]: time="2024-03-27T08:16:32Z" level=error msg="error while appending ca cert to pool for probe kube-controller-manager"
Mar 27 08:16:32 localhost rancher-system-agent[1084]: time="2024-03-27T08:16:32Z" level=error msg="error loading CA cert for probe (kube-scheduler) /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: open /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: no such file or directory"
Mar 27 08:16:32 localhost rancher-system-agent[1084]: time="2024-03-27T08:16:32Z" level=error msg="error while appending ca cert to pool for probe kube-scheduler"
Mar 27 08:16:32 localhost rancher-system-agent[1084]: time="2024-03-27T08:16:32Z" level=info msg="[K8s] updated plan secret fleet-default/custom-7cba866a2c6b-machine-plan with feedback"
Mar 27 08:16:32 localhost rancher-system-agent[1084]: time="2024-03-27T08:16:32Z" level=error msg="error loading CA cert for probe (kube-scheduler) /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: open /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-scheduler.crt: no such file or directory"
Mar 27 08:16:32 localhost rancher-system-agent[1084]: time="2024-03-27T08:16:32Z" level=error msg="error while appending ca cert to pool for probe kube-scheduler"
Mar 27 08:16:32 localhost rancher-system-agent[1084]: time="2024-03-27T08:16:32Z" level=error msg="error loading CA cert for probe (kube-controller-manager) /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt: open /var/lib/rancher/rke2/server/tls/kube-controller-manager/kube-controller-manager.crt: no such file or directory"
Mar 27 08:16:32 localhost rancher-system-agent[1084]: time="2024-03-27T08:16:32Z" level=error msg="error while appending ca cert to pool for probe kube-controller-manager"
Mar 27 08:16:37 localhost rancher-system-agent[1084]: time="2024-03-27T08:16:37Z" level=info msg="[K8s] updated plan secret fleet-default/custom-7cba866a2c6b-machine-plan with feedback"
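Those probe errors may simply mean rke2 has not (yet) generated the control-plane certificates the agent is probing for; a quick check on the node, with the paths copied from the errors above:

# do the probe target certs exist at all?
ls -l /var/lib/rancher/rke2/server/tls/kube-controller-manager/ /var/lib/rancher/rke2/server/tls/kube-scheduler/ 2>/dev/null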
Last log from rke2-server:
Mar 27 08:16:09 localhost rke2[1231]: time="2024-03-27T08:16:09Z" level=info msg="Running kubelet --address=0.0.0.0 --alsologtostderr=false --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --cgroup-driver=systemd --client-ca-file=/var/lib/rancher/rke2/agent/client-ca.crt --cloud-provider=external --cluster-dns=10.43.0.10 --cluster-domain=cluster.local --container-runtime-endpoint=unix:///run/k3s/containerd/containerd.sock --containerd=/run/k3s/containerd/containerd.sock --eviction-hard=imagefs.available<5%,nodefs.available<5% --eviction-minimum-reclaim=imagefs.available=10%,nodefs.available=10% --fail-swap-on=false --feature-gates=CloudDualStackNodeIPs=true --healthz-bind-address=127.0.0.1 --hostname-override=localhost --kubeconfig=/var/lib/rancher/rke2/agent/kubelet.kubeconfig --log-file=/var/lib/rancher/rke2/agent/logs/kubelet.log --log-file-max-size=50 --logtostderr=false --node-ip=172.105.86.148,2a01:7e01::f03c:94ff:fe4f:e035 --node-labels=cattle.io/os=linux,rke.cattle.io/machine=3ed38de2-c6b7-4f16-9fb8-8fb7a0caf4e4 --pod-infra-container-image=index.docker.io/rancher/mirrored-pause:3.6 --pod-manifest-path=/var/lib/rancher/rke2/agent/pod-manifests --read-only-port=0 --resolv-conf=/run/systemd/resolve/resolv.conf --serialize-image-pulls=false --stderrthreshold=FATAL --tls-cert-file=/var/lib/rancher/rke2/agent/serving-kubelet.crt --tls-private-key-file=/var/lib/rancher/rke2/agent/serving-kubelet.key"
Mar 27 08:16:09 localhost rke2[1231]: time="2024-03-27T08:16:09Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Mar 27 08:16:14 localhost rke2[1231]: time="2024-03-27T08:16:14Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Mar 27 08:16:16 localhost rke2[1231]: {"level":"warn","ts":"2024-03-27T08:16:16.410048Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000c5efc0/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
Mar 27 08:16:16 localhost rke2[1231]: {"level":"info","ts":"2024-03-27T08:16:16.410153Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
Mar 27 08:16:18 localhost rke2[1231]: time="2024-03-27T08:16:18Z" level=info msg="Defragmenting etcd database"
Mar 27 08:16:18 localhost rke2[1231]: time="2024-03-27T08:16:18Z" level=info msg="etcd data store connection OK"
Mar 27 08:16:18 localhost rke2[1231]: time="2024-03-27T08:16:18Z" level=info msg="Saving cluster bootstrap data to datastore"
Mar 27 08:16:18 localhost rke2[1231]: time="2024-03-27T08:16:18Z" level=info msg="Waiting for API server to become available"
Mar 27 08:16:19 localhost rke2[1231]: time="2024-03-27T08:16:19Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Mar 27 08:16:21 localhost rke2[1231]: time="2024-03-27T08:16:21Z" level=info msg="Pod for etcd is synced"
Mar 27 08:16:21 localhost rke2[1231]: time="2024-03-27T08:16:21Z" level=info msg="Pod for kube-apiserver is synced"
Mar 27 08:16:24 localhost rke2[1231]: time="2024-03-27T08:16:24Z" level=warning msg="Failed to list nodes with etcd role: runtime core not ready"
Mar 27 08:16:24 localhost rke2[1231]: time="2024-03-27T08:16:24Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Mar 27 08:16:29 localhost rke2[1231]: time="2024-03-27T08:16:29Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Mar 27 08:16:31 localhost rke2[1231]: time="2024-03-27T08:16:31Z" level=info msg="Waiting for API server to become available"
Mar 27 08:16:31 localhost rke2[1231]: time="2024-03-27T08:16:31Z" level=info msg="Waiting for cloud-controller-manager privileges to become available"
Mar 27 08:16:31 localhost rke2[1231]: time="2024-03-27T08:16:31Z" level=info msg="Kube API server is now running"
Mar 27 08:16:31 localhost rke2[1231]: time="2024-03-27T08:16:31Z" level=info msg="Creating rke2-supervisor event broadcaster"
Mar 27 08:16:31 localhost rke2[1231]: time="2024-03-27T08:16:31Z" level=info msg="Watching for delete of localhost Node object"
Mar 27 08:16:31 localhost rke2[1231]: time="2024-03-27T08:16:31Z" level=info msg="Applying Cluster Role Bindings"
Mar 27 08:16:31 localhost rke2[1231]: time="2024-03-27T08:16:31Z" level=info msg="Applying CRD addons.k3s.cattle.io"
Mar 27 08:16:31 localhost rke2[1231]: time="2024-03-27T08:16:31Z" level=info msg="Applying CRD etcdsnapshotfiles.k3s.cattle.io"
Mar 27 08:16:31 localhost rke2[1231]: time="2024-03-27T08:16:31Z" level=info msg="Applying CRD helmcharts.helm.cattle.io"
Mar 27 08:16:32 localhost rke2[1231]: time="2024-03-27T08:16:32Z" level=info msg="Applying CRD helmchartconfigs.helm.cattle.io"
Mar 27 08:16:32 localhost rke2[1231]: time="2024-03-27T08:16:32Z" level=info msg="Annotations and labels have been set successfully on node: localhost"
Mar 27 08:16:34 localhost rke2[1231]: time="2024-03-27T08:16:34Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
Mar 27 08:16:34 localhost rke2[1231]: time="2024-03-27T08:16:34Z" level=info msg="Cluster Role Bindings applied successfully"
Mar 27 08:16:34 localhost rke2[1231]: time="2024-03-27T08:16:34Z" level=info msg="ETCD server is now running"
Mar 27 08:16:34 localhost rke2[1231]: time="2024-03-27T08:16:34Z" level=info msg="rke2 is up and running"
Mar 27 08:16:34 localhost systemd[1]: Started Rancher Kubernetes Engine v2 (server).
Mar 27 08:16:34 localhost rke2[1231]: time="2024-03-27T08:16:34Z" level=info msg="Failed to get existing traefik HelmChart" error="helmcharts.helm.cattle.io \"traefik\" not found"
Mar 27 08:16:34 localhost rke2[1231]: time="2024-03-27T08:16:34Z" level=info msg="Reconciling ETCDSnapshotFile resources"
Mar 27 08:16:34 localhost rke2[1231]: time="2024-03-27T08:16:34Z" level=info msg="Starting managed etcd node metadata controller"
Mar 27 08:16:34 localhost rke2[1231]: time="2024-03-27T08:16:34Z" level=info msg="Tunnel server egress proxy mode: agent"
Mar 27 08:16:34 localhost rke2[1231]: time="2024-03-27T08:16:34Z" level=info msg="Reconciliation of ETCDSnapshotFile resources complete"
Mar 27 08:16:35 localhost rke2[1231]: time="2024-03-27T08:16:35Z" level=info msg="Starting k3s.cattle.io/v1, Kind=Addon controller"
Mar 27 08:16:35 localhost rke2[1231]: time="2024-03-27T08:16:35Z" level=info msg="Creating deploy event broadcaster"
Mar 27 08:16:35 localhost rke2[1231]: time="2024-03-27T08:16:35Z" level=info msg="Starting /v1, Kind=Node controller"
Mar 27 08:16:35 localhost rke2[1231]: time="2024-03-27T08:16:35Z" level=info msg="Cluster dns configmap has been set successfully"
Mar 27 08:16:35 localhost rke2[1231]: time="2024-03-27T08:16:35Z" level=info msg="Starting managed etcd apiserver addresses controller"
Mar 27 08:16:35 localhost rke2[1231]: time="2024-03-27T08:16:35Z" level=info msg="Starting managed etcd member removal controller"
Mar 27 08:16:35 localhost rke2[1231]: time="2024-03-27T08:16:35Z" level=info msg="Starting managed etcd snapshot ConfigMap controller"
Mar 27 08:16:35 localhost rke2[1231]: time="2024-03-27T08:16:35Z" level=info msg="Creating helm-controller event broadcaster"
Mar 27 08:16:35 localhost rke2[1231]: time="2024-03-27T08:16:35Z" level=info msg="Creating reencrypt-controller event broadcaster"
Mar 27 08:16:35 localhost rke2[1231]: time="2024-03-27T08:16:35Z" level=info msg="Starting k3s.cattle.io/v1, Kind=ETCDSnapshotFile controller"
Mar 27 08:16:35 localhost rke2[1231]: time="2024-03-27T08:16:35Z" level=info msg="Starting helm.cattle.io/v1, Kind=HelmChart controller"
Mar 27 08:16:35 localhost rke2[1231]: time="2024-03-27T08:16:35Z" level=info msg="Starting helm.cattle.io/v1, Kind=HelmChartConfig controller"
Mar 27 08:16:35 localhost rke2[1231]: time="2024-03-27T08:16:35Z" level=info msg="Starting rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding controller"
Mar 27 08:16:35 localhost rke2[1231]: time="2024-03-27T08:16:35Z" level=info msg="Starting batch/v1, Kind=Job controller"
Mar 27 08:16:35 localhost rke2[1231]: time="2024-03-27T08:16:35Z" level=info msg="Starting /v1, Kind=ConfigMap controller"
Mar 27 08:16:35 localhost rke2[1231]: time="2024-03-27T08:16:35Z" level=info msg="Starting /v1, Kind=ServiceAccount controller"
Mar 27 08:16:35 localhost rke2[1231]: time="2024-03-27T08:16:35Z" level=info msg="Starting /v1, Kind=Secret controller"
Mar 27 08:16:36 localhost rke2[1231]: time="2024-03-27T08:16:36Z" level=info msg="Active TLS secret kube-system/rke2-serving (ver=340) (count 10): map[listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-172.105.86.148:172.105.86.148 listener.cattle.io/cn-2a01_7e01__f03c_94ff_fe4f_e035-c569d0:2a01:7e01::f03c:94ff:fe4f:e035 listener.cattle.io/cn-__1-f16284:::1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc:kubernetes.default.svc listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=C382624516462EAA2F64D72CC2BF1D48BF49AA38]"
Mar 27 08:16:36 localhost rke2[1231]: time="2024-03-27T08:16:36Z" level=error msg="error syncing 'kube-system/rke2-cilium': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-cilium\" not found, requeuing"
Mar 27 08:16:36 localhost rke2[1231]: time="2024-03-27T08:16:36Z" level=info msg="Stopped tunnel to 127.0.0.1:9345"
Mar 27 08:16:36 localhost rke2[1231]: time="2024-03-27T08:16:36Z" level=info msg="Connecting to proxy" url="wss://172.105.86.148:9345/v1-rke2/connect"
Mar 27 08:16:36 localhost rke2[1231]: time="2024-03-27T08:16:36Z" level=info msg="Proxy done" err="context canceled" url="wss://127.0.0.1:9345/v1-rke2/connect"
Mar 27 08:16:36 localhost rke2[1231]: time="2024-03-27T08:16:36Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
Mar 27 08:16:36 localhost rke2[1231]: time="2024-03-27T08:16:36Z" level=info msg="Handling backend connection request [localhost]"
Mar 27 08:16:36 localhost rke2[1231]: time="2024-03-27T08:16:36Z" level=info msg="Reconciling snapshot ConfigMap data"
Mar 27 08:16:36 localhost rke2[1231]: time="2024-03-27T08:16:36Z" level=error msg="error syncing 'kube-system/rke2-coredns': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-coredns\" not found, requeuing"
Mar 27 08:16:37 localhost rke2[1231]: time="2024-03-27T08:16:37Z" level=info msg="Updating TLS secret for kube-system/rke2-serving (count: 10): map[listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-172.105.86.148:172.105.86.148 listener.cattle.io/cn-2a01_7e01__f03c_94ff_fe4f_e035-c569d0:2a01:7e01::f03c:94ff:fe4f:e035 listener.cattle.io/cn-__1-f16284:::1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc:kubernetes.default.svc listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=C382624516462EAA2F64D72CC2BF1D48BF49AA38]"
Mar 27 08:16:37 localhost rke2[1231]: time="2024-03-27T08:16:37Z" level=info msg="Labels and annotations have been set successfully on node: localhost"
Mar 27 08:16:37 localhost rke2[1231]: time="2024-03-27T08:16:37Z" level=error msg="error syncing 'kube-system/rke2-snapshot-controller-crd': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-snapshot-controller-crd\" not found, requeuing"
Mar 27 08:16:38 localhost rke2[1231]: time="2024-03-27T08:16:38Z" level=error msg="error syncing 'kube-system/rke2-snapshot-controller': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-snapshot-controller\" not found, requeuing"
Mar 27 08:16:39 localhost rke2[1231]: time="2024-03-27T08:16:39Z" level=info msg="Adding node localhost etcd status condition"
Mar 27 08:16:39 localhost rke2[1231]: time="2024-03-27T08:16:39Z" level=error msg="error syncing 'kube-system/rke2-snapshot-validation-webhook': handler helm-controller-chart-registration: helmcharts.helm.cattle.io \"rke2-snapshot-validation-webhook\" not found, requeuing"
Mar 27 08:16:39 localhost rke2[1231]: time="2024-03-27T08:16:39Z" level=info msg="Running kube-proxy --cluster-cidr=10.42.0.0/16 --conntrack-max-per-core=0 --conntrack-tcp-timeout-close-wait=0s --conntrack-tcp-timeout-established=0s --healthz-bind-address=127.0.0.1 --hostname-override=localhost --kubeconfig=/var/lib/rancher/rke2/agent/kubeproxy.kubeconfig --proxy-mode=iptables"
Mar 27 08:17:18 localhost rke2[1231]: time="2024-03-27T08:17:18Z" level=info msg="Tunnel authorizer set Kubelet Port 10250"
Mar 27 08:18:06 localhost rke2[1231]: time="2024-03-27T08:18:06Z" level=info msg="Updating TLS secret for kube-system/rke2-serving (count: 11): map[field.cattle.io/projectId:c-m-vd4n4gmr:p-t89rc listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-172.105.86.148:172.105.86.148 listener.cattle.io/cn-2a01_7e01__f03c_94ff_fe4f_e035-c569d0:2a01:7e01::f03c:94ff:fe4f:e035 listener.cattle.io/cn-__1-f16284:::1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc:kubernetes.default.svc listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=C382624516462EAA2F64D72CC2BF1D48BF49AA38]"
Mar 27 08:18:06 localhost rke2[1231]: time="2024-03-27T08:18:06Z" level=info msg="Active TLS secret kube-system/rke2-serving (ver=1667) (count 11): map[field.cattle.io/projectId:c-m-vd4n4gmr:p-t89rc listener.cattle.io/cn-10.43.0.1:10.43.0.1 listener.cattle.io/cn-127.0.0.1:127.0.0.1 listener.cattle.io/cn-172.105.86.148:172.105.86.148 listener.cattle.io/cn-2a01_7e01__f03c_94ff_fe4f_e035-c569d0:2a01:7e01::f03c:94ff:fe4f:e035 listener.cattle.io/cn-__1-f16284:::1 listener.cattle.io/cn-kubernetes:kubernetes listener.cattle.io/cn-kubernetes.default:kubernetes.default listener.cattle.io/cn-kubernetes.default.svc:kubernetes.default.svc listener.cattle.io/cn-kubernetes.default.svc.cluster.local:kubernetes.default.svc.cluster.local listener.cattle.io/cn-localhost:localhost listener.cattle.io/fingerprint:SHA1=C382624516462EAA2F64D72CC2BF1D48BF49AA38]"
I have the same issue, posted a new issue here: https://github.com/rancher/rancher/issues/44939
In my case, rancher-webhook fails to install, and possibly that stops the bootstrapping. This is from the first control-plane node of the downstream cluster being provisioned.
/var/lib/rancher/rke2/bin/crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock logs 9f00d5465f44b
time="2024-03-27T08:11:34Z" level=error msg="Failed to install system chart rancher-webhook: pod cattle-system/helm-operation-s7zxv failed, watch closed"
time="2024-03-27T08:12:36Z" level=error msg="Failed to install system chart rancher-webhook: pod cattle-system/helm-operation-bkzjs failed, watch closed"
time="2024-03-27T08:16:34Z" level=error msg="Failed to install system chart rancher-webhook: pod cattle-system/helm-operation-klw7b failed, watch closed"
time="2024-03-27T08:17:36Z" level=error msg="Failed to install system chart rancher-webhook: pod cattle-system/helm-operation-9b8f5 failed, watch closed"
time="2024-03-27T08:21:34Z" level=error msg="Failed to install system chart rancher-webhook: pod cattle-system/helm-operation-p42v6 failed, watch closed"
I faced the same issue: I was stuck at the log line "non-ready bootstrap machine(s) custom-4ff27335d03c and join url to be available on bootstrap node". In my specific case, I only had to remove the config item rke_config.machine_global_config.disable-cloud-controller to get the cluster provisioned. So my conclusion is that it is possible to configure the cluster using rke_config, but this specific parameter must be left out. Maybe the cloud controller is required even if no cloud services are used; it seems the RKE2 embedded cloud provider is then used.
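To double-check what Rancher actually rendered onto a node, you can grep the RKE2 config it manages; the config.yaml.d drop-in directory is where the system agent normally writes Rancher-managed settings, so treat the exact location as an assumption:

# does the rendered config still carry the flag?
grep -iR "disable-cloud-controller" /etc/rancher/rke2/config.yaml /etc/rancher/rke2/config.yaml.d/ 2>/dev/null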
OS Version: Rocky Linux 9.3 (9.2 also tried)
RKE2 (in which Rancher is deployed by Helm):
root@m0:~# rke2 -version
rke2 version v1.26.10+rke2r2 (21e3a8c82da71473f2b846065dcab197a9b2c9d8)
go version go1.20.10 X:boringcrypto
Rancher: v2.7.9
After creating the cluster in the Rancher UI, I tried to install a control-plane & etcd node by running
curl -fL https://rancher2.cnzhonglunnet.com/system-agent-install.sh | sudo sh -s - --server https://rancher2.cnzhonglunnet.com --label 'cattle.io/os=linux' --token 2bmwq69ff9vh4swfhqdgrh74bwgwbslv2c92hf4shpv58rsw92w5b8 --etcd --controlplane
on a VM with Rocky Linux 9.3 (9.2 also tried). The status of the machine in the Rancher UI stays pending at "Waiting for Node Ref", as shown below.
The VM message log is as follows:
Rancher pod error:
The problem can be easily reproduced. I have tried on Rancher 2.7.8, and on Rocky Linux and Ubuntu 22 LTS as well, with the same error.
Any fix or clue on this issue? Thanks!
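For anyone reproducing this, a quick sanity pass on a stuck node, using only commands already mentioned in this thread, is to confirm the agent registered and actually received a plan:

# is the agent up and watching for plans?
systemctl status rancher-system-agent --no-pager

# did it ever log an "[Applyinator]" line, i.e. did a plan arrive?
journalctl -u rancher-system-agent --no-pager | grep -c Applyinator

A zero count means the node never received a plan from Rancher, which matches the hang described above.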