rancher / rke2

https://docs.rke2.io/
Apache License 2.0

After a reboot, the node does not return to the rke2 cluster #4516

Closed. grig0701 closed this issue 1 year ago.

grig0701 commented 1 year ago

We set up a highly available rke2 cluster with 3 masters and 3 workers. After we shut down the second worker for a day, we noticed it was no longer showing up in the cluster, and when we turned it back on it never rejoined.

We tried reinstalling the agent: first we uninstalled it with /usr/local/bin/rke2-uninstall.sh and removed the worker's node-password secret from the cluster with kubectl delete secrets worker-002.node-password.rke2 -n kube-system. After that we executed:

curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE="agent" sh -

systemctl enable rke2-agent.service

systemctl start rke2-agent.service
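Putting the whole attempt together for clarity (step 2 runs from a machine with kubectl access to the cluster; the node-password secret name follows the <node-name>.node-password.rke2 pattern, worker-002 being our node-name):

# 1. On the worker: remove the previous agent install and its data
/usr/local/bin/rke2-uninstall.sh
# 2. From a machine with cluster access: delete the stored node password for the worker
kubectl delete secret worker-002.node-password.rke2 -n kube-system
# 3. On the worker: reinstall, enable and start the agent
curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE="agent" sh -
systemctl enable rke2-agent.service
systemctl start rke2-agent.service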

But the worker did not rejoin the cluster, and it produced the following logs:

-- Journal begins at Mon 2023-07-17 15:49:01 CEST. --
Jul 27 15:20:08 Debian-1107-bullseye-amd64-base rke2[2365]: Flag --tls-private-key-file has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.
Jul 27 15:20:08 Debian-1107-bullseye-amd64-base rke2[2270]: time="2023-07-27T15:20:08+02:00" level=info msg="Running kube-proxy --cluster-cidr=10.42.0.0/16 --conntrack-max-per-core=0 --conntrack-tcp-timeout-close-wait=0s --conntrack-tcp-timeout-established=0s --healthz-bind-address=127.0.0.1 --hostname-override=worker-002 --kubeconfig=/var/lib/rancher/rke2/agent/kubeproxy.kubeconfig --proxy-mode=iptables"
Jul 27 15:20:09 Debian-1107-bullseye-amd64-base rke2[2270]: time="2023-07-27T15:20:09+02:00" level=info msg="Failed to set annotations and labels on node worker-002: Operation cannot be fulfilled on nodes \"worker-002\": the object has been modified; please apply your changes to the latest version and try again"
Jul 27 15:20:09 Debian-1107-bullseye-amd64-base rke2[2270]: time="2023-07-27T15:20:09+02:00" level=info msg="Failed to set annotations and labels on node worker-002: Operation cannot be fulfilled on nodes \"worker-002\": the object has been modified; please apply your changes to the latest version and try again"
Jul 27 15:20:09 Debian-1107-bullseye-amd64-base rke2[2270]: time="2023-07-27T15:20:09+02:00" level=info msg="Failed to set annotations and labels on node worker-002: Operation cannot be fulfilled on nodes \"worker-002\": the object has been modified; please apply your changes to the latest version and try again"
Jul 27 15:20:09 Debian-1107-bullseye-amd64-base rke2[2270]: time="2023-07-27T15:20:09+02:00" level=info msg="Failed to set annotations and labels on node worker-002: Operation cannot be fulfilled on nodes \"worker-002\": the object has been modified; please apply your changes to the latest version and try again"
Jul 27 15:20:09 Debian-1107-bullseye-amd64-base rke2[2270]: time="2023-07-27T15:20:09+02:00" level=info msg="Failed to set annotations and labels on node worker-002: Operation cannot be fulfilled on nodes \"worker-002\": the object has been modified; please apply your changes to the latest version and try again"
Jul 27 15:20:09 Debian-1107-bullseye-amd64-base rke2[2270]: time="2023-07-27T15:20:09+02:00" level=info msg="Annotations and labels have been set successfully on node: worker-002"
Jul 27 15:20:09 Debian-1107-bullseye-amd64-base rke2[2270]: time="2023-07-27T15:20:09+02:00" level=info msg="rke2 agent is up and running"
Jul 27 15:20:09 Debian-1107-bullseye-amd64-base systemd[1]: Started Rancher Kubernetes Engine v2 (agent).
Jul 27 15:20:09 Debian-1107-bullseye-amd64-base rke2[2270]: time="2023-07-27T15:20:09+02:00" level=debug msg="Waiting for Ready condition to be updated for Kubelet Port assignment"
Jul 27 15:20:10 Debian-1107-bullseye-amd64-base rke2[2270]: time="2023-07-27T15:20:10+02:00" level=debug msg="Tunnel authorizer failed to get Kubelet Port: nodes \"worker-002\" not found"
Jul 27 15:20:11 Debian-1107-bullseye-amd64-base rke2[2270]: time="2023-07-27T15:20:11+02:00" level=debug msg="Tunnel authorizer failed to get Kubelet Port: nodes \"worker-002\" not found"
Jul 27 15:20:12 Debian-1107-bullseye-amd64-base rke2[2270]: time="2023-07-27T15:20:12+02:00" level=debug msg="Tunnel authorizer failed to get Kubelet Port: nodes \"worker-002\" not found"
Jul 27 15:20:13 Debian-1107-bullseye-amd64-base rke2[2270]: time="2023-07-27T15:20:13+02:00" level=debug msg="Tunnel authorizer failed to get Kubelet Port: nodes \"worker-002\" not found"
Jul 27 15:20:13 Debian-1107-bullseye-amd64-base rke2[2270]: time="2023-07-27T15:20:13+02:00" level=debug msg="Wrote ping"
Jul 27 15:20:13 Debian-1107-bullseye-amd64-base rke2[2270]: time="2023-07-27T15:20:13+02:00" level=debug msg="Wrote ping"
Jul 27 15:20:13 Debian-1107-bullseye-amd64-base rke2[2270]: time="2023-07-27T15:20:13+02:00" level=debug msg="Wrote ping"

Master logs:

Jul 27 13:16:30 rancher-master-1 rke2[2701897]: time="2023-07-27T13:16:30Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
Jul 27 13:17:30 rancher-master-1 rke2[2701897]: time="2023-07-27T13:17:30Z" level=info msg="certificate CN=worker-2 signed by CN=rke2-server-ca@1690356332: notBefore=2023-07-26 07:25:32 +0000 UTC notAfter=2024-07-26 13:17:30 +0000 UTC"
Jul 27 13:17:31 rancher-master-1 rke2[2701897]: time="2023-07-27T13:17:31Z" level=info msg="certificate CN=system:node:worker-2,O=system:nodes signed by CN=rke2-client-ca@1690356332: notBefore=2023-07-26 07:25:32 +0000 UTC notAfter=2024-07-26 13:17:31 +0000 UTC"
Jul 27 13:17:32 rancher-master-1 rke2[2701897]: time="2023-07-27T13:17:32Z" level=info msg="Handling backend connection request [worker-2]"
Jul 27 13:17:34 rancher-master-1 rke2[2701897]: time="2023-07-27T13:17:34Z" level=error msg="error syncing 'worker-2': handler managed-etcd-controller: Operation cannot be fulfilled on nodes \"worker-2\": StorageError: invalid object, Code: 4, Key: /registry/minions/worker-2, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 1de08688-fea1-4e09-81ad-a928f4bd6e42, UID in object meta: , requeuing"
Jul 27 13:17:56 rancher-master-1 rke2[2701897]: time="2023-07-27T13:17:56Z" level=info msg="error in remotedialer server [400]: websocket: close 1006 (abnormal closure): unexpected EOF"
Jul 27 13:20:03 rancher-master-1 rke2[2701897]: time="2023-07-27T13:20:03Z" level=info msg="certificate CN=system:node:worker-002,O=system:nodes signed by CN=rke2-client-ca@1690356332: notBefore=2023-07-26 07:25:32 +0000 UTC notAfter=2024-07-26 13:20:03 +0000 UTC"
Jul 27 13:20:08 rancher-master-1 rke2[2701897]: time="2023-07-27T13:20:08Z" level=info msg="Handling backend connection request [worker-002]"
Jul 27 13:20:10 rancher-master-1 rke2[2701897]: time="2023-07-27T13:20:10Z" level=error msg="error syncing 'worker-002': handler managed-etcd-controller: Operation cannot be fulfilled on nodes \"worker-002\": StorageError: invalid object, Code: 4, Key: /registry/minions/worker-002, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: 44976cfb-a944-4490-9d86-25090d835ce9, UID in object meta: , requeuing"

My configurations. Example master:

write-kubeconfig-mode: "0644"
tls-san:
  - domain
  - 10.0.0.4
cloud-provider-name: external
node-ip: 10.0.0.4

Example worker:

server: https://domain:9345
token: token
node-ip: 10.0.64.3
node-name: "worker-002"
node-label:
  - "worker=true"
  - "workerid=2"
debug: true

Please help us solve this.

Version:

brandond commented 1 year ago

cloud-provider-name: external

nodes \"worker-002\" not found"

What cloud provider have you deployed? The log messages suggest that something - most likely your cloud provider - is deleting the node from the cluster when it goes down for a reboot.

VenusKanami commented 1 year ago

cloud-provider-name: external

nodes \"worker-002\" not found"

What cloud provider have you deployed? The log messages suggest that something - most likely your cloud provider - is deleting the node from the cluster when it goes down for a reboot.

Hi, thank you for your reply. I am working on this project together with the author of the question. We are using Hetzner dedicated servers as workers (one of which cannot be restored to the cluster) and Hetzner Cloud instances as masters. Reinstalling rke2-agent on the problem node and removing the worker secret via kubectl does not help restore the dedicated node to the cluster. The node also does not appear in kubectl get nodes -A.

grig0701 commented 1 year ago

cloud-provider-name: external

nodes "worker-002" not found

What cloud provider have you deployed? The log messages suggest that something - most likely your cloud provider - is deleting the node from the cluster when it goes down for a reboot.

I also want to add that we use hcloud-cloud-controller-manager

brandond commented 1 year ago

Check the cloud controller pod log; I suspect that it is deleting the node for some reason. There is nothing in RKE2 itself that will delete the node resource.
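For example, with the Hetzner cloud controller mentioned above (assuming it runs as the usual hcloud-cloud-controller-manager deployment in kube-system; adjust the name and namespace to your install), something like this should show whether it is removing the node:

# hypothetical check, not a confirmed command from this thread
kubectl -n kube-system logs deployment/hcloud-cloud-controller-manager --since=24h | grep -i worker-002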

grig0701 commented 1 year ago

Check the cloud controller pod log; I suspect that it is deleting the node for some reason. There is nothing in RKE2 itself that will delete the node resource.

Hello, thank you for your feedback. The problem was with hcloud-cloud-controller-manager. More precisely, this controller does not currently support the dedicated servers that we used for our workers. We decided not to use it, and now everything works as it should.
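For reference, a minimal sketch of what that change could look like on the servers, assuming the fix is simply to stop setting the external cloud provider and to remove the hcloud CCM deployment (adjust to your own setup):

# example server config.yaml without cloud-provider-name: external (hypothetical sketch)
write-kubeconfig-mode: "0644"
tls-san:
  - domain
  - 10.0.0.4
node-ip: 10.0.0.4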

Thanks for helping me understand what went wrong