ionos-cloud / cluster-api-provider-proxmox

Cluster API Provider for Proxmox VE (CAPMOX)
Apache License 2.0

Cluster Nodes 'NotReady' after reboot #289

Closed: cfredericksen closed this issue 2 weeks ago

cfredericksen commented 3 weeks ago

What steps did you take and what happened: I shut down all control plane and worker nodes for planned maintenance in the datacenter, then booted them back up. Every node's hostname had changed to 'CIMISSINGJINJAVARhostname'. Syslog showed kubelet errors saying that the hostname had no permission to access the API. I haven't changed any of the hostname generation settings; everything is at its default.

What did you expect to happen: All node hostnames would remain static and the nodes would stay in the 'Ready' state.

Manual Workaround:

ssh root@10.81.0.100 "hostnamectl set-hostname prod-control-plane-dp5xf;systemctl restart kubelet"
ssh root@10.81.0.118 "hostnamectl set-hostname prod-control-plane-r2vqq;systemctl restart kubelet"
ssh root@10.81.0.111 "hostnamectl set-hostname prod-control-plane-t9sjr;systemctl restart kubelet"
ssh root@10.81.0.126 "hostnamectl set-hostname prod-worker-4jtpv;systemctl restart kubelet"
ssh root@10.81.0.119 "hostnamectl set-hostname prod-worker-5j99j;systemctl restart kubelet"
ssh root@10.81.0.116 "hostnamectl set-hostname prod-worker-5wjn6;systemctl restart kubelet"
ssh root@10.81.0.122 "hostnamectl set-hostname prod-worker-8lzjv;systemctl restart kubelet"
ssh root@10.81.0.128 "hostnamectl set-hostname prod-worker-9kld5;systemctl restart kubelet"
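
For a larger cluster, the same workaround could be driven from a list of IP and hostname pairs instead of one hand-written line per node. The following is only a sketch under assumptions: the `NODES` list, the `remote_cmd` and `fix_nodes` helpers, and the `DRY_RUN` switch are hypothetical names introduced here, and only two of the pairs from this report are shown. By default it prints the ssh commands rather than running them.

```shell
#!/usr/bin/env bash
# Dry-run sketch: reapply the expected hostname on each node and restart kubelet.
set -eu

# IP=hostname pairs, one per line (first two nodes from this report).
NODES="10.81.0.100=prod-control-plane-dp5xf
10.81.0.118=prod-control-plane-r2vqq"

# Build the command that runs on the node itself.
remote_cmd() {
  printf 'hostnamectl set-hostname %s; systemctl restart kubelet' "$1"
}

# Print (DRY_RUN=1, the default) or execute the ssh call for every pair.
fix_nodes() {
  local entry ip name
  while IFS= read -r entry; do
    [ -n "$entry" ] || continue
    ip=${entry%%=*}     # text before the first '='
    name=${entry#*=}    # text after the first '='
    if [ "${DRY_RUN:-1}" = "1" ]; then
      printf 'ssh root@%s "%s"\n' "$ip" "$(remote_cmd "$name")"
    else
      # -n keeps ssh from consuming the rest of the node list on stdin.
      ssh -n "root@$ip" "$(remote_cmd "$name")"
    fi
  done <<EOF
$NODES
EOF
}

fix_nodes
```

Running it with `DRY_RUN=0` would execute the commands over ssh; reviewing the printed output first seems the safer order.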

After the kubelet restart, all of the nodes became available again.

Environment: Proxmox 8.1

mcbenjemaa commented 2 weeks ago

Thanks for raising this.

CAPMOX doesn't support rebooting machines for maintenance. I have tested it before, and although it worked, I didn't trust it.

In our case, we manage maintenance differently: we migrate all machines to another node, perform the maintenance, and then move the machines back to where they were running. We have an operator that does this. I'm not sure, though, whether we will open-source it.
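
The operator mentioned above is not public, so the following is not its implementation. As a rough manual sketch of the migrate-then-maintain approach, assuming the VMs can be live-migrated with Proxmox's `qm migrate` command, a dry run might look like this. The target node name, the VM IDs, and the `migrate_cmds` helper are all hypothetical placeholders.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the qm commands that would live-migrate VMs off a host.
set -eu

TARGET_NODE="pve2"    # hypothetical name of the Proxmox node to migrate to
VMIDS="101 102 103"   # hypothetical VM IDs of the cluster machines

# Emit one `qm migrate` command per VM ID; nothing is executed here.
migrate_cmds() {
  local vmid
  for vmid in $VMIDS; do
    printf 'qm migrate %s %s --online\n' "$vmid" "$TARGET_NODE"
  done
}

migrate_cmds
```

The printed commands would be run on the Proxmox host that currently holds the VMs, and the same pattern with the original node as the target would move them back afterwards.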

cfredericksen commented 2 weeks ago

Thanks for the info, @mcbenjemaa. I will adjust our process if we have to power down the clusters again.