k0sproject / k0sctl

A bootstrapping and management tool for k0s clusters.

Restoring kubelet config from backup fails #367

Open · jsalgado78 opened this issue 2 years ago

jsalgado78 commented 2 years ago

I'm getting this error on each worker node after restoring from backup:

$ journalctl -u k0sworker.service
level=info msg="Starting kubelet"
level=warning msg="failed to get initial kubelet config with join token: failed to get kubelet config from API: Unauthorized"
level=warning msg="failed to get initial kubelet config with join token: failed to get kubelet config from API: Unauthorized"
...

$ k get nodes
NAME            STATUS     ROLES    AGE   VERSION
s504des204k0s   NotReady   <none>   9d    v1.23.5+k0s
s504des205k0s   NotReady   <none>   9d    v1.23.5+k0s

I made a cluster backup using this command:

k0sctl backup -c k0s-des-airgap.yml --disable-telemetry --disable-upgrade-check

I reset all nodes with k0sctl reset -c k0s-des-airgap.yml --disable-telemetry, and then tried to restore with this command:

k0sctl apply --restore-from k0s_backup_1649314850.tar.gz -c k0s-des-airgap.yml --disable-telemetry

I have three master nodes and two worker nodes.

I'm running k0sctl v0.12.6 and k0s v1.23.5+k0s with an airgap configuration.
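
To summarise, the full sequence was roughly the following (a consolidated sketch of the commands above, using the config file and backup archive names from my environment):

# 1. Back up the cluster
k0sctl backup -c k0s-des-airgap.yml --disable-telemetry --disable-upgrade-check
# 2. Reset all nodes
k0sctl reset -c k0s-des-airgap.yml --disable-telemetry
# 3. Restore from the backup archive
k0sctl apply --restore-from k0s_backup_1649314850.tar.gz -c k0s-des-airgap.yml --disable-telemetry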

Thanks.

kke commented 2 years ago

This is pretty much exactly what the CI backup-restore test does.

I wonder if the token somehow expired between the backup and restore 🤔
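
If I remember right, a worker join token is just a gzip-compressed, base64-encoded kubeconfig, so something like this on a controller could be used to peek at the API server address and bootstrap token that a freshly created token carries (a rough sketch, assuming that token format holds for this k0s version):

# Create a short-lived worker token and decode it to inspect its contents
# (assumes the token is base64 + gzip encoded; adjust if the format differs)
/usr/local/bin/k0s token create --role worker --expiry 20m0s | base64 -d | gunzip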

jsalgado78 commented 2 years ago

The backup was made only a few minutes before the restore. I was testing whether the restore process works correctly, but it doesn't.

jsalgado78 commented 2 years ago

I have also tried deploying a new, empty cluster and got the same results after reset and restore from backup.

I made the backup with this command: ./k0sctl backup -c k0s-des-airgap.yml --disable-telemetry --disable-upgrade-check

I reset all nodes with this command: ./k0sctl reset -c k0s-des-airgap.yml --disable-telemetry

I rebooted all cluster nodes after running the k0sctl reset command.

I restored from backup with this command:

./k0sctl apply --restore-from k0s_backup_1649341000.tar.gz -c k0s-des-airgap.yml --disable-telemetry --disable-upgrade-check

The restore process appears to finish fine:

...

INFO [ssh] s504des204k0s:22: waiting for node to become ready                                                                            
INFO [ssh] s504des205k0s:22: starting service                                                                                            
INFO [ssh] s504des205k0s:22: waiting for node to become ready                                                                            
INFO ==> Running phase: Run After Apply Hooks                                                                                                                 
INFO ==> Running phase: Disconnect from hosts                                                                                                                 
INFO ==> Finished in 1m20s                
INFO k0s cluster version 1.23.5+k0s.0 is now installed 
INFO Tip: To access the cluster you can now fetch the admin kubeconfig using: 
INFO      k0sctl kubeconfig 

But the worker nodes are in NotReady status and the /etc/k0s/k0stoken files no longer contain a token:

$ ./kubectl get nodes
NAME            STATUS     ROLES    AGE   VERSION
s504des204k0s   NotReady   <none>   11m   v1.23.5+k0s
s504des205k0s   NotReady   <none>   11m   v1.23.5+k0s

[root@s504des204k0s ~]# journalctl -u k0sworker.service | tail -5
abr 07 16:30:50 s504des204k0s k0s[3551]: #8: failed to get kubelet config from API: Unauthorized
abr 07 16:30:50 s504des204k0s k0s[3551]: #9: failed to get kubelet config from API: Unauthorized
abr 07 16:30:50 s504des204k0s k0s[3551]: #10: failed to get kubelet config from API: Unauthorized
abr 07 16:30:50 s504des204k0s systemd[1]: k0sworker.service: Main process exited, code=exited, status=1/FAILURE
abr 07 16:30:50 s504des204k0s systemd[1]: k0sworker.service: Failed with result 'exit-code'.

[root@s504des204k0s ~]# cat /etc/k0s/k0stoken 
# overwritten by k0sctl after join
[root@s504des205k0s ~]# journalctl -u k0sworker|tail -5
abr 07 16:30:50 s504des205k0s k0s[3538]: #8: failed to get kubelet config from API: Unauthorized
abr 07 16:30:50 s504des205k0s k0s[3538]: #9: failed to get kubelet config from API: Unauthorized
abr 07 16:30:50 s504des205k0s k0s[3538]: #10: failed to get kubelet config from API: Unauthorized
abr 07 16:30:50 s504des205k0s systemd[1]: k0sworker.service: Main process exited, code=exited, status=1/FAILURE
abr 07 16:30:50 s504des205k0s systemd[1]: k0sworker.service: Failed with result 'exit-code'.

[root@s504des205k0s ~]# cat /etc/k0s/k0stoken
# overwritten by k0sctl after join
jsalgado78 commented 2 years ago

This issue looks like https://github.com/k0sproject/k0sctl/issues/178. I have three master nodes and two worker nodes, with an F5 load balancer providing HA as documented in https://docs.k0sproject.io/v1.23.3+k0s.1/high-availability/. A newly installed cluster works fine, but it doesn't work when restoring from backup.

jsalgado78 commented 2 years ago

I created a new token on a master node and reinstalled k0s on the worker nodes:

[root@s504des196k0s ~]# /usr/local/bin/k0s token create --role worker --expiry 20m0s ...

I copied the new token into the /etc/k0s/k0stoken file on a worker node, and that worker node is in Ready status after running:

[root@s504des204k0s ~]# /usr/local/bin/k0s install worker --token-file "/etc/k0s/k0stoken" --kubelet-extra-args="--node-ip=192.168.60.204" && systemctl start k0sworker

[root@s504des196k0s /var/lib/k0s/pki]# /usr/local/bin/k0s  kubectl --kubeconfig=/var/lib/k0s/pki/admin.conf get nodes
NAME                                 STATUS   ROLES    AGE   VERSION
s504des204k0s                           Ready    <none>   10d   v1.23.5+k0s
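
For the record, the manual workaround per worker looks roughly like this (a sketch based on the commands above; the --node-ip value is specific to each worker and /tmp/worker.token is just an example path):

# On a controller: create a fresh worker join token
/usr/local/bin/k0s token create --role worker --expiry 20m0s > /tmp/worker.token

# Copy /tmp/worker.token to the worker as /etc/k0s/k0stoken (e.g. with scp), then on the worker:
/usr/local/bin/k0s install worker --token-file "/etc/k0s/k0stoken" --kubelet-extra-args="--node-ip=192.168.60.204"
systemctl start k0sworker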
kke commented 2 years ago

Any ideas @k0sproject/k0s-core ?

jnummelin commented 2 years ago

My guess is that:

One option is to run reset only for the control plane nodes, with something like "k0sctl reset --controllers-only" (we do NOT currently have that baked in), since we don't back up workers at all anyway.
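
Until something like that exists, a controller-only reset could probably be done by hand, roughly like this (a sketch; run on each controller node directly, not through k0sctl):

# Stop the controller service and wipe the local k0s state; rebooting afterwards
# is recommended before restoring. Repeat on each controller node.
k0s stop
k0s reset
reboot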

kke commented 2 years ago

The bugs here being:

  1. apply after restore trusts the reported Ready state even though it may only start reflecting reality after several minutes
  2. the smoke test does not fail even though the cluster is broken after the restore

During restore, apply should try to figure out whether the workers have been reset and only reconfigure them if so. If they're intact, it needs some way to determine when the Ready state information is up to date.
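
For the "up to date" part, one possible signal would be the heartbeat timestamp on each node's Ready condition, e.g. something along these lines (a sketch, not what k0sctl currently does):

# Print each node's Ready status together with its last heartbeat time, so a stale
# "Ready" reported right after the restore can be told apart from a live one.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\t"}{.status.conditions[?(@.type=="Ready")].lastHeartbeatTime}{"\n"}{end}'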

As an enhancement, k0sctl reset could have a flag for only resetting the control plane.