Open · jsalgado78 opened 2 years ago
This is pretty much exactly what the CI backup-restore test does.
I wonder if the token somehow expired between the backup and restore 🤔
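For what it's worth, a k0s join token is a base64-encoded, gzip-compressed kubeconfig, so one rough way to check the expiry theory is to decode the token and compare its bootstrap token id against the bootstrap token secrets on a controller. A sketch only, and it assumes the token file still holds the original token rather than k0sctl's placeholder:

# On a worker: decode the join token to see the embedded kubeconfig (and its bootstrap token id)
base64 -d < /etc/k0s/k0stoken | gunzip

# On a controller: list bootstrap token secrets; the matching one is named
# bootstrap-token-<token-id> and carries an "expiration" field
/usr/local/bin/k0s kubectl -n kube-system get secrets --field-selector type=bootstrap.kubernetes.io/token -o yaml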
The backup was made only a few minutes before the restore process. I was testing whether the restore process works, but it doesn't.
I've also tried deploying a new, empty cluster, with the same result after reset and restore from the backup.
I've made a backup by running:
./k0sctl backup -c k0s-des-airgap.yml --disable-telemetry --disable-upgrade-check
I've reset all nodes with the command:
./k0sctl reset -c k0s-des-airgap.yml --disable-telemetry
I rebooted all cluster nodes after running the k0sctl reset command.
I've restored from the backup with the command:
./k0sctl apply --restore-from k0s_backup_1649341000.tar.gz -c k0s-des-airgap.yml --disable-telemetry --disable-upgrade-check
The restore process appears to finish fine: ...
INFO [ssh] s504des204k0s:22: waiting for node to become ready
INFO [ssh] s504des205k0s:22: starting service
INFO [ssh] s504des205k0s:22: waiting for node to become ready
INFO ==> Running phase: Run After Apply Hooks
INFO ==> Running phase: Disconnect from hosts
INFO ==> Finished in 1m20s
INFO k0s cluster version 1.23.5+k0s.0 is now installed
INFO Tip: To access the cluster you can now fetch the admin kubeconfig using:
INFO k0sctl kubeconfig
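After the restore reports success, a quick sanity check from one of the controllers might look like the following (just a sketch; it only confirms what the restored control plane believes the cluster looks like):

# Check which Node objects came back from the backup and their reported status
/usr/local/bin/k0s kubectl get nodes -o wide

# Verify the etcd membership was restored across the three controllers
/usr/local/bin/k0s etcd member-list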
But the worker nodes are in NotReady status and the /etc/k0s/k0stoken files are empty:
$ ./kubectl get nodes
NAME STATUS ROLES AGE VERSION
s504des204k0s NotReady
[root@s504des204k0s ~]# journalctl -u k0sworker.service | tail -5
abr 07 16:30:50 s504des204k0s k0s[3551]: #8: failed to get kubelet config from API: Unauthorized
abr 07 16:30:50 s504des204k0s k0s[3551]: #9: failed to get kubelet config from API: Unauthorized
abr 07 16:30:50 s504des204k0s k0s[3551]: #10: failed to get kubelet config from API: Unauthorized
abr 07 16:30:50 s504des204k0s systemd[1]: k0sworker.service: Main process exited, code=exited, status=1/FAILURE
abr 07 16:30:50 s504des204k0s systemd[1]: k0sworker.service: Failed with result 'exit-code'.
[root@s504des204k0s ~]# cat /etc/k0s/k0stoken
# overwritten by k0sctl after join
[root@s504des205k0s ~]# journalctl -u k0sworker|tail -5
abr 07 16:30:50 s504des205k0s k0s[3538]: #8: failed to get kubelet config from API: Unauthorized
abr 07 16:30:50 s504des205k0s k0s[3538]: #9: failed to get kubelet config from API: Unauthorized
abr 07 16:30:50 s504des205k0s k0s[3538]: #10: failed to get kubelet config from API: Unauthorized
abr 07 16:30:50 s504des205k0s systemd[1]: k0sworker.service: Main process exited, code=exited, status=1/FAILURE
abr 07 16:30:50 s504des205k0s systemd[1]: k0sworker.service: Failed with result 'exit-code'.
[root@s504des205k0s ~]# cat /etc/k0s/k0stoken
# overwritten by k0sctl after join
This issue looks like https://github.com/k0sproject/k0sctl/issues/178. I have three master nodes and two worker nodes. There's an F5 load balancer providing HA, as documented in https://docs.k0sproject.io/v1.23.3+k0s.1/high-availability/. A newly installed cluster works fine, but it doesn't work when restoring from a backup.
I've created a new token on a master node and installed k0s on the worker nodes:
[root@s504des196k0s ~]# /usr/local/bin/k0s token create --role worker --expiry 20m0s ...
I've copied the new token into the /etc/k0s/k0stoken file on a worker node, and that worker node goes into Ready status after running:
[root@s504des204k0s ~]# /usr/local/bin/k0s install worker --token-file "/etc/k0s/k0stoken" --kubelet-extra-args="--node-ip=192.168.60.204" && systemctl start k0sworker
[root@s504des196k0s /var/lib/k0s/pki]# /usr/local/bin/k0s kubectl --kubeconfig=/var/lib/k0s/pki/admin.conf get nodes
NAME STATUS ROLES AGE VERSION
s504des204k0s Ready <none> 10d v1.23.5+k0s
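So, in short, the manual workaround after the restore seems to be roughly this (a sketch based on the commands above; the expiry and hostnames are only examples, and it assumes the restored k0sworker service still points at --token-file /etc/k0s/k0stoken):

# On one controller: create a fresh worker join token
/usr/local/bin/k0s token create --role worker --expiry 20m0s > worker.token

# For each worker: replace the stale token and restart the worker service
scp worker.token root@s504des204k0s:/etc/k0s/k0stoken
ssh root@s504des204k0s systemctl restart k0sworker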
Any ideas @k0sproject/k0s-core ?
My guess is that the restore brings back the old Node objects (they probably were in "Ready" state at the time of the backup?).
One option is to run reset only for the control plane nodes, with something like "k0sctl reset --controllers-only" (we do NOT currently have that baked in), as we don't back up the workers at all anyway.
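Until something like that flag exists, one could approximate it by hand (a sketch, not a supported k0sctl flow) by resetting only the controller hosts and leaving the workers untouched:

# On each controller host only (NOT on the workers):
k0s stop
k0s reset
reboot   # a reboot is recommended after k0s reset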
The bugs here being:
- During restore, apply should try to figure out whether the workers have been reset and only reconfigure them if so (a rough sketch of such a check follows below). If they're intact, it needs some way to determine when the Ready state info is up to date.
- As an enhancement, k0sctl reset could have a flag for only resetting the control plane.
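As an illustration of the first point, the per-worker check that apply might perform could look something like this (purely hypothetical shell, not k0sctl code; the paths are assumed from a default k0s worker install):

# Heuristic: a worker that was never reset still has its kubelet kubeconfig,
# plus the marker k0sctl writes into the token file after a successful join.
if [ -f /var/lib/k0s/kubelet.conf ] && grep -q "overwritten by k0sctl after join" /etc/k0s/k0stoken 2>/dev/null; then
  echo "worker state intact: skip re-join and wait for it to report Ready"
else
  echo "worker was reset: it needs a fresh join token before starting k0sworker"
fi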
I'm getting this error on each worker server after restoring from the backup:
$ journalctl -u k0sworker.service
level=info msg="Starting kubelet"
level=warning msg="failed to get initial kubelet config with join token: failed to get kubelet config from API: Unauthorized"
level=warning msg="failed to get initial kubelet config with join token: failed to get kubelet config from API: Unauthorized"
...
$ k get nodes
NAME STATUS ROLES AGE VERSION
s504des204k0s NotReady 9d v1.23.5+k0s
s504des205k0s NotReady 9d v1.23.5+k0s
I made a cluster backup using this command: k0sctl backup -c k0s-des-airgap.yml --disable-telemetry --disable-upgrade-check
I reset all nodes with k0sctl reset -c k0s-des-airgap.yml --disable-telemetry and then tried to restore with this command:
k0sctl apply --restore-from k0s_backup_1649314850.tar.gz -c k0s-des-airgap.yml --disable-telemetry
I have three master nodes and two worker nodes.
I'm running k0sctl v0.12.6 and k0s v1.23.5+k0s with an airgap config.
Thanks.