Closed — schlichtanders closed this issue 1 year ago
@kube-hetzner/core @ifeulner Some of you got this functionality to work, correct?
Looking further, there is an even larger BUG, namely that the k3s_token
cannot be set when constructing a cluster from scratch. I am still including it here, so the theme is now rather "kube-hetzner disaster recovery via k3s etcd (S3) backups" in general.
The documentation states
The server token can be used to join both server and agent nodes to the cluster. It cannot be changed once the cluster has been created
which apparently means that no one can currently recreate a cluster from a backup using kube-hetzner (or have I misunderstood something?).
@schlichtanders I can definitely confirm from memory that many of our users have been able to recreate the cluster from a backup. Now that I think about this more, all you need to do to get the token is to cat the /etc/rancher/k3s/config.yaml file of any agent node after SSH'ing into it.
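As a sketch of that step (the file contents and token value below are made up; on a real cluster you would first SSH into an agent node and read /etc/rancher/k3s/config.yaml there), the token can be pulled out of the agent config like this:

```shell
# Stand-in for /etc/rancher/k3s/config.yaml on an agent node
# (in practice fetched via: ssh root@<agent-ip> cat /etc/rancher/k3s/config.yaml).
cat > /tmp/k3s-config.yaml <<'EOF'
server: https://10.0.0.2:6443
token: K10abc123::server:s3cr3t
EOF

# Extract just the token value from the YAML:
awk '/^token:/ {print $2}' /tmp/k3s-config.yaml
# → K10abc123::server:s3cr3t
```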
@schlichtanders I think you are right, especially after reading the latest k3s documentation (I am quite sure that warning was not there in earlier versions of the documentation). I did some restore tests some months ago, but always with some work "by hand", i.e. copying the token from the initial server and also doing the restore via the CLI, not via Terraform.
To do it via kube-hetzner, at least the token needs to be saved somewhere (secret output?) and used when restoring. But cluster restoration will always need some hands-on work, as the situation can differ quite a bit from case to case.
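A minimal sketch of the "save the token somewhere" idea (the path and token value are placeholders; kube-hetzner itself could instead expose the token as a sensitive Terraform output):

```shell
# Keep a copy of the cluster token outside the cluster, so that a restore
# remains possible even after total node loss. Token value is a placeholder.
mkdir -p "$HOME/.kube-hetzner-backup"
printf '%s\n' 'K10abc123::server:s3cr3t' > "$HOME/.kube-hetzner-backup/k3s-token"
chmod 600 "$HOME/.kube-hetzner-backup/k3s-token"   # the token grants cluster-join rights
cat "$HOME/.kube-hetzner-backup/k3s-token"
# → K10abc123::server:s3cr3t
```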
So what do you think would be the best the project should do here?
@mysticaltech yes, this works if you know it. But of course it is not enough if your cluster has crashed completely and you are now trying to restore, because the token is gone with the cluster. Hence this bug report: the token should also be stored somewhere.
The second point, however, seems crucial for everyone, including those who got the token from the control-plane server: kube-hetzner currently has no way to specify the token manually, hence it is impossible to recreate a crashed cluster from scratch using the backup. @ifeulner how do you create a cluster with the old token?
for cluster restoration anyways there is some hands-on work needed...
@ifeulner can you share a script or tutorial or something which is working for you for restoration?
EDIT: I was able to restore ConfigMaps and such; however, the cluster is still dysfunctional. Nothing can be deployed. The typical error message is:
```
0/4 nodes are available: 1 node(s) had untolerated taint {node.cloudprovider.kubernetes.io/uninitialized: true}, 3 node(s) had untolerated taint {node.kubernetes.io/unreachable: }. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling.
```
This means the cloud-controller-manager has not initialized one node and the other three are unreachable, so no node in the cluster can accept pods.
I guess this is a bit off-topic, but it would be really helpful if you could share information about how you do the restoration; it looks very difficult to me. (My motivation for all this is to make changes to the cluster for which the cluster needs to be recreated, but I don't want to lose configuration. It would be so awesome if I could just recreate the cluster.)
I managed to do a full restoration of the cluster! :partying_face: I will open a pull request with some small necessary additions soon.
@schlichtanders Looking forward to trying your approach. Thanks!
The current PR has been merged and released in v2.6.2 🚀
Keeping this issue open until all related objectives have been achieved.
@schlichtanders Closing for now, let me know if it needs reopening, but I think I understood from our other exchanges that the restorations are working now.
yes, restoration finally works :slightly_smiling_face: (since yesterday evening). Documentation of the restoration process is going to be added today as part of this pull request: https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/pull/968
Description
Summary
This is the first time I am trying the etcd backup recovery, and to my surprise, the k3s token is needed for restoration: https://docs.k3s.io/datastore/backup-restore
This token, however, is a randomly created string that is never stored on the user side. Hence, if the cluster crashes, there is no way to restore it even if the etcd backup was set up correctly (at least this is my current understanding).
PoC
I tried to restore a backup as a test using k3s on Docker.
It outputs
Of course, to replicate this on another machine, you need a backup (on S3) and must set the environment variables (or the server args) accordingly.
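For reference, a restore invocation along the lines of the k3s backup/restore docs looks roughly like the following. This is a hedged sketch: snapshot name, bucket, and credentials are all placeholders, and the command is only staged into a script here, since it has to run on a real (stopped) server node:

```shell
# Stage the restore command; on a real node you would run it directly after
# stopping k3s. Every <...> value is a placeholder.
cat > /tmp/k3s-restore.sh <<'EOF'
#!/bin/sh
k3s server \
  --cluster-reset \
  --cluster-reset-restore-path=<snapshot-name> \
  --etcd-s3 \
  --etcd-s3-bucket=<bucket> \
  --etcd-s3-access-key=<access-key> \
  --etcd-s3-secret-key=<secret-key> \
  --token=<token-saved-before-the-crash>
EOF
chmod +x /tmp/k3s-restore.sh
grep -c -- '--cluster-reset' /tmp/k3s-restore.sh   # sanity check: prints 2
```

The key point for this issue is the `--token` argument: without the original token, the bootstrap data in the snapshot cannot be used.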
Impact
etcd backups do not help in the case of a whole-cluster loss
Kube.tf file
Screenshots
No response
Platform
Linux