Open iaguis opened 4 years ago
Let's explore how to use Velero to back up stateful clusters on platforms other than AKS.
My thoughts on this:
Questions:
Will our approach to backup and disaster recovery include only stateful workloads (entire PVs), i.e. use Velero + restic for supported platforms? Or must we also consider backup and restore of control plane data (etcd) and relevant certificates?
Investigate if Velero is the right choice moving forward, or are there any alternatives that fit Lokomotive and our supported platforms better?
What level of support are we aiming for with this feature, given our currently supported storage components plus node-local storage, combined with the supported platforms?
Below is the (incomplete) matrix we are currently considering for backup and recovery of stateful workloads. Please add to it or correct me if I am wrong. I will update it as I find out more.
Platform | Storage | Backup Recovery Tool |
---|---|---|
Packet | Node Storage | Velero + Restic plugin |
Packet | OpenEBS | Velero + OpenEBS velero plugin |
Packet | Rook-Ceph | Velero + Restic plugin |
AWS | EBS Volumes | Velero + AWS EBS snapshot plugin |
AWS | ?? Node storage ?? | ?? |
Baremetal | Node Storage | ?? |
Baremetal | ?? OpenEBS ?? | ?? |
Baremetal | ?? Rook-Ceph ?? | ?? |
AKS | ?? | Velero |
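To make the open cells easier to track, the matrix above could be kept as a small lookup table. This is only a sketch: the names `BACKUP_MATRIX`, `backup_tool`, and `open_questions` are hypothetical, `None` stands for every cell still marked `??`, and the AKS row is left out because its storage column is undecided.

```python
from typing import List, Optional, Tuple

# Matrix from the table above. None marks a tool that is still "??".
# Storage entries that the table itself marks with "??" are tentative.
BACKUP_MATRIX = {
    ("Packet", "Node Storage"): "Velero + Restic plugin",
    ("Packet", "OpenEBS"): "Velero + OpenEBS velero plugin",
    ("Packet", "Rook-Ceph"): "Velero + Restic plugin",
    ("AWS", "EBS Volumes"): "Velero + AWS EBS snapshot plugin",
    ("AWS", "Node Storage"): None,
    ("Baremetal", "Node Storage"): None,
    ("Baremetal", "OpenEBS"): None,
    ("Baremetal", "Rook-Ceph"): None,
}


def backup_tool(platform: str, storage: str) -> Optional[str]:
    """Return the proposed tool, or None if undecided or not in the matrix."""
    return BACKUP_MATRIX.get((platform, storage))


def open_questions() -> List[Tuple[str, str]]:
    """List the (platform, storage) combinations that still need investigation."""
    return sorted(key for key, tool in BACKUP_MATRIX.items() if tool is None)


if __name__ == "__main__":
    for platform, storage in open_questions():
        print(f"undecided: {platform} / {storage}")
```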
As this is a roadmap item, let's break it into multiple smaller chunks (please add more if you see fit).
@iaguis @invidian @johananl @rata @surajssd
Will our approach to backup and disaster recovery include only stateful workloads (entire PVs), i.e. use Velero + restic for supported platforms? Or must we also consider backup and restore of control plane data (etcd) and relevant certificates?
IMO PV backups would be a very good start, as the scope of the task would be smaller.
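For context, a rough sketch of what opting a pod's volumes into a Velero + restic backup could look like. It assumes Velero's restic integration with its opt-in pod annotation `backup.velero.io/backup-volumes`; the helper name and the volume names are hypothetical, and the code only builds the patch body without talking to a cluster.

```python
import json
from typing import Dict, List


def restic_annotation_patch(volume_names: List[str]) -> Dict:
    """Build a merge-patch body that marks pod volumes for restic backup.

    The annotation value is a comma-separated list of the pod's volume names.
    """
    if not volume_names:
        raise ValueError("at least one volume name is required")
    return {
        "metadata": {
            "annotations": {
                "backup.velero.io/backup-volumes": ",".join(volume_names)
            }
        }
    }


if __name__ == "__main__":
    # "data" and "logs" are hypothetical volume names.
    print(json.dumps(restic_annotation_patch(["data", "logs"]), indent=2))
```

The resulting JSON could be applied to a pod (or a pod template) with a merge patch; which volumes to list would depend on the workload.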
Investigate if Velero is the right choice moving forward, or are there any alternatives that fit Lokomotive and our supported platforms better?
Velero seems reasonable, but I think we could also spend a bit of time testing the alternatives.
We could also consider backing up the Terraform state somehow (maybe as a Secret on the cluster?), as it contains important information such as references to the cloud objects. I think it would be neat to store it in the cluster and be able to pull it if you have the right kubeconfig file.
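A minimal sketch of the Secret idea, under stated assumptions: the Secret name, namespace, and data key below are hypothetical, and the size check reflects the 1 MiB per-Secret limit that Kubernetes documents. The code only builds the manifest; applying it is left to whatever client is in use.

```python
import base64
import json
from typing import Dict

# Kubernetes documents a 1 MiB size limit for an individual Secret.
MAX_SECRET_BYTES = 1024 * 1024


def tfstate_secret(
    state_bytes: bytes,
    name: str = "terraform-state",   # hypothetical Secret name
    namespace: str = "kube-system",  # hypothetical namespace
) -> Dict:
    """Wrap raw Terraform state in a Kubernetes Secret manifest."""
    if len(state_bytes) > MAX_SECRET_BYTES:
        raise ValueError("Terraform state is too large to fit in a single Secret")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        # Secret data values must be base64-encoded strings.
        "data": {
            "terraform.tfstate": base64.b64encode(state_bytes).decode("ascii")
        },
    }


if __name__ == "__main__":
    sample_state = b'{"version": 4}'  # stand-in for a real state file
    print(json.dumps(tfstate_secret(sample_state), indent=2))
```

Anyone with a kubeconfig that can read the Secret could then decode `terraform.tfstate` to recover the state, which is also the main security consideration with this approach.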
Backup and restore strategies for control plane data (etcd), relevant certificates, etc.
IMO, as a general strategy, we should try to avoid backing up (or restoring) things that can be re-generated. One example here would be cluster certificates, for which we should have a rotation mechanism; that should be sufficient for replacing all certificates with new ones.
Velero manifest backups are valuable, as we cannot guarantee that all user workloads are reproducible.
However, I think we could also consider implementing etcd snapshots and recovery, as it would be much simpler to recover from an etcd snapshot than from a Velero backup in an automated way.
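As a rough illustration of the etcd snapshot idea, the sketch below assembles an `etcdctl snapshot save` invocation with the usual TLS flags. The endpoint and all file paths are hypothetical placeholders, and the helper only builds the command and environment; it does not run anything.

```python
import shlex
from typing import Dict, List, Tuple


def etcd_snapshot_command(
    snapshot_path: str,
    endpoint: str,
    cacert: str,
    cert: str,
    key: str,
) -> Tuple[List[str], Dict[str, str]]:
    """Return the argv and extra environment for taking an etcd v3 snapshot."""
    argv = [
        "etcdctl",
        f"--endpoints={endpoint}",
        f"--cacert={cacert}",
        f"--cert={cert}",
        f"--key={key}",
        "snapshot",
        "save",
        snapshot_path,
    ]
    # Older etcdctl releases need the v3 API selected explicitly.
    env = {"ETCDCTL_API": "3"}
    return argv, env


if __name__ == "__main__":
    # All values below are hypothetical; adjust to the actual cluster layout.
    argv, env = etcd_snapshot_command(
        snapshot_path="/var/backups/etcd.db",
        endpoint="https://127.0.0.1:2379",
        cacert="/etc/ssl/etcd/ca.crt",
        cert="/etc/ssl/etcd/client.crt",
        key="/etc/ssl/etcd/client.key",
    )
    print("ETCDCTL_API=3", shlex.join(argv))
```

An automated recovery flow would pair this with `etcdctl snapshot restore`, but how that fits into Lokomotive's control plane bootstrap is still to be investigated.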
We decided to break this down into smaller doable tasks.
We could also consider backing up the Terraform state somehow (maybe as a Secret on the cluster?), as it contains important information such as references to the cloud objects. I think it would be neat to store it in the cluster and be able to pull it if you have the right kubeconfig file.
Our recommended option for storing Terraform state is an S3 bucket. I don't understand the need to store it in the cluster as well. Even if we did, how would this affect the disaster recovery of the cluster?
Created issues to track this roadmap item as smaller actionable chunks.
The second actionable item is to investigate Velero on the supported platforms for different storage combinations.
Three separate issues were created because of the different storage combinations (node-local storage, OpenEBS (cStor/LocalPV), Rook/Rook-Ceph).
While we do have a Velero component, it's only supported on AKS, and we don't have a well-documented backup/recovery story.
We should have one.