kinvolk / lokomotive

🪦 DISCONTINUED Further Lokomotive development has been discontinued. Lokomotive is a 100% open-source, easy to use and secure Kubernetes distribution from the volks at Kinvolk
https://kinvolk.io/lokomotive-kubernetes/
Apache License 2.0

Cluster state backup and disaster recovery #313

Open iaguis opened 4 years ago

iaguis commented 4 years ago

While we do have a Velero component, it's only supported on AKS, and we don't have a well-documented backup/recovery story.

We should have one.

iaguis commented 4 years ago

Let's explore how to use Velero to back up stateful clusters on platforms other than AKS.

johananl commented 4 years ago

My thoughts on this:

ipochi commented 4 years ago

Questions:

  1. Will our approach to backup and disaster recovery cover only stateful workloads (entire PVs), i.e. use Velero + restic on supported platforms? Or must we also consider backup and restore of control plane data (etcd) and the relevant certificates?

  2. Is Velero the right choice going forward, or are there alternatives that fit Lokomotive and our supported platforms better?

  3. How far should this feature go: should the solution cover every currently supported storage component, plus node-local storage, on each of the supported platforms?

Below is the (incomplete) matrix we currently have in mind for backup and recovery of stateful workloads. Please add to it or correct me if I am wrong. I will update the ?? entries as I find out more.

| Platform | Storage | Backup/Recovery Tool |
|---|---|---|
| Packet | Node Storage | Velero + Restic plugin |
| Packet | OpenEBS | Velero + OpenEBS velero plugin |
| Packet | Rook-Ceph | Velero + Restic plugin |
| AWS | EBS Volumes | Velero + AWS EBS snapshot plugin |
| AWS ?? | Node storage ?? | ?? |
| Baremetal | Node Storage | ?? |
| Baremetal ?? | OpenEBS ?? | ?? |
| Baremetal ?? | Rook-Ceph ?? | ?? |
| AKS | ?? | Velero |
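To make the open cells easier to track, the matrix above could be kept as data. This is only a sketch: the names `BACKUP_MATRIX` and `open_questions` are made up for illustration, `None` stands for a cell still marked ??, and the AKS row is left out because its storage column is unknown.

```python
from typing import Dict, List, Optional, Tuple

# (platform, storage) -> backup tooling, transcribed from the matrix above.
# None marks a combination the thread still lists as "??".
BACKUP_MATRIX: Dict[Tuple[str, str], Optional[str]] = {
    ("Packet", "Node Storage"): "Velero + Restic plugin",
    ("Packet", "OpenEBS"): "Velero + OpenEBS velero plugin",
    ("Packet", "Rook-Ceph"): "Velero + Restic plugin",
    ("AWS", "EBS Volumes"): "Velero + AWS EBS snapshot plugin",
    ("AWS", "Node Storage"): None,
    ("Baremetal", "Node Storage"): None,
    ("Baremetal", "OpenEBS"): None,
    ("Baremetal", "Rook-Ceph"): None,
}


def open_questions(
    matrix: Dict[Tuple[str, str], Optional[str]]
) -> List[Tuple[str, str]]:
    """Return the (platform, storage) pairs that still need investigation."""
    return sorted(pair for pair, tool in matrix.items() if tool is None)


if __name__ == "__main__":
    for platform, storage in open_questions(BACKUP_MATRIX):
        print(f"TODO: {platform} / {storage}")
```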

As this is a roadmap item, let's break it into multiple smaller chunks (please add more if you see fit).

@iaguis @invidian @johananl @rata @surajssd

invidian commented 4 years ago

> Will our approach to backup and disaster recovery cover only stateful workloads (entire PVs), i.e. use Velero + restic on supported platforms? Or must we also consider backup and restore of control plane data (etcd) and the relevant certificates?

IMO PV backups would be a very good start, as the scope of the task would be smaller.
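For reference, a PV-only backup with Velero + restic could look roughly like the sketch below. It assumes Velero was installed with its restic integration enabled and that pod volumes are opted in through the `backup.velero.io/backup-volumes` annotation. The helper names, namespace, pod and volume names are placeholders, and the functions only build the command lines without running them.

```python
from typing import List

# Annotation Velero's restic integration reads to pick the pod volumes to back up.
RESTIC_ANNOTATION = "backup.velero.io/backup-volumes"


def annotate_pod_cmd(namespace: str, pod: str, volumes: List[str]) -> List[str]:
    """Build the kubectl command that opts the given pod volumes into restic backups."""
    return [
        "kubectl", "-n", namespace, "annotate", f"pod/{pod}",
        f"{RESTIC_ANNOTATION}={','.join(volumes)}",
    ]


def velero_backup_cmd(backup_name: str, namespace: str) -> List[str]:
    """Build the velero command that backs up a single namespace."""
    return [
        "velero", "backup", "create", backup_name,
        "--include-namespaces", namespace,
    ]


if __name__ == "__main__":
    # Placeholder names; print the commands instead of executing them.
    print(" ".join(annotate_pod_cmd("demo", "db-0", ["data"])))
    print(" ".join(velero_backup_cmd("demo-backup", "demo")))
```

The commands could be handed to `subprocess.run(cmd, check=True)` on a machine with `kubectl` and `velero` configured for the cluster.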

> Is Velero the right choice going forward, or are there alternatives that fit Lokomotive and our supported platforms better?

Velero seems reasonable, but I think we could also spend a bit of time testing the alternatives.

We could also consider backing up the Terraform state somehow (maybe as a secret on the cluster?), as the Terraform state contains important information like references to the cloud objects. I think it would be neat to store it in the cluster and be able to pull it if you have the right kubeconfig file.
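A minimal sketch of the "state as a secret" idea, assuming the state file fits within the 1 MiB size limit Kubernetes places on a Secret. The function name, the default Secret name and the namespace are made up for illustration. The code only builds the manifest and does not talk to a cluster.

```python
import base64
import json

# Kubernetes limits an individual Secret to 1 MiB of data.
MAX_SECRET_BYTES = 1024 * 1024


def build_tfstate_secret(
    state: bytes,
    name: str = "terraform-state",   # hypothetical Secret name
    namespace: str = "kube-system",  # hypothetical namespace
) -> dict:
    """Wrap raw Terraform state bytes in a v1 Secret manifest."""
    if len(state) > MAX_SECRET_BYTES:
        raise ValueError("Terraform state is too large to fit in a single Secret")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        # Secret values are base64-encoded strings.
        "data": {"terraform.tfstate": base64.b64encode(state).decode("ascii")},
    }


if __name__ == "__main__":
    manifest = build_tfstate_secret(b'{"version": 4}')
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be applied with `kubectl apply -f -`, and anyone holding a kubeconfig with read access to that Secret could then pull the state back.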

> Backup and restore strategies for control plane data (etcd), relevant certificates etc.

IMO, as a general strategy, we should try to avoid backing up (or restoring) things that can be regenerated. One example is cluster certificates: we should have a rotation mechanism for them, and that should be sufficient for replacing all certificates with new ones.

Velero backups of manifests are valuable, as we cannot guarantee that all user workloads are reproducible.

However, I think we could also consider implementing etcd snapshots and recovery, as recovering from an etcd snapshot in an automated way would be much simpler than recovering from a Velero backup.
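As a rough illustration of the etcd snapshot idea, the sketch below builds the `etcdctl snapshot save` and `etcdctl snapshot restore` command lines. The endpoint and certificate paths are placeholders, not Lokomotive's actual paths, and the helper names are made up. Nothing is executed.

```python
from typing import List


def etcd_snapshot_save_cmd(
    snapshot_path: str,
    endpoint: str = "https://127.0.0.1:2379",  # placeholder endpoint
    cacert: str = "/path/to/ca.crt",           # placeholder certificate paths
    cert: str = "/path/to/client.crt",
    key: str = "/path/to/client.key",
) -> List[str]:
    """Build the etcdctl command that saves a snapshot of a running etcd member."""
    return [
        "etcdctl", "snapshot", "save", snapshot_path,
        f"--endpoints={endpoint}",
        f"--cacert={cacert}",
        f"--cert={cert}",
        f"--key={key}",
    ]


def etcd_snapshot_restore_cmd(snapshot_path: str, data_dir: str) -> List[str]:
    """Build the etcdctl command that restores a snapshot into a fresh data directory."""
    return ["etcdctl", "snapshot", "restore", snapshot_path, f"--data-dir={data_dir}"]


if __name__ == "__main__":
    print(" ".join(etcd_snapshot_save_cmd("/tmp/etcd-snapshot.db")))
    print(" ".join(etcd_snapshot_restore_cmd("/tmp/etcd-snapshot.db", "/tmp/etcd-restored")))
```

Older `etcdctl` releases need `ETCDCTL_API=3` in the environment for the `snapshot` subcommands, so an automated job would probably set it when running these commands.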

surajssd commented 4 years ago

We decided to break this down into smaller doable tasks.

ipochi commented 4 years ago

> We could also consider backing up the Terraform state somehow (maybe as a secret on the cluster?), as the Terraform state contains important information like references to the cloud objects. I think it would be neat to store it in the cluster and be able to pull it if you have the right kubeconfig file.

Our recommended option for storing Terraform state is an S3 bucket. I don't understand the need to store it in the cluster as well. Even if we did, how would this affect disaster recovery of the cluster?

ipochi commented 4 years ago

Created issues to track this roadmap item as smaller actionable chunks.

#800 - The first action item is to investigate whether to keep using Velero as the officially supported component for backup and restore of Lokomotive clusters. This action item is considered done and is included mostly for tracking purposes.

The second action item is to investigate Velero on the supported platforms for different combinations of storage.

#797 - Packet

#798 - Baremetal

#799 - AWS

Three separate issues were created because the storage combinations differ (node-local storage, OpenEBS (cStor/LocalPV), Rook/Rook-Ceph).