iaguis commented 4 years ago

While we do have a Velero component, it is only supported on AKS, and we don't have a well-documented backup/recovery story.

We should have one.

iaguis commented 4 years ago

Let's explore how to use Velero to back up stateful clusters on platforms other than AKS.

johananl commented 4 years ago

My thoughts on this:

I think we should use tools such as Velero only when we really need to back up data, i.e. only when there are stateful workloads deployed on the cluster. Better not to treat clusters as "snowflakes" if we can avoid it. In stateless clusters it may be enough to re-deploy a cluster from the version-controlled config and re-apply the YAML files for the workloads, ideally using a CD system if the customer is using one.
The processes will likely be different for different platforms and components. For example, on AWS we can leverage EBS snapshots. On Packet the user may be using OpenEBS and/or Rook/Ceph.
We should document the process of backing up and restoring. A how-to document (or multiple ones) would be ideal for this.

ipochi commented 4 years ago

Questions:

Will our approach to backup and disaster recovery cover only stateful workloads (entire PVs), i.e. use Velero + restic on supported platforms? Or must we also consider backup and restore of control plane data (etcd) and the relevant certificates?
Should we investigate whether Velero is the right choice moving forward, or whether there are alternatives that fit Lokomotive and our supported platforms better?
How far should this feature go in covering our currently supported storage components, plus node-local storage, across the supported platforms?

Below is the (incomplete) matrix we are currently considering for backup and recovery of stateful workloads. Please add to it or correct me if I am wrong. I will update the ?? entries as I find out more.

Platform	Storage	Backup Recovery Tool
Packet	Node Storage	Velero + Restic plugin
Packet	OpenEBS	Velero + OpenEBS velero plugin
Packet	Rook-Ceph	Velero + Restic plugin
AWS	EBS Volumes	Velero + AWS EBS snapshot plugin
AWS	?? Node storage ??	??
Baremetal	Node Storage	??
Baremetal	?? OpenEBS ??	??
Baremetal	?? Rook-Ceph ??	??
AKS	??	Velero
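
To make the open cells easier to track, the matrix can also be kept as a small lookup table. This is only a sketch: the names `BACKUP_MATRIX`, `backup_tool` and `open_combinations` are hypothetical and not part of Lokomotive, and the AKS row is left out because its storage column is still undecided.

```python
from typing import List, Optional, Tuple

# Rows copied from the matrix above. None marks a cell still listed as "??".
# The AKS row is omitted because its storage column is still open.
BACKUP_MATRIX = {
    ("Packet", "Node Storage"): "Velero + Restic plugin",
    ("Packet", "OpenEBS"): "Velero + OpenEBS velero plugin",
    ("Packet", "Rook-Ceph"): "Velero + Restic plugin",
    ("AWS", "EBS Volumes"): "Velero + AWS EBS snapshot plugin",
    ("AWS", "Node Storage"): None,
    ("Baremetal", "Node Storage"): None,
    ("Baremetal", "OpenEBS"): None,
    ("Baremetal", "Rook-Ceph"): None,
}

# Case-insensitive view of the same data, used for lookups.
_NORMALIZED = {
    (platform.lower(), storage.lower()): tool
    for (platform, storage), tool in BACKUP_MATRIX.items()
}


def backup_tool(platform: str, storage: str) -> Optional[str]:
    """Return the proposed tool, or None if the cell is undecided or unlisted."""
    key = (platform.strip().lower(), storage.strip().lower())
    return _NORMALIZED.get(key)


def open_combinations() -> List[Tuple[str, str]]:
    """List the (platform, storage) pairs that still need investigation."""
    return [key for key, tool in BACKUP_MATRIX.items() if tool is None]


if __name__ == "__main__":
    print(backup_tool("packet", "openebs"))
    for platform, storage in open_combinations():
        print(f"TODO: {platform} / {storage}")
```

Listing the undecided pairs this way gives one investigation item per open cell.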

As this is a roadmap item, let's break it into multiple smaller chunks (please add more if you see fit):

[x] Investigate Velero alternatives (if not yet decided)
[ ] Backup and restore strategies for control plane data (etcd), relevant certificates, etc.
[ ] Investigate Velero with Lokomotive supported platforms.
[ ] Investigate limitations of Velero with/without restic plugin.
[ ] Packet has a Kubernetes CSI driver; investigate what it would take to add the VolumeSnapshotter API to it, so that we can use Velero on Packet with EBS volumes. I suspect we might need the Packet CCM for this, which may present additional challenges of its own.
[ ] Update Velero support for AKS.
[ ] Add support for Velero for Packet.
[ ] Add support for Velero for AWS.
[ ] Add support for Velero for Baremetal.
[ ] Provide complete set of documentation.

@iaguis @invidian @johananl @rata @surajssd

invidian commented 4 years ago

Will our approach to backup and disaster recovery cover only stateful workloads (entire PVs), i.e. use Velero + restic on supported platforms? Or must we also consider backup and restore of control plane data (etcd) and the relevant certificates?

IMO PV backups would be a very good start, as the scope of the task would be smaller.

Should we investigate whether Velero is the right choice moving forward, or whether there are alternatives that fit Lokomotive and our supported platforms better?

Velero seems reasonable, but I think we could also spend a bit of time testing the alternatives.

We could also consider backing up the Terraform state somehow (maybe as a Secret on the cluster?), as it contains important information such as references to the cloud objects. I think it would be neat to store it in the cluster and be able to pull it if you have the right kubeconfig file.
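
To illustrate the idea, here is a minimal sketch of wrapping a Terraform state file in a Kubernetes Secret manifest. The function names, the Secret name `terraform-state`, the `kube-system` namespace and the `terraform.tfstate` key are all assumptions made up for this example; nothing like this exists in Lokomotive today.

```python
import base64
import json

# Kubernetes limits an individual Secret to 1 MiB. The check below is
# approximate because it ignores the size of the metadata.
MAX_SECRET_BYTES = 1024 * 1024


def terraform_state_secret(
    state_bytes: bytes,
    name: str = "terraform-state",    # hypothetical Secret name
    namespace: str = "kube-system",   # hypothetical namespace
) -> dict:
    """Build a Secret manifest holding the base64-encoded Terraform state."""
    encoded = base64.b64encode(state_bytes).decode("ascii")
    if len(encoded) > MAX_SECRET_BYTES:
        raise ValueError("Terraform state is too large to fit in one Secret")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {"terraform.tfstate": encoded},
    }


def extract_state(secret: dict) -> bytes:
    """Recover the raw Terraform state from a Secret built as above."""
    return base64.b64decode(secret["data"]["terraform.tfstate"])


if __name__ == "__main__":
    manifest = terraform_state_secret(b'{"version": 4, "resources": []}')
    # kubectl accepts JSON manifests, so this output could be piped to
    # `kubectl apply -f -`.
    print(json.dumps(manifest, indent=2))
```

Note that a large state file may not fit within the Secret size limit, in which case it would have to be split or compressed first.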

Backup and restore strategies for control plane data (etcd), relevant certificates, etc.

IMO, as a general strategy, we should avoid backing up (or restoring) things that can be regenerated. One example is cluster certificates: we should have a rotation mechanism for them, which should be sufficient to replace all certificates with new ones.

Velero backups of manifests are valuable, as we cannot guarantee that all user workloads are reproducible.

However, I think we could also consider implementing etcd snapshots and recovery, as recovering from an etcd snapshot in an automated way would be much simpler than recovering from a Velero backup.
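
As a rough sketch of what an automated etcd snapshot could look like, the helpers below only build the `etcdctl` command lines; running them is left to the caller. The function names, endpoint and all file paths are hypothetical, and the certificate locations in a real Lokomotive cluster may differ.

```python
import shlex
from typing import List


def etcd_snapshot_save_cmd(
    snapshot_path: str,
    endpoint: str,
    cacert: str,
    cert: str,
    key: str,
) -> List[str]:
    """Build the argv for `etcdctl snapshot save` against a TLS endpoint.

    Older etcdctl releases need ETCDCTL_API=3 in the environment for the
    v3 `snapshot` command to be available.
    """
    return [
        "etcdctl",
        f"--endpoints={endpoint}",
        f"--cacert={cacert}",
        f"--cert={cert}",
        f"--key={key}",
        "snapshot",
        "save",
        snapshot_path,
    ]


def etcd_snapshot_restore_cmd(snapshot_path: str, data_dir: str) -> List[str]:
    """Build the argv for restoring a snapshot into a fresh data directory."""
    return ["etcdctl", "snapshot", "restore", snapshot_path, f"--data-dir={data_dir}"]


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    save = etcd_snapshot_save_cmd(
        snapshot_path="/var/backups/etcd-snapshot.db",
        endpoint="https://127.0.0.1:2379",
        cacert="/etc/ssl/etcd/ca.crt",
        cert="/etc/ssl/etcd/client.crt",
        key="/etc/ssl/etcd/client.key",
    )
    print(shlex.join(save))
    print(shlex.join(etcd_snapshot_restore_cmd(
        "/var/backups/etcd-snapshot.db", "/var/lib/etcd-restored")))
```

The resulting lists can be passed to `subprocess.run` from a periodic job, with the snapshot file then copied off the node.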

surajssd commented 4 years ago

We decided to break this down into smaller doable tasks.

ipochi commented 4 years ago

We could also consider backing up the Terraform state somehow (maybe as a Secret on the cluster?), as it contains important information such as references to the cloud objects. I think it would be neat to store it in the cluster and be able to pull it if you have the right kubeconfig file.

Our recommended option for storing Terraform state is an S3 bucket. I don't understand the need to store it in the cluster as well. Even if we did, how would this affect disaster recovery of the cluster?

ipochi commented 4 years ago

Created issues to track this roadmap item as smaller actionable chunks.

800 - The first actionable item is to investigate the continued use of Velero as an officially supported component for backup and restore of Lokomotive clusters. This item is considered done and is included mostly for tracking purposes.

The second actionable item is to investigate Velero on the supported platforms for different combinations of storage.

797 - Packet

798 - Baremetal

799 - AWS

Three separate issues were created because of the different storage combinations (node-local storage, OpenEBS (cStor/LocalPV), Rook/Rook-Ceph).

kinvolk / lokomotive

Cluster state backup and disaster recovery #313
