ministryofjustice / cloud-platform

Documentation on the MoJ cloud platform
MIT License

etcd backups #464

Closed alkar closed 5 years ago

alkar commented 6 years ago

Background

Kubernetes cluster state is stored in etcd. In our setup, etcd stores its data on dedicated EBS volumes (one per master), which are reused when master nodes are replaced.

Proposed user journey

Approach

Taking backups

We want to take EBS snapshots every n hours and retain them for m days. Pick whatever values make sense; we should be able to change them easily in the future.
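A minimal sketch of the schedule and retention logic, independent of how it ends up being deployed. The default values and both function names are hypothetical; the snapshot dicts are assumed to use the same `SnapshotId` and `StartTime` keys that EC2's DescribeSnapshots returns.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical defaults; the issue deliberately leaves n and m open.
SNAPSHOT_INTERVAL_HOURS = 6  # "n"
RETENTION_DAYS = 7           # "m"


def snapshot_due(last_start_time, now, interval_hours=SNAPSHOT_INTERVAL_HOURS):
    """Return True if there is no snapshot yet or the last one is at least n hours old."""
    if last_start_time is None:
        return True
    return now - last_start_time >= timedelta(hours=interval_hours)


def expired_snapshots(snapshots, now, retention_days=RETENTION_DAYS):
    """Return the IDs of snapshots that started before the retention cutoff.

    `snapshots` is a list of dicts with "SnapshotId" and a timezone-aware
    "StartTime" (assumed shape, matching DescribeSnapshots output).
    """
    cutoff = now - timedelta(days=retention_days)
    return [s["SnapshotId"] for s in snapshots if s["StartTime"] < cutoff]
```

Keeping n and m as plain parameters means they can be fed from environment variables or a config map later without touching the logic.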

How this is implemented is still open. Use the least complicated approach that is not Lambda (bonus if it can be implemented in Kubernetes and managed like any other app).

Note: if possible, set it up so that each snapshot also carries the tags of the volume it was taken from.
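One way to carry the tags over, sketched under the assumption that the snapshot is created through the EC2 API (for example boto3's `create_snapshot`, which accepts a `TagSpecifications` argument). The helper name is hypothetical. Tag keys starting with `aws:` are reserved by AWS and cannot be set by users, so the sketch skips them.

```python
def snapshot_tag_specifications(volume_tags):
    """Build a TagSpecifications value that copies a volume's tags to its snapshot.

    `volume_tags` is a list of {"Key": ..., "Value": ...} dicts, the shape
    EC2 uses for tags. Returns an empty list if there is nothing to copy.
    """
    copied = [
        {"Key": tag["Key"], "Value": tag["Value"]}
        for tag in volume_tags
        # Keys with the "aws:" prefix are reserved and would be rejected.
        if not tag["Key"].startswith("aws:")
    ]
    if not copied:
        return []
    return [{"ResourceType": "snapshot", "Tags": copied}]
```

The result would be passed as `TagSpecifications=...` alongside the `VolumeId` when creating the snapshot.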

Restoring

We want to be able to restore from the latest backup in case of failure. To limit the scope of this exercise, assume that we have a running cluster and backups of its etcd state.

The process is roughly the following:

  1. Create a resource in Kubernetes (a control) that should disappear once we restore from backups.
  2. Create three new EBS volumes from the snapshots (each volume should be created in the same AZ as the original volume). Make sure they have the correct tags (see the linked doc at the bottom).
  3. Delete the tags from the current etcd data volumes.
  4. Terminate the master nodes.
  5. Wait for the replacement master nodes to come up, then check that the control resource from step 1 is gone.
  6. Delete the old etcd data volumes.
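Step 2 above can be sketched as a pure planning function that picks the latest completed snapshot for each volume and keeps the original AZ. This assumes the "correct tags" are the ones on the original volumes, which the issue does not confirm; the function names are hypothetical, and the dict keys follow the shapes returned by EC2's DescribeVolumes and DescribeSnapshots.

```python
def latest_snapshot(snapshots, volume_id):
    """Return the newest completed snapshot of `volume_id`, or None if there is none."""
    candidates = [
        s for s in snapshots
        if s["VolumeId"] == volume_id and s["State"] == "completed"
    ]
    return max(candidates, key=lambda s: s["StartTime"], default=None)


def plan_restore(volumes, snapshots):
    """Return one create-volume request per etcd data volume."""
    plan = []
    for vol in volumes:
        snap = latest_snapshot(snapshots, vol["VolumeId"])
        if snap is None:
            raise ValueError(f"no completed snapshot for {vol['VolumeId']}")
        request = {
            "SnapshotId": snap["SnapshotId"],
            # The new volume must live in the same AZ as the original.
            "AvailabilityZone": vol["AvailabilityZone"],
        }
        # Assumption: reuse the original volume's tags, minus reserved "aws:" keys.
        tags = [t for t in vol.get("Tags", []) if not t["Key"].startswith("aws:")]
        if tags:
            request["TagSpecifications"] = [{"ResourceType": "volume", "Tags": tags}]
        plan.append(request)
    return plan
```

Each entry in the plan has the keyword arguments that boto3's `create_volume` expects, so the plan can be reviewed before anything is created. Steps 3 to 6 are left out because they are destructive and better run by hand.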

Which part of the user docs does this impact

Questions / Assumptions

As described above.

Definition of done

Reference

How to write good user stories

razvan-moj-zz commented 6 years ago

Maybe one of these will do the job: https://aws.amazon.com/backup-recovery/partner-solutions/ ? Cloudberry is cheap, Veeam has a free edition, and Commvault is the only one I've used; it works very well but is very expensive.

sablumiah commented 6 years ago

https://github.com/ministryofjustice/cloud-platform-aws-meta-configuration/pull/18