kopeio / etcd-manager

operator for etcd: moved to https://github.com/kubernetes-sigs/etcdadm

Handle volume failures gracefully #339

Open · hakman opened this issue 4 years ago

hakman commented 4 years ago

In most cases, when a control plane node dies, reboots, is replaced, and so on, the EBS volumes used by etcd-manager are simply detached and mounted on the replacement instance. This is the happy path.

There are also cases where one of the volumes fails or is deleted (by mistake, or during failure simulations). In that situation, etcd-manager does not know how to recover and goes into a failure loop. Even adding a new blank volume does not help, because etcd-manager does not know what to do with it.

Reproduce (a rough CLI sketch of these steps follows the list):

  1. create a cluster with 3 control plane nodes
  2. force delete the EBS volumes of one of the control plane nodes and terminate the node
  3. wait for the replacement node to come up (the ASG will do that)
  4. check the error logs of the etcd-manager pods
  5. run kops update cluster --yes to create new blank volumes for etcd
  6. check the error logs to see that the volumes are attached and formatted, but etcd-manager is still lost
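For reference, here is a rough shell sketch of steps 2, 4, and 5. The volume ID, instance ID, and pod name are hypothetical placeholders; the actual IDs and names depend on your cluster, so treat this as an illustration of the reproduction, not an exact recipe.

```sh
# Hypothetical IDs for the etcd volume and control plane node being broken on
# purpose; replace them with the real ones from your cluster.
VOLUME_ID=vol-0123456789abcdef0
INSTANCE_ID=i-0123456789abcdef0

# Step 2: force-detach and delete the etcd volume, then terminate the node.
aws ec2 detach-volume --volume-id "$VOLUME_ID" --force
aws ec2 delete-volume --volume-id "$VOLUME_ID"
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"

# Step 4: after the ASG has replaced the node, inspect the etcd-manager logs.
kubectl -n kube-system get pods | grep etcd-manager
kubectl -n kube-system logs etcd-manager-main-<new-node-name>

# Steps 5-6: create new blank volumes for etcd, then check the logs again.
kops update cluster --yes
kubectl -n kube-system logs etcd-manager-main-<new-node-name>
```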

Workaround (the full sequence is sketched as a script after the list):

  1. Delete s3://<bucket>/<cluster>/backups/etcd/main/control/etcd-cluster-created
  2. Wait for the main etcd cluster to reset
  3. Restore latest main backup:
    etcd-manager-ctl --backup-store=s3://my.clusters/test.my.clusters/backups/etcd/main restore-backup
  4. Delete s3://<bucket>/<cluster>/backups/etcd/events/control/etcd-cluster-created
  5. Wait for the events etcd cluster to reset
  6. Restore latest events backup:
    etcd-manager-ctl --backup-store=s3://my.clusters/test.my.clusters/backups/etcd/events restore-backup
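Putting the workaround together, here is a hedged sketch of the whole sequence as a shell script. The bucket and cluster placeholders mirror the ones above, and the waits between steps are manual: watch the etcd-manager logs until each cluster has reset before restoring. If your version of etcd-manager-ctl requires an explicit backup name for restore-backup, pick one from the list-backups output and append it to the restore-backup command.

```sh
# Assumption: STORE points at this cluster's etcd backup store; replace the
# <bucket>/<cluster> placeholders with your own values.
STORE=s3://<bucket>/<cluster>/backups/etcd

# Main cluster: remove the "cluster created" marker, wait for the reset
# (watch the etcd-manager-main logs), then restore the latest backup.
aws s3 rm "$STORE/main/control/etcd-cluster-created"
etcd-manager-ctl --backup-store="$STORE/main" list-backups
etcd-manager-ctl --backup-store="$STORE/main" restore-backup

# Events cluster: same sequence against the events store.
aws s3 rm "$STORE/events/control/etcd-cluster-created"
etcd-manager-ctl --backup-store="$STORE/events" list-backups
etcd-manager-ctl --backup-store="$STORE/events" restore-backup
```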