kopeio / etcd-manager

operator for etcd: moved to https://github.com/kubernetes-sigs/etcdadm

Handle volume failures gracefully #339

Open · hakman opened this issue 4 years ago

hakman commented 4 years ago

In most cases, when a control plane node dies, reboots, is replaced, and so on, the EBS volumes used by etcd-manager are simply detached and mounted on the replacement instance. This is the happy path.

There are also cases where one of the volumes fails or is deleted (by mistake, or during failure simulations). In that situation, etcd-manager does not know how to recover and goes into a failure loop. Even adding a new blank volume does not help, because etcd-manager does not know what to do with it.

Reproduce (a rough CLI sketch of these steps follows the list):

  1. create a cluster with 3 control plane nodes
  2. force delete the EBS volumes of one of the control plane nodes and terminate the node
  3. wait for the replacement node to come up (the ASG will do that)
  4. check the error logs of the etcd-manager pods
  5. run kops update cluster --yes to create new blank volumes for etcd
  6. check the error logs to see that the volumes are attached and formatted, but etcd-manager is still lost
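For reference, here is a rough shell sketch of steps 2, 4, and 5. The volume ID, instance ID, and pod name are hypothetical placeholders; the actual IDs and names depend on your cluster, so treat this as an illustration of the reproduction, not an exact recipe.

```sh
# Hypothetical IDs for the etcd volume and control plane node being broken on
# purpose; replace them with the real ones from your cluster.
VOLUME_ID=vol-0123456789abcdef0
INSTANCE_ID=i-0123456789abcdef0

# Step 2: force-detach and delete the etcd volume, then terminate the node.
aws ec2 detach-volume --volume-id "$VOLUME_ID" --force
aws ec2 delete-volume --volume-id "$VOLUME_ID"
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"

# Step 4: after the ASG has replaced the node, inspect the etcd-manager logs.
kubectl -n kube-system get pods | grep etcd-manager
kubectl -n kube-system logs etcd-manager-main-<new-node-name>

# Steps 5-6: create new blank volumes for etcd, then check the logs again.
kops update cluster --yes
kubectl -n kube-system logs etcd-manager-main-<new-node-name>
```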

Workaround (the full sequence is sketched as a script after the list):

  1. Delete s3://<bucket>/<cluster>/backups/etcd/main/control/etcd-cluster-created
  2. Wait for the main etcd cluster to reset
  3. Restore latest main backup:
    etcd-manager-ctl --backup-store=s3://my.clusters/test.my.clusters/backups/etcd/main restore-backup
  4. Delete s3://<bucket>/<cluster>/backups/etcd/events/control/etcd-cluster-created
  5. Wait for the events etcd cluster to reset
  6. Restore latest events backup:
    etcd-manager-ctl --backup-store=s3://my.clusters/test.my.clusters/backups/etcd/events restore-backup
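Putting the workaround together, here is a hedged sketch of the whole sequence as a shell script. The bucket and cluster placeholders mirror the ones above, and the waits between steps are manual: watch the etcd-manager logs until each cluster has reset before restoring. If your version of etcd-manager-ctl requires an explicit backup name for restore-backup, pick one from the list-backups output and append it to the restore-backup command.

```sh
# Assumption: STORE points at this cluster's etcd backup store; replace the
# <bucket>/<cluster> placeholders with your own values.
STORE=s3://<bucket>/<cluster>/backups/etcd

# Main cluster: remove the "cluster created" marker, wait for the reset
# (watch the etcd-manager-main logs), then restore the latest backup.
aws s3 rm "$STORE/main/control/etcd-cluster-created"
etcd-manager-ctl --backup-store="$STORE/main" list-backups
etcd-manager-ctl --backup-store="$STORE/main" restore-backup

# Events cluster: same sequence against the events store.
aws s3 rm "$STORE/events/control/etcd-cluster-created"
etcd-manager-ctl --backup-store="$STORE/events" list-backups
etcd-manager-ctl --backup-store="$STORE/events" restore-backup
```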