In most cases, when a control plane node dies, reboots, or is replaced, the EBS volumes for etcd-manager are simply detached and mounted on the new instance. This is the happy scenario.
There are cases, however, where one of the volumes fails or is deleted (by mistake or during failure simulations). In that case, etcd-manager does not know what to do and goes into a failure loop. Even adding a new blank volume does not help, because etcd-manager does not know what to do with it.
Reproduce:
create a cluster with 3 control plane nodes
force delete the EBS volumes of one of the control plane nodes and terminate the node
wait for the replacement node to come up (ASG will do that)
check the error logs of the etcd-manager pods
run kops update cluster --yes to create new blank volumes for etcd
check the error logs again: the volumes are attached and formatted, but etcd-manager is still lost
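The log-checking steps above could be sketched roughly as follows. This only prints a kubectl command and does not run it. The `logs_cmd` helper is hypothetical, and the `kube-system` namespace and `k8s-app=etcd-manager-<cluster>` label are assumptions that should be verified against the actual cluster.

```shell
#!/usr/bin/env bash
# Sketch: print (not run) a kubectl command that tails etcd-manager logs.
# ASSUMPTION: the pods live in kube-system and carry the label
# k8s-app=etcd-manager-<cluster>; verify with `kubectl get pods --show-labels`.
logs_cmd() {
  # $1 = etcd cluster (main or events), $2 = number of lines to tail (default 50)
  printf 'kubectl -n kube-system logs -l k8s-app=etcd-manager-%s --tail=%s\n' "$1" "${2:-50}"
}

# One command per etcd cluster.
for etcd in main events; do
  logs_cmd "$etcd" 100
done
```

Printing the command first makes it easy to adjust the namespace or label before running anything against a broken control plane.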
Workaround:
delete s3://<bucket>/<cluster>/backups/etcd/main/control/etcd-cluster-created to allow the main etcd cluster to reset
restore the main backup:
delete s3://<bucket>/<cluster>/backups/etcd/events/control/etcd-cluster-created to allow the events etcd cluster to reset
restore the events backup:
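The two marker paths in the workaround differ only in the etcd cluster name, so they can be generated from one pattern. The sketch below only prints the paths and touches nothing in S3. The `marker_uri` helper and the default `BUCKET` and `CLUSTER` values are hypothetical placeholders.

```shell
#!/usr/bin/env bash
# Sketch: build the etcd-cluster-created marker URI for each etcd cluster.
# BUCKET and CLUSTER are placeholders; substitute the real state store values.
BUCKET="${BUCKET:-example-bucket}"
CLUSTER="${CLUSTER:-example.cluster.local}"

marker_uri() {
  # $1 = bucket, $2 = cluster name, $3 = etcd cluster (main or events)
  printf 's3://%s/%s/backups/etcd/%s/control/etcd-cluster-created\n' "$1" "$2" "$3"
}

for etcd in main events; do
  # Print only; nothing is deleted here.
  echo "marker for ${etcd}: $(marker_uri "$BUCKET" "$CLUSTER" "$etcd")"
done
```

Each printed URI can then be passed to whichever S3 tool is in use (for example `aws s3 rm`, assuming the AWS CLI is installed and configured), after checking that it points at the intended cluster.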