siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.

https://www.talos.dev

Mozilla Public License 2.0

Makes etcd recovery faster #4179

Closed sergelogvinov closed 1 week ago

sergelogvinov commented 2 years ago

Feature Request

Make it possible to add the --force-new-cluster flag and restart etcd easily. Currently we have to reboot the instance to apply extraArgs to etcd.

Description

According to this manual https://www.talos.dev/docs/v0.12/guides/disaster-recovery/ we have to make a backup, clean the partition, and recover etcd from the backup. This is too complicated, and it is easy to make mistakes along the way.

A disaster recovery plan has to be easy. At that time we have a lot of alerts/messages, and people in this situation make mistakes.

So if you restart etcd with the --force-new-cluster flag, it becomes healthy very quickly.

It would be better to have a command like

talosctl etcd remove-member --all --force

And Talos restarts etcd with the --force-new-cluster flag. In this case you do not have to restart the node. All pods/deployments will keep working.

smira commented 2 years ago

For HA clusters, the other nodes should be wiped as well anyway, so this "fast" solution only works for single-node clusters? Not sure that's the best trade-off.

With an etcd snapshot, the cluster can be re-bootstrapped even if the control plane nodes are removed, so the recovery process is almost the same as the initial creation; the only difference is that the etcd backup is supplied to talosctl bootstrap.

sergelogvinov commented 2 years ago

In this case... You had an HA cluster and lost half of it (or more). The probability of losing more than half of the etcd nodes is not zero. Talos has to have one intuitive "button" to recover etcd.

This can happen once a year or less. No one will remember what to do. Searching the documentation increases cluster downtime. With one command you would recover one node and then scale up the control plane, and afterwards remove the old bad instances. It could be done from a mobile phone or tablet.

Make a backup, do not forget to download it!, check the backup, clean the node, restore the backup. There are many places here to make a mistake. And a reboot is also a bad thing here. If it is an HP dedicated server, a reboot can take 5 minutes...

Bad things happen when people are not ready. Someone can be on vacation, another sick. In urgent situations people always make mistakes (much more than usual).

andrewrynhard commented 2 years ago

I am all for making recovery "easy". Would need to vet this idea some more.

smira commented 2 years ago

Planning Meeting Notes

If etcd loses quorum, etcd membership can't be changed (lost nodes can't be removed from the live nodes), so the only way to recover from a lost-quorum scenario is to go through the full disaster recovery procedure - take a snapshot from one of the running nodes, wipe all etcd members and re-bootstrap the etcd cluster.

So it doesn't seem like having a shortcut will make things any easier.
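As an illustration of the quorum arithmetic behind this point, here is a minimal sketch in Python. It assumes only the standard majority rule that etcd uses (more than half of the voting members must be healthy); the function names `quorum` and `has_quorum` are hypothetical and are not part of Talos or etcd.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def has_quorum(members: int, healthy: int) -> bool:
    """True if the healthy members still form a majority."""
    return healthy >= quorum(members)


for members in (1, 3, 5):
    tolerated = members - quorum(members)
    print(f"{members} members: quorum={quorum(members)}, tolerates {tolerated} failure(s)")

# Losing 2 of 3 members leaves 1 healthy, which is below the quorum of 2,
# so membership changes (such as removing the lost nodes) cannot be committed.
print(has_quorum(3, 1))  # False
```

Under this rule, losing exactly half of an even-sized cluster (for example 2 of 4 members) also breaks quorum, which is the "lost half of it (or more)" case raised earlier in the thread.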

One valid point is that wiping a node via the talosctl reset command requires a full reboot, which might take time on bare metal. That can be improved with kexec; we should make this better in the future.

Adding a command which wipes etcd easily seems like a potential problem for users who can accidentally wipe their cluster.

:-1:

github-actions[bot] commented 1 week ago

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 1 week ago

This issue was closed because it has been stalled for 7 days with no activity.