scality / metalk8s

An opinionated Kubernetes distribution with a focus on long-term on-prem deployments
Apache License 2.0

cannot replace dead etcd node #152

Open benoit-a opened 6 years ago

benoit-a commented 6 years ago

One of my nodes (in the kube-master / etcd groups) failed and was replaced by a fresh new server with a new IP. The install is around 2 months old.

I took the dev/0.1 branch and launched playbooks/deploy.yml. The playbook failed at some "check etcd health" step.

As a last resort, I recreated the cluster from scratch.
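For reference, a minimal sketch of the usual manual workaround before re-running the deploy: deregister the dead member from a surviving etcd node, then register the replacement with its new IP. This is not verified against the MetalK8s playbooks; the endpoints, certificate paths, and names below are placeholders.

```bash
# Run on a surviving etcd member; endpoints and certificate paths are
# placeholders and must be adapted to the actual deployment.
export ETCDCTL_API=3
ETCD="etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ca.crt \
  --cert=/etc/etcd/client.crt \
  --key=/etc/etcd/client.key"

# 1. Find the ID of the dead member.
$ETCD member list

# 2. Deregister it so it no longer counts toward cluster membership.
$ETCD member remove <dead-member-id>

# 3. Pre-register the replacement node with its new IP, then let the
#    deploy playbook bring it up and join the cluster.
$ETCD member add <new-node-name> --peer-urls=https://<new-node-ip>:2380
```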

giacomoguiulfo commented 6 years ago

Ran into this issue too for the same exact reason, but the failure when re-deploying was at a different step: TASK [etcd : Join Member | Add member to etcd cluster].

Since I had a 5-node cluster, it was possible to continue using the cluster by just keeping 3 nodes for etcd.
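(With 5 members, etcd needs a quorum of 3, so losing two members still leaves the cluster writable.) A quick way to confirm the remaining members are healthy is a sketch like the one below; the endpoint list and certificate flags are placeholders.

```bash
# Check that the remaining etcd members still answer and form a quorum.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.0.0.1:2379,https://10.0.0.2:2379,https://10.0.0.3:2379 \
  --cacert=/etc/etcd/ca.crt --cert=/etc/etcd/client.crt --key=/etc/etcd/client.key \
  endpoint health
```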

thomasdanan commented 4 years ago

@TeddyAndrieux Does this issue still make sense in the context of MetalK8s 2.x? Is it related to https://github.com/scality/metalk8s/issues/2186?

TeddyAndrieux commented 4 years ago

We do not manage members the same way in MetalK8s 2.x, but we may need to take into account restoring an etcd node that has the same member name as an already existing member but a different IP (and, more generally, restoring a node with the same minion_id but a different IP). It's not directly related to #2186 but to restoration in general (handled by the restore script, btw).

E.g.: if we lose the bootstrap node and try to restore from a backup on a new machine using the same minion_id/node_name as the previous bootstrap node (but with a different IP), I'm not sure whether the restore goes well or not, TBC
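For the same-name/different-IP case, one possible path (a sketch only, untested with the restore script; the member ID, URLs, and certificate flags are placeholders) is to update the existing member's peer URL instead of removing and re-adding it:

```bash
# Rewrite the peer URL of the existing member entry so it points at the
# restored node's new IP; the member ID comes from `etcdctl member list`.
ETCDCTL_API=3 etcdctl --endpoints=https://<surviving-node>:2379 \
  --cacert=/etc/etcd/ca.crt --cert=/etc/etcd/client.crt --key=/etc/etcd/client.key \
  member update <member-id> --peer-urls=https://<new-ip>:2380
```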