Closed: muecahit94 closed this issue 3 years ago.
Hi!

In such cases you can recover etcd using the following steps:

1. Create a `nostart` file in the data directory of the node to recover. The data directory is likely `/var/lib/linstor-etcd` on your host. That will start the pod without actually starting the etcd server, meaning you can do any required changes via `kubectl exec`.
2. Run the `etcdctl member add` command on one of the remaining nodes. Make sure to copy the environment variables needed to add the member to the cluster.
3. `kubectl exec` into the member pod that should be recovered and move `/var/run/etcd/default.etcd` and `/var/run/etcd/member_id` to a different location (for example `/var/run/etcd/backup`). Then run `etcd` with the copied environment. That should start the etcd member, which should then sync with the remaining members.
4. Stop `etcd` again and remove the `nostart` file; the pod should then start the normal etcd instance and the cluster should be healthy again.
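Roughly, and assuming the pod names from your setup (`piraeus-op-etcd-0` as the member to recover, `piraeus-op-etcd-1` as a healthy member) while leaving out any namespace flags, the commands could look something like this sketch:

```
# 1. On the host of the node to recover: keep the pod from starting etcd.
#    (Host data directory assumed to be the default /var/lib/linstor-etcd.)
touch /var/lib/linstor-etcd/nostart

# 2. From a healthy member, re-add the failed member. The command prints
#    ETCD_* environment variables; copy them for the next step.
kubectl exec piraeus-op-etcd-1 -- etcdctl member add piraeus-op-etcd-0 \
  --peer-urls=http://piraeus-op-etcd-0.piraeus-op-etcd:2380

# 3. Inside the pod to recover: move the old state aside and run etcd
#    manually with the variables printed by "member add".
kubectl exec -it piraeus-op-etcd-0 -- sh
# (inside the pod)
mkdir -p /var/run/etcd/backup
mv /var/run/etcd/default.etcd /var/run/etcd/member_id /var/run/etcd/backup/
ETCD_NAME=... ETCD_INITIAL_CLUSTER=... \
ETCD_INITIAL_ADVERTISE_PEER_URLS=... ETCD_INITIAL_CLUSTER_STATE=existing \
etcd    # paste the values copied in step 2 in place of "..."

# 4. Once the member has synced: stop the manual etcd (Ctrl-C), then remove
#    the nostart file on the host so the pod runs its normal etcd instance.
rm /var/lib/linstor-etcd/nostart
```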
@WanzenBug Thank you very much for the fast answer; this solved our problem.
We have the problem that a node failed, and now etcd-0 has failed and can no longer start on another node. In our cluster we have 3 etcd members; the etcd cluster is still running, but how can we add the new etcd pod back to the cluster? We had a similar situation on another cluster, but that cluster was not in productive use and it was the etcd-2 pod; we could fix it with the following steps:

This time the case is different: the cluster is in productive use and we cannot experiment, and it is not the etcd-2 pod but etcd-0, so we don't know if there is a good way to remove etcd-0 and add it back.
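For reference, we check the current cluster state from one of the healthy members roughly like this (pod name and flags just as an example of how we look at it):

```
# List the current members as seen by a healthy pod
kubectl exec piraeus-op-etcd-1 -- etcdctl member list -w table

# Check the health of all endpoints in the cluster
kubectl exec piraeus-op-etcd-1 -- etcdctl endpoint health --cluster
```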
Versions: Kubernetes v1.19.8, etcd-development/etcd:v3.4.9; if I am not wrong, we installed the piraeus-operator v1.3.1 a few months ago.
We tried to remove the member and add it back with an empty data folder using the command `etcdctl member add piraeus-op-etcd-0 --peer-urls=http://piraeus-op-etcd-0.piraeus-op-etcd:2380`, then restarted etcd-0, but it didn't help; the pod keeps restarting after throwing these messages: