etcd doesn't start after node was down

muecahit94 commented 3 years ago

We have a problem: a node failed, and now etcd-0 has failed and can no longer start on another node. Our cluster has 3 etcd members and the etcd cluster is still running, but how can we add the new etcd pod to the cluster? We had a similar situation on another cluster, but that cluster was not in production use and the affected pod was etcd-2. We fixed it with the following steps:

remove the ETCD Member with: etcdctl member remove [MemberID]
add the member back with: etcdctl member add piraeus-op-etcd-2 --peer-urls=http://piraeus-op-etcd-2.piraeus-op-etcd:2380
scale down to 0 and then back to 3, the new ETCD member was added
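
The remove/add steps above can be sketched as a small shell helper. This is only a sketch under assumptions: the function names member_id_by_name and replace_member are hypothetical, etcdctl is assumed to already be pointed at a healthy endpoint, and the parsing relies on the default comma-separated output of "etcdctl member list" in etcd 3.4 (ID, status, name, peer URLs, client URLs, learner flag).

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical and not part of etcd or piraeus-operator.

# Print the ID of the member with the given name.
# Reads the default ("simple") output of `etcdctl member list` from stdin,
# where each line is "ID, status, name, peer URLs, client URLs, is-learner".
member_id_by_name() {
    awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Remove a member by name and register it again with the given peer URL.
# Assumes etcdctl can already reach a healthy member of the cluster.
replace_member() {
    name="$1"
    peer_url="$2"
    id=$(etcdctl member list | member_id_by_name "$name")
    if [ -z "$id" ]; then
        echo "member $name not found" >&2
        return 1
    fi
    etcdctl member remove "$id"
    etcdctl member add "$name" --peer-urls="$peer_url"
}
```

For the case described above, the call would be: replace_member piraeus-op-etcd-2 http://piraeus-op-etcd-2.piraeus-op-etcd:2380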

This time the case is different: the cluster is in production use, so we cannot experiment. And it's not the etcd-2 pod but etcd-0, so we don't know if there is a good way to remove etcd-0 and add it back.

Versions: K8S v1.19.8, etcd-development/etcd:v3.4.9. If I am not mistaken, we installed piraeus-operator v1.3.1 a few months ago.

We tried removing the member and adding it back with an empty data folder using the command "etcdctl member add piraeus-op-etcd-0 --peer-urls=http://piraeus-op-etcd-0.piraeus-op-etcd:2380" and then restarting etcd-0, but that didn't help. It restarts continuously after printing these messages:

Waiting for piraeus-op-etcd-0.piraeus-op-etcd to come up
Waiting for piraeus-op-etcd-1.piraeus-op-etcd to come up
Waiting for piraeus-op-etcd-2.piraeus-op-etcd to come up
Waiting for piraeus-op-etcd-0.piraeus-op-etcd to come up
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-05-30 21:06:03.194339 I | etcdmain: etcd Version: 3.4.9
2021-05-30 21:06:03.194380 I | etcdmain: Git SHA: 54ba95891
2021-05-30 21:06:03.194384 I | etcdmain: Go Version: go1.12.17
2021-05-30 21:06:03.194387 I | etcdmain: Go OS/Arch: linux/amd64
2021-05-30 21:06:03.194390 I | etcdmain: setting maximum number of CPUs to 16, total number of available CPUs is 16
2021-05-30 21:06:03.194485 N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-05-30 21:06:03.194731 I | embed: name = piraeus-op-etcd-0
2021-05-30 21:06:03.194742 I | embed: data dir = /var/run/etcd/default.etcd
2021-05-30 21:06:03.194746 I | embed: member dir = /var/run/etcd/default.etcd/member
2021-05-30 21:06:03.194749 I | embed: heartbeat = 100ms
2021-05-30 21:06:03.194752 I | embed: election = 1000ms
2021-05-30 21:06:03.194755 I | embed: snapshot count = 100000
2021-05-30 21:06:03.194763 I | embed: advertise client URLs = http://piraeus-op-etcd-0.piraeus-op-etcd:2379
2021-05-30 21:06:03.194769 I | embed: initial advertise peer URLs = http://piraeus-op-etcd-0.piraeus-op-etcd:2380
2021-05-30 21:06:03.194775 I | embed: initial cluster =
2021-05-30 21:06:03.195632 I | etcdserver: restarting member 16619bcda1f20c2e in cluster c4baab52329b7b3d at commit index 3
raft2021/05/30 21:06:03 INFO: 16619bcda1f20c2e switched to configuration voters=()
raft2021/05/30 21:06:03 INFO: 16619bcda1f20c2e became follower at term 1857
raft2021/05/30 21:06:03 INFO: newRaft 16619bcda1f20c2e [peers: [], term: 1857, commit: 3, applied: 0, lastindex: 3, lastterm: 1]
2021-05-30 21:06:03.200015 W | auth: simple token is not cryptographically signed
2021-05-30 21:06:03.200787 I | etcdserver: starting server... [version: 3.4.9, cluster version: to_be_decided]
raft2021/05/30 21:06:03 INFO: 16619bcda1f20c2e switched to configuration voters=(450703631586928011)
2021-05-30 21:06:03.202223 I | etcdserver/membership: added member 6413890a3b7a58b [http://piraeus-op-etcd-1.piraeus-op-etcd:2380] to cluster c4baab52329b7b3d
2021-05-30 21:06:03.202254 I | rafthttp: starting peer 6413890a3b7a58b...
2021-05-30 21:06:03.202311 I | rafthttp: started HTTP pipelining with peer 6413890a3b7a58b
2021-05-30 21:06:03.204324 I | rafthttp: started streaming with peer 6413890a3b7a58b (writer)
2021-05-30 21:06:03.204834 I | rafthttp: started streaming with peer 6413890a3b7a58b (writer)
2021-05-30 21:06:03.205652 I | rafthttp: started peer 6413890a3b7a58b
2021-05-30 21:06:03.205685 I | rafthttp: added peer 6413890a3b7a58b
2021-05-30 21:06:03.205757 I | rafthttp: started streaming with peer 6413890a3b7a58b (stream Message reader)
raft2021/05/30 21:06:03 INFO: 16619bcda1f20c2e switched to configuration voters=(450703631586928011 1612741449062943790)
2021-05-30 21:06:03.205851 I | rafthttp: started streaming with peer 6413890a3b7a58b (stream MsgApp v2 reader)
2021-05-30 21:06:03.205900 I | etcdserver/membership: added member 16619bcda1f20c2e [http://piraeus-op-etcd-0.piraeus-op-etcd:2380] to cluster c4baab52329b7b3d
raft2021/05/30 21:06:03 INFO: 16619bcda1f20c2e switched to configuration voters=(450703631586928011 1612741449062943790 8284708090028106271)
2021-05-30 21:06:03.206061 I | etcdserver/membership: added member 72f9321d15ee221f [http://piraeus-op-etcd-2.piraeus-op-etcd:2380] to cluster c4baab52329b7b3d
2021-05-30 21:06:03.206080 I | rafthttp: starting peer 72f9321d15ee221f...
2021-05-30 21:06:03.206114 I | rafthttp: started HTTP pipelining with peer 72f9321d15ee221f
2021-05-30 21:06:03.207032 I | rafthttp: started streaming with peer 72f9321d15ee221f (writer)
2021-05-30 21:06:03.207853 I | rafthttp: started streaming with peer 72f9321d15ee221f (writer)
2021-05-30 21:06:03.208457 I | rafthttp: started peer 72f9321d15ee221f
2021-05-30 21:06:03.208531 I | rafthttp: added peer 72f9321d15ee221f
2021-05-30 21:06:03.208614 I | rafthttp: started streaming with peer 72f9321d15ee221f (stream Message reader)
2021-05-30 21:06:03.209142 I | rafthttp: started streaming with peer 72f9321d15ee221f (stream MsgApp v2 reader)
2021-05-30 21:06:03.209566 I | embed: listening for peers on [::]:2380
2021-05-30 21:06:03.210217 E | etcdserver: the member has been permanently removed from the cluster
2021-05-30 21:06:03.210234 I | etcdserver: the data-dir used by this member must be removed.
2021-05-30 21:06:03.210288 E | etcdserver: publish error: etcdserver: request cancelled
2021-05-30 21:06:03.210316 E | etcdserver: publish error: etcdserver: request cancelled
2021-05-30 21:06:03.210330 E | etcdserver: publish error: etcdserver: request cancelled
2021-05-30 21:06:03.210343 I | etcdserver: aborting publish because server is stopped
2021-05-30 21:06:03.210376 I | rafthttp: stopping peer 6413890a3b7a58b...
2021-05-30 21:06:03.210393 I | rafthttp: stopped streaming with peer 6413890a3b7a58b (writer)
2021-05-30 21:06:03.210412 I | rafthttp: stopped streaming with peer 6413890a3b7a58b (writer)
2021-05-30 21:06:03.210480 I | rafthttp: stopped HTTP pipelining with peer 6413890a3b7a58b
2021-05-30 21:06:03.210536 I | rafthttp: stopped streaming with peer 6413890a3b7a58b (stream MsgApp v2 reader)
2021-05-30 21:06:03.210566 I | rafthttp: stopped streaming with peer 6413890a3b7a58b (stream Message reader)
2021-05-30 21:06:03.210576 I | rafthttp: stopped peer 6413890a3b7a58b
2021-05-30 21:06:03.210589 I | rafthttp: stopping peer 72f9321d15ee221f...
2021-05-30 21:06:03.210608 I | rafthttp: stopped streaming with peer 72f9321d15ee221f (writer)
2021-05-30 21:06:03.210621 I | rafthttp: stopped streaming with peer 72f9321d15ee221f (writer)
2021-05-30 21:06:03.210655 I | rafthttp: stopped HTTP pipelining with peer 72f9321d15ee221f
2021-05-30 21:06:03.210788 I | rafthttp: stopped streaming with peer 72f9321d15ee221f (stream MsgApp v2 reader)
2021-05-30 21:06:03.210832 I | rafthttp: stopped streaming with peer 72f9321d15ee221f (stream Message reader)
2021-05-30 21:06:03.210851 I | rafthttp: stopped peer 72f9321d15ee221f
2021-05-30 21:06:03.212163 W | rafthttp: failed to process raft message (raft: stopped)
2021-05-30 21:06:03.214198 E | rafthttp: failed to find member 6413890a3b7a58b in cluster c4baab52329b7b3d
2021-05-30 21:06:03.214213 E | rafthttp: failed to find member 6413890a3b7a58b in cluster c4baab52329b7b3d

WanzenBug commented 3 years ago

Hi!

In such cases you can recover etcd using the following steps:

Place a file called nostart in the data directory of the node to recover. The data directory is likely /var/lib/linstor-etcd on your host. The pod will then start without actually starting the etcd server, so you can make any required changes via kubectl exec.
Run the etcdctl member add command on one of the remaining nodes. Make sure to copy the environment variables from its output, which are needed to join the member to the cluster.
kubectl exec into the member pod that should be recovered and move /var/run/etcd/default.etcd and /var/run/etcd/member_id to a different location (for example /var/run/etcd/backup). Then run etcd with the copied environment. That should start the etcd member, which should then sync with the remaining members.
If you don't see any error messages, you can stop etcd again and remove the nostart file. The pod should then start the normal etcd instance, and the cluster should be healthy again.
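
The middle two steps can be sketched roughly as follows. Everything marked hypothetical is an assumption, not something confirmed in this thread: the namespace, which pod is treated as healthy, the run and recover_member helpers, and the presence of mkdir and mv in the etcd image. Creating the nostart file on the host and starting etcd with the copied environment are left as manual steps, because the values come from the output of etcdctl member add.

```shell
#!/bin/sh
# Sketch only: values below are hypothetical defaults; adjust to your deployment.
NAMESPACE="${NAMESPACE:-default}"               # assumption: operator namespace
HEALTHY_POD="${HEALTHY_POD:-piraeus-op-etcd-1}" # assumption: a member that is still running
BROKEN_POD="${BROKEN_POD:-piraeus-op-etcd-0}"   # the member to recover
PEER_URL="http://${BROKEN_POD}.piraeus-op-etcd:2380"

# Hypothetical helper: print the command instead of executing it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

recover_member() {
    # Before this: create the nostart file in the data directory on the host.

    # Register the member from a healthy pod. Copy the ETCD_* variables
    # that etcdctl prints; they are needed when starting etcd by hand.
    run kubectl exec -n "$NAMESPACE" "$HEALTHY_POD" -- \
        etcdctl member add "$BROKEN_POD" --peer-urls="$PEER_URL"

    # Move the stale state of the broken member out of the way.
    run kubectl exec -n "$NAMESPACE" "$BROKEN_POD" -- \
        mkdir -p /var/run/etcd/backup
    run kubectl exec -n "$NAMESPACE" "$BROKEN_POD" -- \
        mv /var/run/etcd/default.etcd /var/run/etcd/member_id /var/run/etcd/backup/

    # After this: start etcd inside the pod with the copied environment,
    # check for errors, then stop it and remove the nostart file.
}
```

Running the function once with DRY_RUN=1 prints the kubectl commands without touching the cluster, which is a cheap check before doing this on a production system.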

muecahit94 commented 3 years ago

@WanzenBug Thank you very much for the fast answer, this solved our problem.

piraeusdatastore / piraeus-operator

etcd doesn't start after node was down #185