hashicorp / memberlist

Golang package for gossip based membership and failure detection
Mozilla Public License 2.0

Permanent cluster member #222

Open champtar opened 4 years ago

champtar commented 4 years ago

Hi All,

I'm using Memberlist to provide fast dead-node detection in MetalLB, and I feel that some of the features I'm writing around Memberlist should be included in it:

If I have 4 members on 4 nodes and there is a network outage for 1 or 2 minutes, Memberlist communication will time out and Memberlist will not recover: each node ends up considering the local member to be the only one that is alive.

Would it make sense to you to:

The idea is to have the external code just call PermanentJoin(hostlist) whenever it sees a change in the Kubernetes API.

mayuresh82 commented 3 years ago

Is there any workaround for this? Can the client simply attempt to rejoin periodically?

champtar commented 3 years ago

That is what we now do in MetalLB: a periodic rejoin.

stilldavid commented 9 months ago

I noticed this in a fairly simple implementation. If there's a network outage affecting a single node, the other nodes correctly kick it out of their lists, but the single node also kicks everyone else out of its own list, becoming isolated and never rejoining, even after the network comes back.

Current workaround is to periodically check the list for a member count of 1 and rejoin if so. I'd love for Join() to be cheaper to call (a no-op for existing members, as @champtar recommended) so we can call it periodically without side effects (syncing state, which might be expensive and unnecessary), or have a better internal mechanism to detect a solo split brain as it seems like it might be a pretty common case.
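For anyone landing here, the workaround described above might look roughly like the sketch below. It is not code from MetalLB or from memberlist itself. The `joiner` interface, `rejoinIfAlone`, `rejoinLoop`, the `onErr` hook, and `fakeList` are names made up for illustration. The only memberlist calls assumed are `NumMembers()` and `Join([]string)`, which `*memberlist.Memberlist` provides, so a real instance should satisfy the interface.

```go
package main

import (
	"context"
	"time"
)

// joiner is the subset of *memberlist.Memberlist this sketch relies on.
// (Hypothetical interface; the two method signatures match the real type.)
type joiner interface {
	NumMembers() int
	Join(existing []string) (int, error)
}

// rejoinIfAlone calls Join only when the local node is the sole live member,
// so the healthy case skips the state sync that Join triggers.
// It reports whether a join was attempted.
func rejoinIfAlone(m joiner, peers []string) (bool, error) {
	if len(peers) == 0 || m.NumMembers() > 1 {
		return false, nil
	}
	_, err := m.Join(peers)
	return true, err
}

// rejoinLoop runs the check every interval until ctx is cancelled.
// peers is called on each tick so the host list can change over time.
// Join errors go to onErr (hypothetical hook) and the loop keeps retrying.
func rejoinLoop(ctx context.Context, m joiner, peers func() []string,
	interval time.Duration, onErr func(error)) {
	t := time.NewTicker(interval)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			if _, err := rejoinIfAlone(m, peers()); err != nil && onErr != nil {
				onErr(err)
			}
		}
	}
}

// fakeList is an in-memory stand-in for trying the logic without a network.
type fakeList struct {
	members int // live members, including the local node
	joins   int // number of Join calls observed
}

func (f *fakeList) NumMembers() int { return f.members }

func (f *fakeList) Join(existing []string) (int, error) {
	f.joins++
	f.members = 1 + len(existing)
	return len(existing), nil
}
```

This only covers the "solo" case (member count of 1). A partition that leaves two or more nodes on each side would not trigger a rejoin, which is part of why a cheap, idempotent Join() or an internal mechanism would still be preferable.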