Network partition during peer discovery in auto clustering causes two clusters to form

AceHack commented 6 years ago

RabbitMQ version 3.7, official Docker image. See https://github.com/rabbitmq/rabbitmq-peer-discovery-k8s/issues/12 for logs and environment.

1) Create a DNS or K8s cluster of 5 nodes.
2) Force a network partition so that only nodes 1 and 2 can talk to each other and only nodes 3, 4, and 5 can talk to each other.
3) Let the rabbit-autocluster plugin run peer discovery and cluster formation.
4) Two separate split-brain clusters form, and each one reports as healthy.
5) After a sufficient amount of time, remove the network partition.

NOTE: This also occurs with mismatched Erlang cookies, where 2 PCs have cookie 1 and 3 PCs have cookie 2.

See here for more info. I also created a rabbitmq-users post on this subject.

michaelklishin commented 6 years ago

Peer discovery will cluster with the nodes that can be reached from the discovered list. Forming a cluster with a network split already in place is going to have this effect. There aren't too many alternatives I can think of. One of them, waiting for a set of nodes to be up before forming a cluster, has been tried in the rabbitmq-clusterer plugin and it turned out to be a true operational disaster: with that approach, if one node is down, your entire cluster won't form.

So this works as expected from peer discovery at the moment.
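To make the effect concrete, here is a toy model of the scenario from the report. It is only a sketch under stated assumptions: `form_clusters` and the node names are hypothetical, and the grouping logic is a simplification for illustration, not the plugin's actual algorithm. Each node ends up clustered with whichever discovered peers it can reach.

```python
from typing import Dict, FrozenSet, List, Set


def form_clusters(
    discovered: List[str], reachable: Dict[str, Set[str]]
) -> Set[FrozenSet[str]]:
    """Toy model (hypothetical): group discovered nodes by network reachability."""
    clusters: Set[FrozenSet[str]] = set()
    seen: Set[str] = set()
    for node in discovered:
        if node in seen:
            continue
        # Collect every discovered node transitively reachable from this one.
        members = {node}
        frontier = [node]
        while frontier:
            current = frontier.pop()
            for peer in reachable.get(current, set()):
                if peer in discovered and peer not in members:
                    members.add(peer)
                    frontier.append(peer)
        seen |= members
        clusters.add(frozenset(members))
    return clusters


nodes = ["n1", "n2", "n3", "n4", "n5"]

# Partition in place during discovery: {n1, n2} cannot see {n3, n4, n5}.
partitioned = {
    "n1": {"n2"},
    "n2": {"n1"},
    "n3": {"n4", "n5"},
    "n4": {"n3", "n5"},
    "n5": {"n3", "n4"},
}

# No partition: every node can reach every other node.
healthy = {n: set(nodes) - {n} for n in nodes}

print(form_clusters(nodes, partitioned))  # two separate clusters
print(form_clusters(nodes, healthy))      # one cluster of five nodes
```

In this model the partitioned case yields two clusters and the healthy case yields one, which matches the behavior described in the report.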

michaelklishin commented 6 years ago

There is no consensus on the best way to address this => mailing list material.

AceHack commented 6 years ago

When running RabbitMQ on platforms like K8s, where 10s if not 100s of automated deploys happen per day, it's impossible to know at deploy time whether a network partition is occurring or not. If this functions as designed, that would suggest rabbit-autocluster is not K8s production ready.

michaelklishin commented 6 years ago

@AceHack I'm sorry for how blunt I'm going to be, but you need to stop fixating so much on the scenario you have discovered. This is not a new discovery: we have seen all this before with BOSH and Cloud Foundry, and things work pretty decently in practice as of 3.6.7+ and once we got rid of rabbitmq-clusterer. There is plenty of production evidence of that.

As rabbitmq-autocluster README states, peer discovery is not a substitute for understanding of how RabbitMQ clustering works. You seem to be a bit confused about that (sorry).

The reason why this is NOT a major problem is this: once a cluster is formed, peer discovery isn't used and nodes simply rejoin the peers they know. Do you resize your cluster 10s or 100s of times a day? In that case it may or may not be production ready. Existing members will not use peer discovery, only newly brought up nodes will, so the problem won't apply to the majority of nodes.
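The distinction between existing members and new nodes can be sketched as a single decision. This is a hedged illustration only: `peers_to_contact` and its parameters are hypothetical names, not RabbitMQ's API.

```python
from typing import List, Optional


def peers_to_contact(
    known_peers: Optional[List[str]], discovered: List[str]
) -> List[str]:
    """Return the peers a booting node would try to join (hypothetical model)."""
    if known_peers:
        # Existing cluster member: rejoin the peers it already knows,
        # without consulting peer discovery at all.
        return list(known_peers)
    # Newly brought up node with no prior cluster state: use peer discovery.
    return list(discovered)


# A restarted member ignores whatever discovery would currently return.
print(peers_to_contact(["n1", "n2"], ["n3", "n4", "n5"]))
# A brand-new node depends on the discovered list.
print(peers_to_contact(None, ["n3", "n4", "n5"]))
```

Only the second call is exposed to the partition-during-discovery scenario.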

Your specific suggestions on what can be changed are welcome on the mailing list.

michaelklishin commented 6 years ago

Another idea brought up on rabbitmq-users that is easy to try is this: when a peer is contacted right after discovery, we can introduce the "N retries every T seconds" approach that we already use for re-joining nodes.

In theory it should help with the transient network partition scenario as long as it goes away in a certain window of time. As for possible downsides, the only one I can think of is much slower cluster formation in certain scenarios (when retries happen a lot).
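The "N retries every T seconds" idea could look roughly like the sketch below. All names here (`join_with_retries`, `try_join`, `retries`, `interval`) are assumptions made for illustration, not the plugin's implementation. The `sleep` parameter is injectable so the loop can be exercised without real waiting.

```python
import time
from typing import Callable, Optional, Sequence


def join_with_retries(
    peers: Sequence[str],
    try_join: Callable[[str], bool],
    retries: int,
    interval: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Try each discovered peer up to `retries` rounds, pausing between rounds.

    Returns the first peer that accepted the join, or None if every round failed.
    """
    for attempt in range(retries):
        for peer in peers:
            if try_join(peer):
                return peer
        # Wait before the next round, but not after the final one.
        if attempt < retries - 1:
            sleep(interval)
    return None


# Usage: a peer that only becomes reachable on the third attempt,
# standing in for a transient partition that heals.
attempts = []

def flaky_join(peer: str) -> bool:
    attempts.append(peer)
    return len(attempts) >= 3

waits = []
joined = join_with_retries(["n3"], flaky_join, retries=5, interval=2.0, sleep=waits.append)
print(joined, waits)
```

The downside mentioned above shows up directly: when no peer is ever reachable, the caller waits roughly `(retries - 1) * interval` seconds before giving up.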

rabbitmq / rabbitmq-server

Network partition during peer discovery in auto clustering causes two clusters to form #1455