AceHack closed this issue 6 years ago.
I would like to add that the scenario described above creates a serious split-brain problem where you end up with two isolated clusters acting independently even though they are listed as one unified cluster in K8s. This causes traffic to get load balanced across all 5 nodes even though you actually have a 3-node cluster and a 2-node cluster.
Just to explain the problem a little further.
First off, the 3-node cluster came up successfully with a random Erlang cookie secret.
Then, when the StatefulSet was scaled to 5 nodes, a new random Erlang cookie secret was created, which caused the two new nodes to fail to join the original 3 nodes.
The two new nodes did, however, join each other in a new cluster.
This is basically the definition of the split-brain problem.
What makes this so serious is that in K8s everything reports as good, with no problems. This hides the problem and takes a fair amount of troubleshooting to actually discover what is going on.
I’ve been observing many teams trying to agree on what constitutes a “healthy” node. It’s hilarious how little agreement there is. Regardless, peer discovery plugins use the health check mechanism provided by the backend, if any, and for Consul and etcd this is nothing more than periodic notifications that are only sent when the plugin is running, which in practice means when the node is running, which means it managed to rejoin the cluster (nodes that fail to rejoin eventually stop trying and fail).
Erlang cookie management and deployment tools are orthogonal to peer discovery backends and even node health checks. A node cannot know if it is in the “right” cluster. It also cannot know — with this backend anyway — how many nodes are supposed to be its peers at any given moment.
So your problem goes well beyond health check reporting.
In this specific example, the only reason why nodes 4 and 5 did not stop after unsuccessfully trying to join nodes 1 through 3 is because they managed to join each other. A CLI command cannot solve this problem: it would have reported success for all 5 nodes individually because they are clustered with a peer. As mentioned above, with this discovery method nodes cannot know how many peers are supposed to be there and what they are.
The fundamental problem here is Erlang cookie management, not health checks or health reporting of individual nodes to the discovery backend.
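For illustration only, one way to avoid the mismatched-cookie scenario on Kubernetes is to create the cookie secret once, before the StatefulSet exists, and have every pod consume that same secret on every scale operation. The names below (the rabbitmq-erlang-cookie secret, the RABBITMQ_ERLANG_COOKIE env var understood by the official image) are assumptions about a particular deployment, not something this plugin requires:

```sh
# Sketch, assuming kubectl access and the official rabbitmq image: generate the
# shared Erlang cookie exactly once and store it as a Secret.
kubectl create secret generic rabbitmq-erlang-cookie \
  --from-literal=cookie="$(openssl rand -hex 32)"

# Every pod in the StatefulSet then injects the same value, e.g. via the
# RABBITMQ_ERLANG_COOKIE environment variable or by mounting the secret at
# /var/lib/rabbitmq/.erlang.cookie. Scaling the StatefulSet later reuses the
# existing secret, so nodes 4 and 5 come up with the same cookie as nodes 1-3.
```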
rabbitmqctl node_health_check already does enough checks to be a good starting point that many teams managed to agree on.
The only check I can think of for backends such as this one is: list all peer members and assert that there are at least N of them. That sounds like a good starting point worth discussing on the list first.
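To make that concrete, here is a rough sketch of such a check. This is not an existing rabbitmqctl command; the minimum member count has to be supplied by the operator, since the node cannot know it with this backend:

```sh
#!/bin/sh
# Sketch: assert that this node sees at least MIN_MEMBERS running cluster
# members. MIN_MEMBERS is an operator-supplied assumption.
MIN_MEMBERS="${MIN_MEMBERS:-3}"

# rabbit_mnesia:cluster_nodes(running) is the list of running members as seen
# by this node; a node that only clustered with itself reports a length of 1.
RUNNING=$(rabbitmqctl eval 'length(rabbit_mnesia:cluster_nodes(running)).')

if [ "$RUNNING" -ge "$MIN_MEMBERS" ]; then
  exit 0
fi
echo "only $RUNNING of at least $MIN_MEMBERS expected members are running" >&2
exit 1
```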
I’m going to add a doc section in the cluster formation guide about this scenario.
I suspect it should be possible to use the classic config backend on Kubernetes just fine since it only requires a RabbitMQ config file entry. With that backend and DNS the number of nodes is known ahead of time and is assumed to be fixed. The downside of this is, well, that it is fixed and that node hostnames must be known ahead of time.
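For reference, a minimal sketch of what that config file entry looks like, assuming StatefulSet-style DNS names (the hostnames below are placeholders for whatever names the deployment actually uses):

```sh
# Classic config backend: the full, fixed member list lives in rabbitmq.conf.
cat >> /etc/rabbitmq/rabbitmq.conf <<'EOF'
cluster_formation.peer_discovery_backend = rabbit_peer_discovery_classic_config
cluster_formation.classic_config.nodes.1 = rabbit@rabbitmq-0.rabbitmq.default.svc.cluster.local
cluster_formation.classic_config.nodes.2 = rabbit@rabbitmq-1.rabbitmq.default.svc.cluster.local
cluster_formation.classic_config.nodes.3 = rabbit@rabbitmq-2.rabbitmq.default.svc.cluster.local
EOF
```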
Can you give me the link to the rabbitmq-users list? I'm not aware of where it's located. Some form of distributed consensus on leader election and cluster membership should be able to avoid the split-brain problem even in dynamically sized clusters; it's a pretty well-solved problem nowadays. Also, I would like to disagree that the fundamental problem is Erlang cookie management. Cluster membership health is the fundamental problem here. No matter how the external environment is set up, bad Erlang cookies or not, it's important to be able to report the correct status of cluster membership.
Hi Aaron - the mailing list is located here.
@michaelklishin This same problem happens even when all cookies are the same and there is no cookie problem. It happens when there is a network partition separating the 2 nodes from the 3 nodes. Two different clusters form, causing split brain.
@AceHack I find that hard to believe. Peer discovery is not involved in partition handling in any way, and what you claim to happen is not reported elsewhere, especially when all nodes have the same cookie. That includes cases that use rabbitmq-autocluster, which has been around for years and has had Kubernetes discovery support for over a year IIRC. I suspect the real issue here is general confusion about how RabbitMQ clusters operate.
There is no split brain problem when clusters are resized and the cookie is the same. Newly added nodes will join the existing cluster or fail, unless they can discover a different set of nodes to join. Removing nodes is even more trivial. We've seen all of those scenarios with different automation tools, in particular BOSH, and Kubernetes is in no way special.
What rabbitmq-autocluster added (and now this plugin) is initial cluster member discovery. Nothing else has changed.
Hi @AceHack
History:
The pivotalrabbitmq/rabbitmq-autocluster:3.7.XXX images were created only to make the examples easy. The intention is/was not to use them in production. I created the pivotalrabbitmq/rabbitmq-autocluster:3.7.XXX images because we weren't sure whether to ship the rabbitmq-peer-discovery.xxx plugins with the setup.

Current situation:
The rabbitmq-peer-discovery.xxx plugins are already in the RabbitMQ 3.7.0 setup.

Next steps:

Hope it is clear now.
@AceHack FYI we changed from the Pivotal Docker image to the official Docker image. See: https://github.com/rabbitmq/rabbitmq-peer-discovery-k8s/pull/13 You can find the new example here:
https://github.com/rabbitmq/rabbitmq-peer-discovery-k8s/blob/master/examples/k8s_statefulsets/rabbitmq.yaml
When reporting readiness and liveness to K8s in a StatefulSet, one should not report healthy until the RabbitMQ node has joined the other nodes in the cluster. The status currently reports healthy even when the Erlang cookie is wrong and the node was unable to join the cluster. This is a bug and should be addressed by using a different RabbitMQ command that also returns health based on cluster membership status.
I am not suggesting that RabbitMQ have any knowledge of K8s at all. I'm suggesting that RabbitMQ should be aware of its own cluster state no matter where it's running, and that an individual node should be able to report on its own cluster membership state even when running outside of any orchestrator.
Ideally, there would be a rabbitmqctl command that returns a non-zero exit code when the node fails to join any other members of a RabbitMQ cluster, completely unrelated to K8s or any other orchestrator. This command could then be used for the readinessProbe in K8s.
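As an illustration of what such a command or script could look like (nothing below exists as a single rabbitmqctl command today; it is only a sketch that an exec readinessProbe could run):

```sh
#!/bin/sh
# Hypothetical readiness check: succeed only once the local node is up and is
# clustered with at least one other member. Nothing here is K8s-specific.
set -e

# Fails with a non-zero exit code if the local node is not healthy at all.
rabbitmqctl node_health_check -q

# A node that failed to join any peer only sees itself, i.e. a count of 1.
RUNNING=$(rabbitmqctl eval 'length(rabbit_mnesia:cluster_nodes(running)).')
if [ "$RUNNING" -lt 2 ]; then
  echo "node is not clustered with any peer (running members: $RUNNING)" >&2
  exit 1
fi
```

As noted earlier in the thread, a check like this still cannot detect nodes 4 and 5 clustering only with each other, because the node has no notion of the intended member set.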
I'm running on K8s v1.8.2 and using the rabbitmq:3.7-alpine Docker image.
I started a 3-node cluster with a randomly generated Erlang cookie secret, then I scaled that cluster up to 5 nodes, but the two new nodes had a different randomly generated Erlang cookie secret.
You can see the nodes fail to join in the logs from the original 3 node cluster.
rabbitmqctl status
rabbitmqctl environment