rabbitmq / discussions

Please use the RabbitMQ mailing list for questions. Issues that are questions, discussions, or lack the details necessary to investigate them are moved to this repository.

pause_minority experiences race condition that takes down entire cluster #58

Closed: eherot closed this issue 4 years ago

eherot commented 4 years ago

RabbitMQ Version: 3.7.14

Environment: 5 Rabbit nodes running on Kubernetes v1.12.0 w/Flannel CNI

What I did: One RabbitMQ pod (rabbitmq-3) was killed (manually, by me).

What resulted: Kubernetes quickly (within a few seconds) rescheduled the pod on another node. Rather than the cluster noticing the loss of a single member and re-mirroring queues elsewhere, the following entries were logged:

Jan 31, 2020 @ 19:19:34.308 rabbitmq-1 * We saw DOWN from rabbit@rabbitmq-3.rabbitmq.default.svc.cluster.local
Jan 31, 2020 @ 19:19:34.308 rabbitmq-1 * We can still see rabbit@rabbitmq-2.rabbitmq.default.svc.cluster.local which can see rabbit@rabbitmq-3.rabbitmq.default.svc.cluster.local
Jan 31, 2020 @ 19:19:34.325 rabbitmq-1 * We saw DOWN from rabbit@rabbitmq-3.rabbitmq.default.svc.cluster.local
Jan 31, 2020 @ 19:19:34.325 rabbitmq-1 * We can still see rabbit@rabbitmq-0.rabbitmq.default.svc.cluster.local which can see rabbit@rabbitmq-3.rabbitmq.default.svc.cluster.local
Jan 31, 2020 @ 19:19:34.326 rabbitmq-2 * We saw DOWN from rabbit@rabbitmq-3.rabbitmq.default.svc.cluster.local
Jan 31, 2020 @ 19:19:34.326 rabbitmq-2 * We can still see rabbit@rabbitmq-1.rabbitmq.default.svc.cluster.local which can see rabbit@rabbitmq-3.rabbitmq.default.svc.cluster.local
Jan 31, 2020 @ 19:19:34.326 rabbitmq-1 * We saw DOWN from rabbit@rabbitmq-3.rabbitmq.default.svc.cluster.local
Jan 31, 2020 @ 19:19:34.326 rabbitmq-1 * We can still see rabbit@rabbitmq-4.rabbitmq.default.svc.cluster.local which can see rabbit@rabbitmq-3.rabbitmq.default.svc.cluster.local
Jan 31, 2020 @ 19:19:34.329 rabbitmq-4 * We saw DOWN from rabbit@rabbitmq-3.rabbitmq.default.svc.cluster.local
Jan 31, 2020 @ 19:19:34.329 rabbitmq-4 * We can still see rabbit@rabbitmq-1.rabbitmq.default.svc.cluster.local which can see rabbit@rabbitmq-3.rabbitmq.default.svc.cluster.local
Jan 31, 2020 @ 19:19:34.332 rabbitmq-0 * We saw DOWN from rabbit@rabbitmq-3.rabbitmq.default.svc.cluster.local
Jan 31, 2020 @ 19:19:34.332 rabbitmq-0 * We can still see rabbit@rabbitmq-1.rabbitmq.default.svc.cluster.local which can see rabbit@rabbitmq-3.rabbitmq.default.svc.cluster.local

(Although I omitted it from the entries above for clarity, each node also logged "pause_minority mode enabled" and "We will therefore pause until the *entire* cluster recovers".)

This indicates to me that there may be a race condition in the logic for determining which nodes are up and which are down (notice the inconsistent reports in the entries above).

This looks like a bug to me, but I'm also wondering if there are network configuration tweaks (e.g. net_ticktime) that might prevent this from occurring in the future.
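For context, the kind of tweak I have in mind is raising the Erlang kernel's net_ticktime via advanced.config, something along these lines (illustrative only; 120 seconds vs. the 60-second default is a guess, not a tested value):

```erlang
%% advanced.config (sketch): give peers longer before they are declared down
%% and a partition is reported. The default net_ticktime is 60 seconds.
[
  {kernel, [
    {net_ticktime, 120}
  ]}
].
```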

Note that the resulting chaos of nodes shutting down also caused every node to crash, taking down the management UI with it (I'll draft a second issue for that unless one already exists; I haven't looked yet).

michaelklishin commented 4 years ago

Thank you for your time.

Team RabbitMQ uses GitHub issues for specific actionable items engineers can work on. This assumes that we have a certain amount of information to work with. I'm afraid we cannot suggest much with the amount of information available here, and your environment is 9 patch releases behind within the 3.7.x series alone.

Getting all the details necessary to reproduce an issue, draw a conclusion, or even form a hypothesis about what's happening can take a fair amount of time. Our team is multiple orders of magnitude smaller than the RabbitMQ community. Please help others help you by providing a way to reproduce the behavior you're observing and by sharing as much relevant information as possible on the mailing list.

Feel free to edit out hostnames and other potentially sensitive information.

When/if we have a complete enough understanding of what's going on, a recommendation will be provided or a new issue with more context will be filed.

Thank you.

michaelklishin commented 4 years ago

According to the logs, the algorithm in question detected a partial partition on every node at almost the same time. This is a very tricky scenario to handle: some nodes see the peer as down while others see it as up, so there can be no consensus between them and no clear path to recovery. When a node is not just lost but replaced within a net tick window, that adds further confusion, as the newly joining node can behave in several different ways depending on whether it is revived with an existing data directory or starts as a new node.

We have no logs or environment details to suggest much more than "this can happen; partial partitions and distributed race conditions are hard to reason about". The race condition here is natural and unavoidable, since the remaining four nodes all run independently and detect peer unavailability independently.

In 4.0 a new consensus (Raft)-based schema data store will be integrated, and this partition handling strategy, as well as the others, will go away. Recovery will then be dictated by the Raft recovery mechanism, which will tolerate the failure of any two nodes. Partial partition detection and promotion to a "full" partition won't be performed, making this scenario impossible to our knowledge. You can try a different partition handling strategy for the time being.
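For example, switching the strategy is a single setting in rabbitmq.conf; a sketch is below (whether autoheal or pause_if_all_down is a better fit for your workload is your call, and the node names are simply copied from your logs):

```ini
# rabbitmq.conf sketch: pick ONE partition handling strategy

# heal automatically once the partition clears (the "losing" side restarts,
# which can discard unconfirmed messages)
cluster_partition_handling = autoheal

# or: pause any node that can reach none of a designated set of nodes
# cluster_partition_handling = pause_if_all_down
# cluster_partition_handling.pause_if_all_down.recover = autoheal
# cluster_partition_handling.pause_if_all_down.nodes.1 = rabbit@rabbitmq-0.rabbitmq.default.svc.cluster.local
# cluster_partition_handling.pause_if_all_down.nodes.2 = rabbit@rabbitmq-1.rabbitmq.default.svc.cluster.local
```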

michaelklishin commented 4 years ago

When a node pauses, it keeps the runtime running but otherwise stops. That is not the same thing as a "crash", and I don't see any reason why the node would go down due to the above events. The management UI, all plugins, and all TCP listeners will be stopped on a paused node, but the OS process will still be running.

Modern versions provide a set of health checks, one of which checks whether a node is running or not. Care should be taken when using it in aliveness checks, as automatically replacing a paused node would not help the pause_minority strategy make any progress.
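On 3.8.x, for example (a sketch; this is not an exhaustive list of the checks available):

```shell
# succeeds as long as the runtime (the OS process) is up and responding,
# including on a paused node
rabbitmq-diagnostics ping

# stricter: fails when the RabbitMQ application is not running on the node,
# which is the case for a node paused by pause_minority
rabbitmq-diagnostics check_running
```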

michaelklishin commented 4 years ago

Two reasonably recent issues that may be relevant: https://github.com/rabbitmq/rabbitmq-server/issues/2175 (uses pause_minority) and https://github.com/rabbitmq/rabbitmq-server/pull/2209 (uses autoheal; a fix will ship in 3.7.24 and 3.8.3).

eherot commented 4 years ago

@michaelklishin thank you for the extremely detailed response! I do have more logs available if you'd like to see them; however, after going over them in some detail, I don't think there is any reason to believe things did not happen exactly as you described.

One thing I am a bit curious about: might it help avoid race conditions like this in the future if I inserted a sleep into the startup sequence, so that the cluster could firmly establish which node had disappeared before Kubernetes had a chance to reschedule it? It seems like some of the confusion in this case resulted from the node already being back online by the time some of the other nodes had started to deal with the failure. (A rough sketch of what I mean is below.)
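Something like this in our StatefulSet is what I'm picturing (hypothetical and untested; the 60-second figure is a guess):

```yaml
# hypothetical: delay a rescheduled pod so the surviving nodes can agree that
# the old peer is gone before its replacement tries to rejoin
spec:
  template:
    spec:
      initContainers:
        - name: startup-delay
          image: busybox
          command: ["sh", "-c", "sleep 60"]
```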

In terms of potential short-term changes to RabbitMQ that might work around the issue: what about adding a flag or environment variable that tells it the expected size of the cluster, so that it at least has a definitive way of establishing whether it is on the minority side of a net split?