Closed: rfancn closed this issue 6 years ago.
Thank you for your time.
Team RabbitMQ uses GitHub issues for specific actionable items engineers can work on. This assumes two things:
We get at least a dozen questions through various venues every single day, often quite light on details. At that rate, GitHub issues can very quickly turn into something impossible to navigate and make sense of, even for our team. Because of that, questions, investigations, root cause analysis, and discussions of potential features are all considered mailing list material by our team. Please post this to rabbitmq-users.
Getting all the details necessary to reproduce an issue, make a conclusion or even form a hypothesis about what's happening can take a fair amount of time. Our team is multiple orders of magnitude smaller than the RabbitMQ community. Please help others help you by providing a way to reproduce the behavior you're observing, or at least sharing as much relevant information as possible on the list:
rabbitmqctl status (and, if possible, rabbitmqctl environment output)
Feel free to edit out hostnames and other potentially sensitive information.
When/if we have enough details and evidence we'd be happy to file a new issue.
Thank you.
The error is formatted as a list of ASCII numbers, but it says:

Node 'rabbit@10.244.2.37' thinks it's clustered with node 'rabbit@10.244.1.40', but 'rabbit@10.244.1.40' disagrees

This generally means that node A tried to cluster with B and that failed. Specific cases which return this error include the following: it can happen when you cluster A and B, then stop them, reset B, and start them both back up (or similar).
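The scenario above can be pictured with a toy model: each node keeps its own view of the cluster membership, and resetting B clears B's view while A's view is unchanged. This is only a sketch of the failure mode, not RabbitMQ's actual Mnesia schema logic; the node names are placeholders.

```python
# Toy model of the mutual cluster-membership check. A sketch only;
# RabbitMQ's real check lives in its Mnesia/cluster code.

def check_consistent(a_name, a_view, b_name, b_view):
    """Return an error string if the two membership views disagree, else None."""
    if b_name in a_view and a_name not in b_view:
        return (f"Node '{a_name}' thinks it's clustered with node '{b_name}', "
                f"but '{b_name}' disagrees")
    return None

# Cluster A and B: both record each other as members.
a_view = {"rabbit@A", "rabbit@B"}
b_view = {"rabbit@A", "rabbit@B"}
assert check_consistent("rabbit@A", a_view, "rabbit@B", b_view) is None

# Stop both, reset B (which wipes B's view), start both again.
b_view = {"rabbit@B"}
print(check_consistent("rabbit@A", a_view, "rabbit@B", b_view))
```

Running this prints the same kind of disagreement message as in the logs above.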
It uses the latest 3.6.14 version and all 3 pods are replicas of the same image, so this should not be caused by a version mismatch. I think the problem comes from the cleanup logic. When the bad node (10.244.1.40) exhausted system resources during high load, it failed or its communication with the other nodes slowed down; somehow the cleanup logic was triggered, which may have executed a forget-peer-node action on the bad node before it restarted. At that time, the other nodes still considered the bad node a cluster member (a disc node). This looks like the inconsistent-cluster example in the official RabbitMQ clustering documentation. I will disable the AUTOCLUSTER_CLEANUP feature and test again tomorrow to verify that. Anyway, thanks for your comments.
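The suspected race can be sketched as a timing model: the surviving nodes only forget a dead peer on a periodic cleanup tick, so whether a restarted pod can rejoin depends on whether it comes back before or after that tick. The 60-second interval and the timestamps are illustrative assumptions, not measurements.

```python
# Sketch of the race between a pod restart and the periodic cleanup.
# CLEANUP_INTERVAL and the timings below are illustrative assumptions.

CLEANUP_INTERVAL = 60  # seconds between cleanup checks

def views_agree(restart_at, last_cleanup_at):
    """The surviving nodes forget the dead peer only at the next cleanup
    tick. If the reset pod rejoins before that tick, membership views
    still disagree and the rejoin fails."""
    next_cleanup = last_cleanup_at + CLEANUP_INTERVAL
    return restart_at >= next_cleanup

# Pod crashes at t=0 (just after a cleanup tick) and restarts at t=20:
print(views_agree(restart_at=20, last_cleanup_at=0))   # rejoin fails
# Restarting after the next tick (t=70) finds a cleaned-up cluster:
print(views_agree(restart_at=70, last_cleanup_at=0))   # rejoin succeeds
```

Under this model, shrinking the cleanup interval narrows the window in which a restarted pod finds a stale membership view.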
@rfancn Can I ask if you are using one of the examples or another script?
@Gsantomaggio I use the statefulset example
I tested again and found the following logs, which may support my conclusion.
=ERROR REPORT==== 16-Nov-2017::05:32:24 ===
** Node 'rabbit@10.244.1.48' not responding **
** Removing (timedout) connection **
=INFO REPORT==== 16-Nov-2017::05:32:24 ===
rabbit on node 'rabbit@10.244.1.48' down
=INFO REPORT==== 16-Nov-2017::05:32:24 ===
node 'rabbit@10.244.1.48' down: net_tick_timeout
=INFO REPORT==== 16-Nov-2017::05:32:36 ===
autocluster: (cleanup) checking cluster
=INFO REPORT==== 16-Nov-2017::05:32:36 ===
autocluster: (cleanup) Checking for partitioned nodes.
=INFO REPORT==== 16-Nov-2017::05:32:36 ===
autocluster: (cleanup) Unreachable RabbitMQ nodes ['rabbit@10.244.1.48']
=ERROR REPORT==== 16-Nov-2017::05:32:39 ===
autocluster: (cleanup) removing 'rabbit@10.244.1.48' from cluster
=INFO REPORT==== 16-Nov-2017::05:32:36 ===
autocluster: GET https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/wxgigo/endpoint
...
I set AUTOCLUSTER_FAILURE to "stop" so that when autocluster fails to join the cluster, the k8s liveness probe fails and in turn restarts the pod. Hopefully the node can then join the cluster, because the cleanup logic (running at a 60s interval) will already have removed the crashed node from the cluster info on the other nodes; if not, the node will keep restarting again and again until it can join the cluster.
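The setup described above can be sketched as container env in the statefulset example. The variable names follow the rabbitmq-autocluster plugin's documented settings; the values here are assumptions matching this description, not a verified copy of the example.

```yaml
# Sketch of the relevant container env (values are assumptions):
env:
- name: AUTOCLUSTER_TYPE
  value: "k8s"          # use the Kubernetes backend
- name: AUTOCLUSTER_CLEANUP
  value: "true"         # periodically remove unreachable nodes
- name: CLEANUP_INTERVAL
  value: "60"           # seconds between cleanup checks
- name: AUTOCLUSTER_FAILURE
  value: "stop"         # stop the node if it cannot join the cluster
```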
In the load testing, I simulated sending tens of thousands of messages to the RabbitMQ server and triggered the crash above. But when the RabbitMQ pod restarted, it hung the node it was running on and reported the errors below:
=INFO REPORT==== 16-Nov-2017::05:32:36 ===
autocluster: GET https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/wxgigo/endpoints/rabbitmq
=INFO REPORT==== 16-Nov-2017::05:32:39 ===
autocluster: Response: [{error,
{failed_connect,
[{to_address,
{"kubernetes.default.svc.cluster.local",
443}},
{inet,[inet],econnrefused}]}}]
=INFO REPORT==== 16-Nov-2017::05:32:39 ===
autocluster: HTTP Error {failed_connect,
[{to_address,
{"kubernetes.default.svc.cluster.local",443}},
{inet,[inet],econnrefused}]}
Error in log handler
====================
Event: {info_msg,<0.136.0>,
{<0.139.0>,"autocluster: Failed to get nodes from k8s - ~s~n",
[{failed_connect,
[{to_address,{"kubernetes.default.svc.cluster.local",443}},
{inet,[inet],econnrefused}]}]}}
Error: badarg
Stack trace: [{io_lib,format,
["autocluster: Failed to get nodes from k8s - ~s~n",
[{failed_connect,
[{to_address,
{"kubernetes.default.svc.cluster.local",443}},
{inet,[inet],econnrefused}]}]],
[{file,"io_lib.erl"},{line,168}]},
{rabbit_error_logger,publish1,4,
[{file,"src/rabbit_error_logger.erl"},{line,108}]},
{rabbit_error_logger,handle_event0,2,
[{file,"src/rabbit_error_logger.erl"},{line,80}]},
{rabbit_error_logger_file_h,safe_handle_event,3,
[{file,"src/rabbit_error_logger_file_h.erl"},{line,121}]},
{gen_event,server_update,4,[{file,"gen_event.erl"},{line,533}]},
{gen_event,server_notify,4,[{file,"gen_event.erl"},{line,515}]},
{gen_event,handle_msg,5,[{file,"gen_event.erl"},{line,256}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]
=INFO REPORT==== 16-Nov-2017::05:32:39 ===
ERROR: "autocluster: Failed to get nodes from k8s - ~s~n" - [{failed_connect,
[{to_address,
                                                  {"kubernetes.default.svc.cluster.local",
443}},
{inet,
[inet],
econnrefused}]}]
=INFO REPORT==== 16-Nov-2017::05:32:39 ===
autocluster: (cleanup) autocluster_k8s returned error {failed_connect,
[{to_address,
{"kubernetes.default.svc.cluster.local",
443}},
{inet,
[inet],
econnrefused}]}
=INFO REPORT==== 16-Nov-2017::06:09:36 ===
ERROR: "autocluster: Failed to get nodes from k8s - ~s~n" - [{failed_connect,
[{to_address,
{"kubernetes.default.svc.cluster.local",
443}},
{inet,
[inet],
nxdomain}]}]
=ERROR REPORT==== 16-Nov-2017::06:09:36 ===
autocluster: Step find_best_node_to_join failed, halting startup. Failure reason: Failed to fetch list of nodes from the backend: {failed_connect,
[{to_address,
{"kubernetes.default.svc.cluster.local",
443}},
{inet,[inet],nxdomain}]}.
=CRASH REPORT==== 16-Nov-2017::06:09:36 ===
crasher:
initial call: application_master:init/4
pid: <0.4429.0>
registered_name: []
exception exit: {bad_return,
{{rabbit,start,[normal,[]]},
{'EXIT',
{error,
"Failed to fetch list of nodes from the backend: {failed_connect,\n [{to_address,\n {\"kubernetes.default.svc.cluster.local\",\n 443}},\n {inet,[inet],nxdomain}]}"}}}}
in function application_master:init/4 (application_master.erl, line 134)
ancestors: [<0.4428.0>]
messages: [{'EXIT',<0.4430.0>,normal}]
links: [<0.4428.0>,<0.31.0>]
dictionary: []
trap_exit: true
status: running
heap_size: 2586
stack_size: 27
reductions: 98
neighbours:
=INFO REPORT==== 16-Nov-2017::06:09:36 ===
application: rabbit
exited: {bad_return,
{{rabbit,start,[normal,[]]},
{'EXIT',
{error,
"Failed to fetch list of nodes from the backend: {failed_connect,\n [{to_address,\n {\"kubernetes.default.svc.cluster.local\",\n 443}},\n {inet,[inet],nxdomain}]}"}}}}
type: transient
<0.4314.0>
{badarg,[{ets,lookup,[,{application_controller,get_env,2,[{file,"application_controller.erl"},{line,332}]},{rabbit_log,catlevel,1,[{file,"src/rabbit_log.erl"},{line,68}]},{rabbit_log,log,4,[{file,"src/rabbit_log.erl"},{line,37}]}]}
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{error,\"Failed to fetch list of nodes from the backend: {failed_connect,\n [{to_address,\n {\\"kubernetes.default.svc.cluster.local\\",\n 443}},\n {inet,[inet],nxdomain}]}\"}}}}}"}
Crash dump is being written to: erl_crash.dump...
Crash dump is being written to: erl_crash.dump...
From the above logs, when autocluster fails to get the k8s rabbitmq endpoints for some reason, i.e., fails to fetch the list of nodes from the backend, it crashes the rabbit application. I assume this is the expected behavior with AUTOCLUSTER_FAILURE set to "stop", but writing the crash dump takes a long time to finish and seems to consume a lot of system resources while dumping. So is there any consideration of adding a switch to control whether a dump is generated? The preceding debug messages are enough to identify the issue, so no dump file is needed in this case.
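As far as I know there is no autocluster switch for this, but the Erlang VM itself can be told not to write a crash dump: per the erl documentation, setting ERL_CRASH_DUMP_SECONDS to 0 means the runtime does not attempt to write the dump file at all. A sketch for the statefulset container env (the placement here is an assumption):

```yaml
# Sketch: suppress Erlang VM crash dumps via the VM's own env var
# (see erl(1)); this is an Erlang setting, not an autocluster one.
env:
- name: ERL_CRASH_DUMP_SECONDS
  value: "0"            # 0 = do not write a crash dump at all
```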
While I was doing load testing against a 3-node RabbitMQ cluster in a Kubernetes environment, one RabbitMQ pod (10.244.1.40) got restarted and failed to rejoin the cluster.
Below are the logs it reported, which complained that "Node 'rabbit@10.244.2.37' thinks it's clustered with node 'rabbit@10.244.1.40', but 'rabbit@10.244.1.40' disagrees".
Later, I clustered 10.244.1.40 with 10.244.2.37 successfully. So what could be the possible reason? Is it possible the RabbitMQ cluster had still not kicked the bad node out of its cluster info (via CLEANUP) when the node tried to rejoin? Would reducing the value of CLEANUP_INTERVAL help?