Closed: rfancn closed this issue 6 years ago.
Thank you for your time.
Team RabbitMQ uses GitHub issues for specific actionable items engineers can work on. This assumes two things:
We get at least a dozen questions through various venues every single day, often quite light on details. At that rate, GitHub issues can very quickly turn into something impossible to navigate and make sense of, even for our team. Because of that, questions, investigations, root cause analysis, and discussions of potential features are all considered mailing list material by our team. Please post this to rabbitmq-users.
Getting all the details necessary to reproduce an issue, make a conclusion or even form a hypothesis about what's happening can take a fair amount of time. Our team is multiple orders of magnitude smaller than the RabbitMQ community. Please help others help you by providing a way to reproduce the behavior you're observing, or at least sharing as much relevant information as possible on the list:
rabbitmqctl status (and, if possible, rabbitmqctl environment output)
Feel free to edit out hostnames and other potentially sensitive information.
When/if we have enough details and evidence we'd be happy to file a new issue.
Thank you.
The error is formatted as a list of ASCII numbers, but it says:

Node 'rabbit@10.244.2.37' thinks it's clustered with node 'rabbit@10.244.1.40', but 'rabbit@10.244.1.40' disagrees

This generally means that node A tried to cluster with B and that failed. Specific cases which return this error include the following: it can happen when you cluster A and B, then stop them, reset B, and start them both back up (or similar).
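The scenario above can be pictured with a toy model: each node keeps its own view of the cluster membership, and resetting B clears B's view while A's view is unchanged. This is only a sketch of the failure mode, not RabbitMQ's actual Mnesia schema logic; the node names are placeholders.

```python
# Toy model of the mutual cluster-membership check. A sketch only;
# RabbitMQ's real check lives in its Mnesia/cluster code.

def check_consistent(a_name, a_view, b_name, b_view):
    """Return an error string if the two membership views disagree, else None."""
    if b_name in a_view and a_name not in b_view:
        return (f"Node '{a_name}' thinks it's clustered with node '{b_name}', "
                f"but '{b_name}' disagrees")
    return None

# Cluster A and B: both record each other as members.
a_view = {"rabbit@A", "rabbit@B"}
b_view = {"rabbit@A", "rabbit@B"}
assert check_consistent("rabbit@A", a_view, "rabbit@B", b_view) is None

# Stop both, reset B (which wipes B's view), start both again.
b_view = {"rabbit@B"}
print(check_consistent("rabbit@A", a_view, "rabbit@B", b_view))
```

Running this prints the same kind of disagreement message as in the logs above.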
It uses the latest 3.6.14 version and all 3 pods are replicas of the same image, so this should not be caused by a version mismatch. I think the problem comes from the cleanup logic. When the bad node (10.244.1.40) exhausted system resources during high load, it failed or its communication with the other nodes slowed down; somehow the cleanup logic was triggered, which may have executed a forget-peer-node action on the bad node before it restarted. At that time, the other nodes still considered the bad node a cluster member (a disc node). This looks like the inconsistent-cluster example in the official RabbitMQ clustering documentation. I will disable the AUTOCLUSTER_CLEANUP feature and test again tomorrow to verify that. Anyway, thanks for your comments.
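The suspected race can be sketched as a timing model: the surviving nodes only forget a dead peer on a periodic cleanup tick, so whether a restarted pod can rejoin depends on whether it comes back before or after that tick. The 60-second interval and the timestamps are illustrative assumptions, not measurements.

```python
# Sketch of the race between a pod restart and the periodic cleanup.
# CLEANUP_INTERVAL and the timings below are illustrative assumptions.

CLEANUP_INTERVAL = 60  # seconds between cleanup checks

def views_agree(restart_at, last_cleanup_at):
    """The surviving nodes forget the dead peer only at the next cleanup
    tick. If the reset pod rejoins before that tick, membership views
    still disagree and the rejoin fails."""
    next_cleanup = last_cleanup_at + CLEANUP_INTERVAL
    return restart_at >= next_cleanup

# Pod crashes at t=0 (just after a cleanup tick) and restarts at t=20:
print(views_agree(restart_at=20, last_cleanup_at=0))   # rejoin fails
# Restarting after the next tick (t=70) finds a cleaned-up cluster:
print(views_agree(restart_at=70, last_cleanup_at=0))   # rejoin succeeds
```

Under this model, shrinking the cleanup interval narrows the window in which a restarted pod finds a stale membership view.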
@rfancn Can I ask if you are using one of the examples or another script?
@Gsantomaggio I use the statefulset example
I tested again and found the following logs, which may support my conclusion.
=ERROR REPORT==== 16-Nov-2017::05:32:24 ===
** Node 'rabbit@10.244.1.48' not responding **
** Removing (timedout) connection **
=INFO REPORT==== 16-Nov-2017::05:32:24 ===
rabbit on node 'rabbit@10.244.1.48' down
=INFO REPORT==== 16-Nov-2017::05:32:24 ===
node 'rabbit@10.244.1.48' down: net_tick_timeout
=INFO REPORT==== 16-Nov-2017::05:32:36 ===
autocluster: (cleanup) checking cluster
=INFO REPORT==== 16-Nov-2017::05:32:36 ===
autocluster: (cleanup) Checking for partitioned nodes.
=INFO REPORT==== 16-Nov-2017::05:32:36 ===
autocluster: (cleanup) Unreachable RabbitMQ nodes ['rabbit@10.244.1.48']
=ERROR REPORT==== 16-Nov-2017::05:32:39 ===
autocluster: (cleanup) removing 'rabbit@10.244.1.48' from cluster
=INFO REPORT==== 16-Nov-2017::05:32:36 ===
autocluster: GET https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/wxgigo/endpoint
...
I set AUTOCLUSTER_FAILURE to "stop" so that when autocluster fails to join the cluster, the k8s liveness probe fails and in turn restarts the pod. Hopefully the node can then join the cluster, because the cleanup logic (running at a 60s interval) will already have removed the crashed node from the cluster info on the other nodes; if not, the node will keep restarting again and again until it can join the cluster.
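The setup described above can be sketched as container env in the statefulset example. The variable names follow the rabbitmq-autocluster plugin's documented settings; the values here are assumptions matching this description, not a verified copy of the example.

```yaml
# Sketch of the relevant container env (values are assumptions):
env:
- name: AUTOCLUSTER_TYPE
  value: "k8s"          # use the Kubernetes backend
- name: AUTOCLUSTER_CLEANUP
  value: "true"         # periodically remove unreachable nodes
- name: CLEANUP_INTERVAL
  value: "60"           # seconds between cleanup checks
- name: AUTOCLUSTER_FAILURE
  value: "stop"         # stop the node if it cannot join the cluster
```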
In the load testing, I simulated sending tens of thousands of messages to the RabbitMQ server and triggered the crash above. But when the RabbitMQ pod restarted, it hung the node it was running on and reported the errors below:
=INFO REPORT==== 16-Nov-2017::05:32:36 ===
autocluster: GET https://kubernetes.default.svc.cluster.local:443/api/v1/namespaces/wxgigo/endpoints/rabbitmq
=INFO REPORT==== 16-Nov-2017::05:32:39 ===
autocluster: Response: [{error,
{failed_connect,
[{to_address,
{"kubernetes.default.svc.cluster.local",
443}},
{inet,[inet],econnrefused}]}}]
=INFO REPORT==== 16-Nov-2017::05:32:39 ===
autocluster: HTTP Error {failed_connect,
[{to_address,
{"kubernetes.default.svc.cluster.local",443}},
{inet,[inet],econnrefused}]}
Error in log handler
====================
Event: {info_msg,<0.136.0>,
{<0.139.0>,"autocluster: Failed to get nodes from k8s - ~s~n",
[{failed_connect,
[{to_address,{"kubernetes.default.svc.cluster.local",443}},
{inet,[inet],econnrefused}]}]}}
Error: badarg
Stack trace: [{io_lib,format,
["autocluster: Failed to get nodes from k8s - ~s~n",
[{failed_connect,
[{to_address,
{"kubernetes.default.svc.cluster.local",443}},
{inet,[inet],econnrefused}]}]],
[{file,"io_lib.erl"},{line,168}]},
{rabbit_error_logger,publish1,4,
[{file,"src/rabbit_error_logger.erl"},{line,108}]},
{rabbit_error_logger,handle_event0,2,
[{file,"src/rabbit_error_logger.erl"},{line,80}]},
{rabbit_error_logger_file_h,safe_handle_event,3,
[{file,"src/rabbit_error_logger_file_h.erl"},{line,121}]},
{gen_event,server_update,4,[{file,"gen_event.erl"},{line,533}]},
{gen_event,server_notify,4,[{file,"gen_event.erl"},{line,515}]},
{gen_event,handle_msg,5,[{file,"gen_event.erl"},{line,256}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]
=INFO REPORT==== 16-Nov-2017::05:32:39 ===
ERROR: "autocluster: Failed to get nodes from k8s - ~s~n" - [{failed_connect,
[{to_address,
                                                  {"kubernetes.default.svc.cluster.local",
443}},
{inet,
[inet],
econnrefused}]}]
=INFO REPORT==== 16-Nov-2017::05:32:39 ===
autocluster: (cleanup) autocluster_k8s returned error {failed_connect,
[{to_address,
{"kubernetes.default.svc.cluster.local",
443}},
{inet,
[inet],
econnrefused}]}
=INFO REPORT==== 16-Nov-2017::06:09:36 ===
ERROR: "autocluster: Failed to get nodes from k8s - ~s~n" - [{failed_connect,
[{to_address,
{"kubernetes.default.svc.cluster.local",
443}},
{inet,
[inet],
nxdomain}]}]
=ERROR REPORT==== 16-Nov-2017::06:09:36 ===
autocluster: Step find_best_node_to_join failed, halting startup. Failure reason: Failed to fetch list of nodes from the backend: {failed_connect,
[{to_address,
{"kubernetes.default.svc.cluster.local",
443}},
{inet,[inet],nxdomain}]}.
=CRASH REPORT==== 16-Nov-2017::06:09:36 ===
crasher:
initial call: application_master:init/4
pid: <0.4429.0>
registered_name: []
exception exit: {bad_return,
{{rabbit,start,[normal,[]]},
{'EXIT',
{error,
"Failed to fetch list of nodes from the backend: {failed_connect,\n [{to_address,\n {\"kubernetes.default.svc.cluster.local\",\n 443}},\n {inet,[inet],nxdomain}]}"}}}}
in function application_master:init/4 (application_master.erl, line 134)
ancestors: [<0.4428.0>]
messages: [{'EXIT',<0.4430.0>,normal}]
links: [<0.4428.0>,<0.31.0>]
dictionary: []
trap_exit: true
status: running
heap_size: 2586
stack_size: 27
reductions: 98
neighbours:
=INFO REPORT==== 16-Nov-2017::06:09:36 ===
application: rabbit
exited: {bad_return,
{{rabbit,start,[normal,[]]},
{'EXIT',
{error,
"Failed to fetch list of nodes from the backend: {failed_connect,\n [{to_address,\n {\"kubernetes.default.svc.cluster.local\",\n 443}},\n {inet,[inet],nxdomain}]}"}}}}
type: transient
<0.4314.0>
{badarg,[{ets,lookup,[,{application_controller,get_env,2,[{file,"application_controller.erl"},{line,332}]},{rabbit_log,catlevel,1,[{file,"src/rabbit_log.erl"},{line,68}]},{rabbit_log,log,4,[{file,"src/rabbit_log.erl"},{line,37}]}]}
{"Kernel pid terminated",application_controller,"{application_start_failure,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{error,\"Failed to fetch list of nodes from the backend: {failed_connect,\n [{to_address,\n {\\"kubernetes.default.svc.cluster.local\\",\n 443}},\n {inet,[inet],nxdomain}]}\"}}}}}"}
Crash dump is being written to: erl_crash.dump...
Crash dump is being written to: erl_crash.dump...
From the above logs, when autocluster fails to get the k8s rabbitmq endpoints for some reason, i.e., fails to fetch the list of nodes from the backend, it crashes the rabbit application. I assume this is the expected behavior with AUTOCLUSTER_FAILURE set to "stop", but writing the crash dump takes a long time to finish and seems to consume a lot of system resources while dumping. So is there any consideration of adding a switch to control whether a dump is generated? The preceding debug messages are enough to identify the issue, so no dump file is needed in this case.
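As far as I know there is no autocluster switch for this, but the Erlang VM itself can be told not to write a crash dump: per the erl documentation, setting ERL_CRASH_DUMP_SECONDS to 0 means the runtime does not attempt to write the dump file at all. A sketch for the statefulset container env (the placement here is an assumption):

```yaml
# Sketch: suppress Erlang VM crash dumps via the VM's own env var
# (see erl(1)); this is an Erlang setting, not an autocluster one.
env:
- name: ERL_CRASH_DUMP_SECONDS
  value: "0"            # 0 = do not write a crash dump at all
```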
While I was doing load testing against a 3-node RabbitMQ cluster in a Kubernetes environment, one RabbitMQ pod (10.244.1.40) got restarted and failed to rejoin the cluster.
Below are the logs it reported, which complained that "Node 'rabbit@10.244.2.37' thinks it's clustered with node 'rabbit@10.244.1.40', but 'rabbit@10.244.1.40' disagrees".
Later, I clustered 10.244.1.40 with 10.244.2.37 successfully. So what could be the possible reason? Is it possible the RabbitMQ cluster had still not kicked the bad node out of its cluster info (via CLEANUP) when the node tried to rejoin? Would reducing the value of CLEANUP_INTERVAL help?