rabbitmq / rabbitmq-autocluster

RabbitMQ peer discovery and cluster formation plugin, supports RabbitMQ 3.6.x
BSD 3-Clause "New" or "Revised" License

rabbitmq-autocluster failing #66

Closed · srflaxu40 closed this 7 years ago

srflaxu40 commented 7 years ago

Have been at this for a few days:

```
/ # rabbitmqctl join_cluster rabbit@192.168.82.133
Clustering node 'rabbit@192.168.8.6' with 'rabbit@192.168.82.133'
Error: {inconsistent_cluster,"Node 'rabbit@192.168.82.133' thinks it's clustered with node 'rabbit@192.168.8.6', but 'rabbit@192.168.8.6' disagrees"}
```

No matter what I do, I cannot get the second node to join properly, nor have it join as a "disc" node. It continually joins as a RAM node, even when I perform the join manually.
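
For reference, a minimal sketch of the manual join sequence that should produce a disc member, assuming the first node (rabbit@192.168.82.133) is healthy; disc is the default node type unless --ram is passed, and an existing RAM member can be converted with change_cluster_node_type:

```
# on the joining node (rabbit@192.168.8.6)
rabbitmqctl stop_app
rabbitmqctl join_cluster rabbit@192.168.82.133   # disc is the default node type
rabbitmqctl start_app

# if the node already joined as a RAM node, convert it
rabbitmqctl stop_app
rabbitmqctl change_cluster_node_type disc
rabbitmqctl start_app
```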

Here is the environment:

```
/ # env | grep RABBITMQ
RABBITMQ_USE_LONGNAME=true
RABBITMQ_DISK_FREE_LIMIT="8GiB"
RABBITMQ_PORT_15672_TCP=tcp://10.110.213.110:15672
RABBITMQ_PORT_25672_TCP=tcp://10.110.213.110:25672
RABBITMQ_LOGS=-
RABBITMQ_MANAGER_PORT_NUMBER=15672
RABBITMQ_NODENAME=rabbit@192.168.8.6
RABBITMQ_SERVICE_PORT_HTTP=15672
RABBITMQ_PLUGINS_EXPAND_DIR=/var/lib/rabbitmq/plugins
RABBITMQ_PASSWORD=abc123
RABBITMQ_VERSION=3.6.14
RABBITMQ_PLUGINS_DIR=/usr/lib/rabbitmq/plugins
RABBITMQ_SERVICE_HOST=10.110.213.110
RABBITMQ_SASL_LOGS=-
RABBITMQ_NODE_TYPE=stats
RABBITMQ_BASE=/rabbitmq
RABBITMQ_PORT_5672_TCP_ADDR=10.110.213.110
RABBITMQ_PORT_4369_TCP_ADDR=10.110.213.110
RABBITMQ_SERVICE_PORT_EPMD=4369
RABBITMQ_SERVICE_PORT=15672
RABBITMQ_PORT=tcp://10.110.213.110:15672
RABBITMQ_PORT_5672_TCP_PORT=5672
RABBITMQ_PORT_5672_TCP_PROTO=tcp
RABBITMQ_VHOST=/
RABBITMQ_PORT_4369_TCP_PORT=4369
RABBITMQ_PORT_4369_TCP_PROTO=tcp
RABBITMQ_NODE_PORT_NUMBER=5672
RABBITMQ_PID_FILE=/var/lib/rabbitmq/rabbitmq.pid
RABBITMQ_SERVER_ERL_ARGS=+K true +A128 +P 1048576 -kernel inet_default_connect_options [{nodelay,true}]
RABBITMQ_PORT_15672_TCP_ADDR=10.110.213.110
RABBITMQ_SERVICE_PORT_AMQP=5672
RABBITMQ_PORT_25672_TCP_ADDR=10.110.213.110
RABBITMQ_MNESIA_DIR=/var/lib/rabbitmq/mnesia
RABBITMQ_PORT_5672_TCP=tcp://10.110.213.110:5672
RABBITMQ_PORT_15672_TCP_PORT=15672
RABBITMQ_USERNAME=user
RABBITMQ_HOME=/rabbitmq
RABBITMQ_PORT_4369_TCP=tcp://10.110.213.110:4369
RABBITMQ_PORT_15672_TCP_PROTO=tcp
RABBITMQ_PORT_25672_TCP_PORT=25672
RABBITMQ_PORT_25672_TCP_PROTO=tcp
RABBITMQ_DIST_PORT=25672
RABBITMQ_SERVICE_PORT_DIST=25672
```

michaelklishin commented 7 years ago

Thank you for your time.

Team RabbitMQ uses GitHub issues for specific actionable items engineers can work on. This assumes two things:

  1. GitHub issues are not used for questions, investigations, root cause analysis, discussions of potential issues, etc (as defined by this team)
  2. We have a certain amount of information to work with

We get at least a dozen questions through various venues every single day, often quite light on details. At that rate GitHub issues can very quickly turn into something impossible to navigate and make sense of, even for our team. Because of that, questions, investigations, root cause analysis, and discussions of potential features are all considered mailing list material by our team. Please post this to rabbitmq-users.

Getting all the details necessary to reproduce an issue, reach a conclusion, or even form a hypothesis about what's happening can take a fair amount of time. Our team is multiple orders of magnitude smaller than the RabbitMQ community. Please help others help you by providing a way to reproduce the behavior you're observing, or at least by sharing as much relevant information as possible on the list.

Feel free to edit out hostnames and other potentially sensitive information.

When/if we have enough details and evidence we'd be happy to file a new issue.

Thank you.

michaelklishin commented 7 years ago

Node 'rabbit@192.168.82.133' thinks it's clustered with node 'rabbit@192.168.8.6', but 'rabbit@192.168.8.6' disagrees

appears in this repository's issues as well as many other places. It means one node was reset and another one wasn't, so A thinks it is not already clustered with B and thus can join it, but B disagrees. Resetting B will help. How exactly you end up in this situation with various provisioning tools, I cannot say.
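
A minimal sketch of that fix, assuming rabbit@192.168.82.133 is the node that still thinks it is clustered (node B in the description above):

```
# on rabbit@192.168.82.133 (B): drop its stale view of the cluster
rabbitmqctl stop_app
rabbitmqctl reset
rabbitmqctl start_app

# then retry join_cluster from rabbit@192.168.8.6 (A) as before
```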

srflaxu40 commented 7 years ago

Hey, thanks @michaelklishin - sorry for not including more details. I am also asking in the #autocluster channel in the RabbitMQ Slack. It seems that even trying to join manually still fails with the same error. Here is some more debugging I have done:

```
/ # rabbitmqctl reset
Resetting node 'rabbit@192.168.82.134'
Error: Mnesia is still running on node 'rabbit@192.168.82.134'. Please stop the node with rabbitmqctl stop_app first.
/ # rabbitmqctl stop_app
Stopping rabbit application on node 'rabbit@192.168.82.134'
/ # rabbitmqctl reset
Resetting node 'rabbit@192.168.82.134'
```

It appears the solution is that I have to forget the cluster node remotely (not from the same node):

```
/ # rabbitmqctl join_cluster rabbit@192.168.8.7
Clustering node 'rabbit@192.168.82.134' with 'rabbit@192.168.8.7'
Error: {inconsistent_cluster,"Node 'rabbit@192.168.8.7' thinks it's clustered with node 'rabbit@192.168.82.134', but 'rabbit@192.168.82.134' disagrees"}
/ # rabbitmqctl join_cluster rabbit@192.168.8.7
Clustering node 'rabbit@192.168.82.134' with 'rabbit@192.168.8.7'
```
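
The "remote forget" step itself is not shown above; a minimal sketch of what it would look like, assuming it is run from the node that still considers the other a member (here rabbit@192.168.8.7):

```
# run on rabbit@192.168.8.7, while the rabbit application on
# rabbit@192.168.82.134 is stopped (rabbitmqctl stop_app)
rabbitmqctl forget_cluster_node rabbit@192.168.82.134
```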

However, when I run the suggested cluster status:

```
/ # rabbitmqctl cluster_status
Cluster status of node 'rabbit@192.168.8.7'
[{nodes,[{disc,['rabbit@192.168.8.7','rabbit@192.168.82.134']}]},
 {running_nodes,['rabbit@192.168.8.7']},
 {cluster_name,<<"rabbit@rabbitmq-statefulset-development-0.rabbitmq.default.svc.cluster.local">>},
 {partitions,[]},
 {alarms,[{'rabbit@192.168.8.7',[]}]}]
```

I see two disc nodes but only one running. So I start the app on the down node:

```
~/ops-tools/build-files/rabbitmq$ ./test_status.sh
Cluster status of node 'rabbit@192.168.8.7'
[{nodes,[{disc,['rabbit@192.168.8.7','rabbit@192.168.82.134']}]},
 {running_nodes,['rabbit@192.168.82.134','rabbit@192.168.8.7']},
 {cluster_name,<<"rabbit@rabbitmq-statefulset-development-0.rabbitmq.default.svc.cluster.local">>},
 {partitions,[]},
 {alarms,[{'rabbit@192.168.82.134',[]},{'rabbit@192.168.8.7',[]}]}]
```
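
The start step itself is elided above; presumably something like the following was run on the stopped member (rabbit@192.168.82.134):

```
# on rabbit@192.168.82.134
rabbitmqctl start_app
rabbitmqctl cluster_status   # both nodes should now appear in running_nodes
```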

Then it works. I thought this would be handled by the plugin with the default settings in my StatefulSet and Service, which I took from this repo.
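
For context, the kind of environment the plugin's Kubernetes examples typically set looks roughly like the sketch below; the variable names are taken from the plugin's README as I recall them and should be verified against the example manifests in this repo:

```
# illustrative only; confirm names against the rabbitmq-autocluster README
AUTOCLUSTER_TYPE=k8s         # use the Kubernetes backend for peer discovery
AUTOCLUSTER_CLEANUP=true     # remove nodes that disappear from the peer list
CLEANUP_WARN_ONLY=false      # actually forget them rather than only warning
K8S_ADDRESS_TYPE=ip          # form node names from pod IPs, e.g. rabbit@192.168.8.6
RABBITMQ_USE_LONGNAME=true   # required when node names are IP- or FQDN-based
```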

michaelklishin commented 7 years ago

Mnesia is still running on node 'rabbit@192.168.82.134'. Please stop the node with rabbitmqctl stop_app first

has a hint.

michaelklishin commented 7 years ago

As the README for this plugin states, it is not a replacement for understanding the basics of cluster formation. Please follow the clustering 101 transcript on rabbitmq.com and the meaning of the message(s) will become clearer.

srflaxu40 commented 7 years ago

@michaelklishin I understand, but I feel it's a little more than that. The issue is that on boot-up one broker starts fine, but the second cannot cluster with the first. Both are started using the defaults in the k8s examples.

I had even attempted cleaning up the Mnesia data, as I had found suggested elsewhere:

```
rm -rf /var/lib/rabbitmq/ || true
rm -rf /rabbitmq/var/lib/rabbitmq/ || true

rabbitmq-server -detached
```

srflaxu40 commented 7 years ago

The second broker always starts, fails to join, and crashes with the generic "node ... disagrees" error.

michaelklishin commented 7 years ago

Removing a data directory without first stopping the node won’t get you where you want. There is only one scenario which produces the error message in question.

This is not a support forum. Please post step by step instructions to reproduce to rabbitmq-users or we won’t be able to help you.

michaelklishin commented 7 years ago

Alternatively nodes can be reset without restarting with rabbitmqctl reset. It’s a good idea to reset both nodes before trying further.

michaelklishin commented 7 years ago

Steps to roughly get into the state @srflaxu40's nodes are in: form a two-node cluster of A and B, then reset (or wipe the data directory of) one node while the other is left untouched, so the untouched node still believes the reset node is a cluster member.

How do we get out of this state? Reset node B or stop it and wipe its data directory, then restart.
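
A sketch of the second option, stopping the node and wiping its data directory (the path here is the RABBITMQ_MNESIA_DIR reported in the environment above); the reset option is the stop_app/reset/start_app sequence already shown earlier in the thread:

```
# on node B: stop the whole node, not just the rabbit application
rabbitmqctl stop

# wipe its Mnesia directory (path from RABBITMQ_MNESIA_DIR above)
rm -rf /var/lib/rabbitmq/mnesia

# restart; the node comes up fresh and can be joined again
rabbitmq-server -detached
```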

This really isn't rocket science.