rabbitmq / rabbitmq-autocluster

RabbitMQ peer discovery and cluster formation plugin, supports RabbitMQ 3.6.x
BSD 3-Clause "New" or "Revised" License

AWS instance cannot create cluster with other nodes within the same AWS autoscaling group #49

Closed (ghost closed this issue 6 years ago)

ghost commented 6 years ago

Hi all, I'm facing a problem while trying to cluster two nodes that belong to the same autoscaling group.

I have two AWS instances (CentOS 7) within the same AWS autoscaling group, and each instance has RabbitMQ 3.6.10 with Erlang/OTP 20 installed. I also installed and enabled the rabbitmq-autocluster plugin 0.8.0.

Here's the rabbitmq.config file on both instances:

    [
      {rabbit, [
        {autocluster_log_level, info}
      ]},
      {autocluster, [
        {backend, aws},
        {aws_autoscaling, true},
        {aws_ec2_region, "eu-west-1"},
        {aws_access_key, "my_access_key"},
        {aws_secret_key, "my_secret_access_key"}
      ]}
    ].

I start the first RMQ server on the first instance (rabbit@ip-172-31-20-113). It creates its own single-node cluster, as expected.

BUT when I start the RMQ server on the second instance (rabbit@ip-172-31-16-139), it does not get clustered with the first instance, although it recognizes that both of them belong to the same autoscaling group. Here's the rabbitmq log from the second RMQ server (rabbit@ip-172-31-16-139):

=INFO REPORT==== 28-Sep-2017::08:32:30 === autocluster: List of registered nodes retrieved from the backend: ['rabbit@ip-172-31-20-113', 'rabbit@ip-172-31-16-139']

(As you can see, the autocluster plugin retrieved the nodes from the autoscaling group.)

=ERROR REPORT==== 28-Sep-2017::08:32:30 === autocluster: No nodes to choose the preferred from!

=INFO REPORT==== 28-Sep-2017::08:32:30 === autocluster: Picked node as the preferred choice for joining: undefined

=INFO REPORT==== 28-Sep-2017::08:32:30 === autocluster: Running step maybe_cluster

=INFO REPORT==== 28-Sep-2017::08:32:30 === autocluster: We are the first node in the cluster, starting up unconditionally.

Why doesn't the 2nd instance choose to join the 1st instance's cluster?

I would appreciate any help!

michaelklishin commented 6 years ago

Thank you for your time.

Team RabbitMQ uses GitHub issues for specific actionable items engineers can work on. This assumes two things:

  1. GitHub issues are not used for questions, investigations, root cause analysis, discussions of potential issues, etc (as defined by this team)
  2. We have a certain amount of information to work with

We get at least a dozen questions through various venues every single day, often quite light on details. At that rate GitHub issues can very quickly turn into something impossible to navigate and make sense of, even for our team. Because of that, questions, investigations, root cause analysis, and discussions of potential features are all considered to be mailing list material by our team. Please post this to rabbitmq-users.

Getting all the details necessary to reproduce an issue, make a conclusion or even form a hypothesis about what's happening can take a fair amount of time. Our team is multiple orders of magnitude smaller than the RabbitMQ community. Please help others help you by providing a way to reproduce the behavior you're observing, or at least sharing as much relevant information as possible on the list:

Feel free to edit out hostnames and other potentially sensitive information.

When/if we have enough details and evidence we'd be happy to file a new issue.

Thank you.

michaelklishin commented 6 years ago

@PanosStrouth according to the 2nd node log, it discovered no peers to select from. One very likely reason for this is that the nodes were booting in parallel and autoscaling group membership isn't updated instantly, so N nodes can begin with only themselves on the list. This is a known fundamental problem that this plugin has 2 solutions for (depending on the backend).

See Startup Delay in the README, as well as Race Conditions During Initial Cluster Formation in the RabbitMQ 3.7.0 Cluster Formation guide, which has a more detailed explanation of the same thing. Increasing the delay to, say, 60 seconds (from the default of 5) or higher will help.
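
For illustration, here is a minimal sketch of what the config from the original post might look like with the delay raised to 60 seconds. It assumes the delay is set with a `startup_delay` key in the `autocluster` section, which is how I recall the README's Startup Delay section naming it. Please verify the key name against the README for your plugin version before relying on it.

```erlang
%% rabbitmq.config, same Erlang-term format as in the original post.
%% Assumption: `startup_delay` is the setting key for the startup delay;
%% confirm it in the README's "Startup Delay" section for your version.
[
  {rabbit, [
    {autocluster_log_level, info}
  ]},
  {autocluster, [
    {backend, aws},
    {aws_autoscaling, true},
    {aws_ec2_region, "eu-west-1"},
    {aws_access_key, "my_access_key"},
    {aws_secret_key, "my_secret_access_key"},
    {startup_delay, 60}  %% seconds; raised from the default of 5
  ]}
].
```

The same file would need to be deployed on every instance in the autoscaling group and the nodes restarted for the change to take effect.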