rabbitmq / rabbitmq-autocluster

RabbitMQ peer discovery and cluster formation plugin, supports RabbitMQ 3.6.x
BSD 3-Clause "New" or "Revised" License

Nodes fail to communicate with peers in AWS Autoscaling Group #21

Closed mrburrito closed 7 years ago

mrburrito commented 7 years ago

I've been trying to get this plugin working for a few days now and cannot seem to get it to create a cluster. Please let me know what I'm doing wrong or if there's a legit bug in the plugin.

I'm running RabbitMQ within a Docker container, hosted on EC2 instances in an AutoScaling Group. There is only one container running on each server.

The attached zip file has the Dockerfile and resources it needs to build.

rabbit-autocluster-docker.zip

My instances use the following User Data script to configure Rabbit as a systemd service (on CentOS 7).

#!/bin/bash
mkdir /root/.docker
chmod 700 /root/.docker
aws configure set default.s3.signature_version s3v4
aws s3 cp s3://my-config-bucket/docker-config-for-private-registry.json /root/.docker/config.json
chmod 600 /root/.docker/config.json

cat >> /etc/systemd/system/rabbit-docker.service <<EOF
[Unit]
Description=RabbitMQ Docker Container
Requires=docker.service
After=docker.service

[Service]
Restart=always
ExecStartPre=/usr/bin/docker volume create --name=rabbit-data
ExecStartPre=/usr/bin/docker pull my.private.registry/myco/rabbit:dev
ExecStart=/usr/bin/docker run --name rabbitmq \
                              --log-driver=awslogs \
                              --log-opt awslogs-region=us-east-1 \
                              --log-opt awslogs-group=/rabbit \
                              --log-opt awslogs-stream=${HOSTNAME} \
                              -p 4369:4369 \
                              -p 5671:5671 \
                              -p 5672:5672 \
                              -p 15672:15672 \
                              -p 25672:25672 \
                              -v rabbit-data:/var/lib/rabbitmq \
                              -e ERLANG_COOKIE=BananaChocolateChip_TryIt_Really_ItsGood \
                              -e RABBITMQ_NODENAME=rabbit@${HOSTNAME}.mydomain.com \
                              -e RABBITMQ_USE_LONGNAME=true \
                              -e AUTOCLUSTER_DELAY=10 \
                              -e AUTOCLUSTER_LOG_LEVEL=debug \
                              -e AUTOCLUSTER_CLEANUP=true \
                              -e CLEANUP_WARN_ONLY=false \
                              -e AWS_AUTOSCALING=true \
                              -e AWS_EC2_TAGS={\"Name\":\"rabbit-autocluster-test\"} \
                              -e AWS_USE_PRIVATE_IP=false \
                              --network host \
                              my.private.registry/myco/rabbit:dev
ExecStop=/usr/bin/docker stop -t 2 rabbitmq
ExecStopPost=/usr/bin/docker rm -f rabbitmq

[Install]
WantedBy=default.target
EOF

systemctl daemon-reload
systemctl start rabbit-docker.service
systemctl enable rabbit-docker.service

The instances are launched in a VPC, with a private subnet connected to a NAT Gateway. The mydomain.com DNS is managed in Route53 and has both forward and reverse lookup entries for all the IP addresses in the subnet (e.g. ip-192-168-205-21.mydomain.com. A 192.168.205.21, 21.205.168.192.in-addr.arpa. PTR ip-192-168-205-21.mydomain.com).
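The forward and reverse record names described above follow a mechanical pattern, so they can be derived from an IP address and then compared against what DNS actually returns. A minimal sketch follows; the helper names `ptr_name` and `expected_fqdn` are hypothetical and not part of any tool mentioned in this thread.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from the plugin or AWS).

# Build the reverse-lookup (PTR) record name for an IPv4 address,
# e.g. 192.168.205.21 -> 21.205.168.192.in-addr.arpa.
ptr_name() {
  old_ifs=$IFS
  IFS=.
  # Split the address into its four octets as positional parameters.
  set -- $1
  IFS=$old_ifs
  printf '%s.%s.%s.%s.in-addr.arpa.\n' "$4" "$3" "$2" "$1"
}

# Build the hostname expected for an address, following the
# ip-A-B-C-D.<domain> pattern used in the example records above.
expected_fqdn() {
  printf 'ip-%s.%s\n' "$(printf '%s' "$1" | tr . -)" "$2"
}
```

On a node, the results could then be checked with `dig +short "$(expected_fqdn 192.168.205.21 mydomain.com)" A` and `dig +short -x 192.168.205.21`, assuming `dig` is installed on the host.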

The security group for the nodes allows access to the following ports to all members of the security group:

It also allows traffic on 5672 and 15672 from an ELB (classic) used by clients to connect to the cluster and management ports.

I'm seeing the following error log after the plugin retrieves the Nodes list from AWS:

=INFO REPORT==== 27-Apr-2017::18:24:07 ===
autocluster: Fetching autoscaling = DNS: ["ip-192-168-205-21.ec2.internal"]
=INFO REPORT==== 27-Apr-2017::18:24:07 ===
autocluster: Registering node with aws.
=INFO REPORT==== 27-Apr-2017::18:24:07 ===
autocluster: Registered node with aws.
=INFO REPORT==== 27-Apr-2017::18:24:07 ===
autocluster: Discovered ['rabbit@ip-192-168-205-21.ec2.internal']
=ERROR REPORT==== 27-Apr-2017::18:24:07 ===
autocluster: Can not communicate with cluster nodes.

This occurs when looking up nodes by autoscaling group, tag only, or a combination of the two.

michaelklishin commented 7 years ago

Thank you for your time.

Team RabbitMQ uses GitHub issues for specific actionable items engineers can work on. This assumes we have a certain amount of information to work with.

We get at least a dozen questions through various venues every single day, often quite light on details. At that rate GitHub issues can very quickly turn into something impossible to navigate and make sense of, even for our team. Because of that, questions, investigations, root cause analysis, and discussions of potential features are all considered to be mailing list material by our team. Please post this to rabbitmq-users.

Getting all the details necessary to make a conclusion or even form a hypothesis about what's happening can take a fair amount of time. Please help others help you by providing as much relevant information as possible on the list:

Feel free to edit out hostnames and other potentially sensitive information.

When/if we have enough details and evidence we'd be happy to file a new issue.

Thank you.

michaelklishin commented 7 years ago

This plugin does its job according to the log: it does discover a peer, which means another peer did register successfully.

Inter-node connectivity is in no way affected by this plugin. Please move this to rabbitmq-users.

michaelklishin commented 7 years ago

Generally there are three things needed for nodes to successfully cluster with each other:

Nodes use port 25672 for inter-node communication by default and you have it open, as well as 4369 for the port mapping daemon.

Take a look at the log files on all nodes; there may be clues.
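One way to confirm that ports 4369 and 25672 are actually reachable between nodes is to attempt a TCP connection from one node to a peer. This is a minimal sketch: the function names `check_port` and `check_peer` are hypothetical, and it assumes `bash` and coreutils `timeout` are available on the host. An open port shows only that the TCP path works. It does not verify the Erlang cookie or hostname resolution.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative).

# Print "open" if a TCP connection to host:port succeeds within 3 seconds,
# otherwise print "closed". Uses bash's /dev/tcp pseudo-device.
check_port() {
  if timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

# Check the two ports named above on one peer:
# 4369 (port mapping daemon) and 25672 (inter-node communication).
check_peer() {
  for p in 4369 25672; do
    echo "$1:$p $(check_port "$1" "$p")"
  done
}

# Run the check only when a peer hostname is passed as the first argument.
if [ "$#" -ge 1 ]; then
  check_peer "$1"
fi
```

Saved under a name of your choice, the script takes a peer hostname as its only argument, for example the hostname reported in the `Discovered [...]` log line. Running it from inside the container as well as from the host helps separate Docker networking problems from security group problems.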