rabbitmq / rabbitmq-server

Open source RabbitMQ: core server and tier 1 (built-in) plugins
https://www.rabbitmq.com/

Clustering based on ASG fails if one or more nodes in the ASG is terminated #2528

Open kimma-basefarm opened 6 years ago

kimma-basefarm commented 6 years ago

RabbitMQ nodes will stop with an error if an ASG contains terminated instances that can no longer be described via the EC2 API endpoint:

2018-01-24 08:38:12.257 [error] <0.214.0> Error fetching node list via EC2 API, request path: /?Action=DescribeInstances&InstanceId.3=i-0532xxxdc49605ea5&InstanceId.4=i-034xxxbdc2ad23fe&Version=2015-10-01, error: "Bad Request"
2018-01-24 08:38:12.257 [error] <0.214.0> Cannot discover any nodes: DescribeInstances API call failed.

As you can see, it retrieved the instance IDs from the ASG successfully (InstanceId.3 and InstanceId.4 are populated), but one of these instances is terminated and can no longer be "described", so the API rejects the entire request with an error ("Bad Request" in the log above). Even though there are Healthy/InService hosts in the ASG, the node fails to discover them because DescribeInstances failed.

Perhaps it should only return Healthy/InService nodes from the initial DescribeAutoScalingGroups call that provides the instance IDs, or run the DescribeInstances API request once per instance ID, so that it can fail gracefully on Standby/Terminated hosts but still loop through and discover the InService hosts to cluster with.
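
The two suggestions above could look roughly like the following. This is only a minimal Python sketch to illustrate the idea, not the plugin's actual (Erlang) code. The `InstanceId`, `LifecycleState` and `HealthStatus` fields are the ones the Auto Scaling API returns per instance; the function names and the `describe_one` callback are hypothetical stand-ins for whatever performs the real API call.

```python
from typing import Callable, Dict, Iterable, List


def in_service_ids(asg_instances: Iterable[Dict[str, str]]) -> List[str]:
    """Option 1: keep only instances the ASG reports as InService and Healthy."""
    return [
        inst["InstanceId"]
        for inst in asg_instances
        if inst.get("LifecycleState") == "InService"
        and inst.get("HealthStatus") == "Healthy"
    ]


def describe_each(
    instance_ids: Iterable[str],
    describe_one: Callable[[str], str],
) -> Dict[str, str]:
    """Option 2: describe instances one at a time and skip the ones that fail.

    `describe_one` is a hypothetical callback that returns a hostname or
    address for a single instance ID and raises if it cannot be described.
    """
    described: Dict[str, str] = {}
    for instance_id in instance_ids:
        try:
            described[instance_id] = describe_one(instance_id)
        except Exception:
            # Placeholder for the client's real error type: a terminated
            # instance must not abort discovery of the remaining ones.
            continue
    return described
```

With option 1 the terminated instance never reaches DescribeInstances at all; with option 2 one failing instance ID only costs one failed request instead of the whole node list.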

michaelklishin commented 6 years ago

Thanks for the details. I edited the issue to be less alarming and clearer.

michaelklishin commented 6 years ago

I'm looking into two options:

michaelklishin commented 6 years ago

We decided to introduce an integration suite that will use ASGs first, so this will take longer but I hope to get it into 3.7.4.

michaelklishin commented 6 years ago

A proper test suite is taking longer than expected, so this is now scheduled for 3.7.5.

michaelklishin commented 6 years ago

Related: rabbitmq/rabbitmq-peer-discovery-aws#20.

michaelklishin commented 6 years ago

We currently have quite a few things going into 3.7.5 which we'd like to ship earlier, so this may have to wait. Re-scheduling for 3.7.6.

man-jiteshm-sportsbet commented 3 years ago

Any update on this? Was this fixed in 3.7.6, or is it still pending?