Open kimma-basefarm opened 6 years ago
Thanks for the details. I edited the issue to be less alarming and clearer.
I'm looking into two options:
instance-state-name
We decided to introduce an integration suite that will use ASGs first, so this will take longer, but I hope to get it into 3.7.4.
A proper test suite is taking longer than expected, so this is now scheduled for 3.7.5.
Related: rabbitmq/rabbitmq-peer-discovery-aws#20.
We currently have quite a few things going into 3.7.5 which we'd like to ship earlier. So this may have to wait, re-scheduling for 3.7.6.
Any update on this? Was this fixed in 3.7.6, or is it still pending?
RabbitMQ nodes will stop with an error if an ASG contains terminated instances that can no longer be described via the EC2 API endpoint:
As you can see, it retrieved the instances in the ASG successfully (instance IDs 3 and 4 are populated), but one of them is terminated and can no longer be "described", so the API returns a 500 error for the entire request. Even though there are Healthy/InService hosts in the ASG, the node fails to discover them because describe-instances failed.
Perhaps it should only use the Healthy/InService nodes from the initial describe-auto-scaling-groups call that provides the instance IDs, or run the DescribeInstances API request once per instance ID, so that it can fail gracefully on StandBy/Terminated hosts but still loop through and discover the InService hosts to cluster with.
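To make the two suggestions concrete, here is a rough Python sketch. It is not the plugin's code (the plugin is written in Erlang) and it makes no AWS calls. The `InstanceId`, `LifecycleState` and `HealthStatus` keys follow the shape of the ASG member records returned by the Auto Scaling API; the function names and the `describe_one` callable are hypothetical stand-ins for whatever actually performs the lookup.

```python
from typing import Any, Callable, Dict, Iterable, List


def in_service_instance_ids(asg_instances: Iterable[Dict[str, str]]) -> List[str]:
    """Option 1: keep only members the ASG reports as Healthy and InService."""
    return [
        inst["InstanceId"]
        for inst in asg_instances
        if inst.get("LifecycleState") == "InService"
        and inst.get("HealthStatus") == "Healthy"
    ]


def describe_each(
    instance_ids: Iterable[str],
    describe_one: Callable[[str], Any],  # hypothetical: describes a single instance
) -> Dict[str, Any]:
    """Option 2: describe instances one at a time and skip the ones that fail."""
    described: Dict[str, Any] = {}
    for instance_id in instance_ids:
        try:
            described[instance_id] = describe_one(instance_id)
        except Exception:
            # Broad catch for the sketch only: a terminated instance that can
            # no longer be described should not abort discovery of the rest.
            continue
    return described


if __name__ == "__main__":
    members = [
        {"InstanceId": "i-3", "LifecycleState": "InService", "HealthStatus": "Healthy"},
        {"InstanceId": "i-4", "LifecycleState": "Terminated", "HealthStatus": "Unhealthy"},
    ]
    print(in_service_instance_ids(members))
```

Option 1 avoids the failing request altogether by never asking about the terminated instance. Option 2 costs one request per instance but tolerates instances that disappear between the two calls.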