FEATURE: Revisit retry-join value on failure

spanktar commented 8 years ago

`consul version` for both Client and Server

Client: 0.7.0 Server: 0.7.0

Operating system and Environment details

AWS AMI

Description of the Issue:

We have tested, and are now using, the DNS name of an ELB as the -retry-join value for an agent. The ELB fronts a cluster of Consul server instances in AWS created by an ASG. When the Consul agent container comes up, we pass the DNS name of the ELB to -retry-join and the agent is properly seeded with the addresses of the Consul servers. It actually works great.

While testing for resiliency, we destroy the entire Consul cluster and see what happens to the agents. Each agent returns "No known Consul servers" while continuing to try to rejoin the set of IP addresses it originally received during initial setup. Those instances no longer exist, so the agent retries forever. If we simply restart the agent container, it uses -retry-join again and finds the new cluster. It even reports the services it was managing, effectively healing itself...huzzah! As the agents are restarted one by one, the service catalog is re-populated and everyone is happy. (Our Consul servers self-populate their KV store upon initialization, so that's taken care of by the LC of the ASG.)

The feature request is as follows:

Under some set of circumstances, probably controlled by a configuration value, the agent should revisit the value in -retry-join instead of relying only on its internal IP list when trying to phone home to the servers. This could happen after a timeout once "No known Consul servers" is hit, or be enabled by a setting along the lines of "join_is_elb" (or something better named). Our setup would be 100% resilient to wipeouts or other catastrophes if the Consul agent would simply re-examine the value given to it in -retry-join at some point after exhausting its attempts to contact all of the servers in its internal list.

Obviously this use case isn't for everyone, which is why it should be explicitly enabled and configurable.
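
To make the request concrete, here is a rough sketch of the fallback I have in mind. It is not Consul code and says nothing about how the agent is actually implemented: `find_servers`, `resolve`, `try_join`, and the `rediscover_on_exhaustion` flag are all hypothetical names, and the resolver and join functions are passed in so the logic can be exercised without a real cluster.

```python
from typing import Callable, Iterable, List


def find_servers(
    seed: str,
    known: List[str],
    resolve: Callable[[str], Iterable[str]],
    try_join: Callable[[str], bool],
    rediscover_on_exhaustion: bool = True,
) -> List[str]:
    """Return the addresses that accepted a join attempt.

    Hypothetical sketch: `seed` stands in for the original -retry-join
    value (e.g. an ELB DNS name) and `known` for the agent's cached
    server IP list. None of these names come from Consul itself.
    """
    # First, try every address the agent already knows about.
    alive = [addr for addr in known if try_join(addr)]
    if alive or not rediscover_on_exhaustion:
        return alive

    # Every cached address failed: go back to the original seed value
    # and resolve it again instead of retrying dead IPs forever.
    fresh = list(resolve(seed))
    return [addr for addr in fresh if try_join(addr)]


if __name__ == "__main__":
    # Simulated wipeout: the old servers are gone, new ones sit behind
    # the same (made-up) DNS name.
    new_cluster = {"10.0.1.1", "10.0.1.2"}
    servers = find_servers(
        seed="consul.example.internal",
        known=["10.0.0.1", "10.0.0.2"],
        resolve=lambda name: sorted(new_cluster),
        try_join=lambda addr: addr in new_cluster,
    )
    print(servers)
```

With the flag off, the sketch behaves like what we see today: once the cached list is dead, nothing is found until the agent is restarted. With it on, the seed is consulted again only after the cached list is exhausted, so setups that don't want this are unaffected.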

Discuss.

slackpad commented 8 years ago

Hi @spanktar, if the new servers coming up join with at least one existing agent from the cluster, then word should get around via gossip without having to touch any of the existing agents. There's a change brewing under https://github.com/hashicorp/consul/pull/2459 that should make this easy to set up!

spanktar commented 7 years ago

@slackpad I'll look at that. In the meantime, that's pretty much what I'm getting at. We understand that if we give it one good value, it will gossip (good), but the list given by -retry-join is never revisited, and this can cause problems as outlined above.

I'm very excited about the AWS auto discovery, but I think it might still have the same problem in the case I mentioned.

spanktar commented 7 years ago

Bump. As I start a new Consul deployment with 0.8.5, I'm wondering whether this has been addressed. Of course we'll be switching from -retry-join to -retry-join-ec2-*, but the same thing could happen there, as mentioned in the issue.

Thanks!

RogerReed commented 5 years ago

I'm using AWS retry-join (e.g. `provider=aws tag_key=Name tag_value=${CONSUL_CLUSTER_SERVER_NAME}`) and also think it would be helpful if an agent that loses its Consul servers tried to find them again using the tag values instead of returning "No known Consul servers".
