hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

recovering from leader loss: neither force-leave nor cleanup_dead_servers works #3151

Open aep opened 7 years ago

aep commented 7 years ago

consul version for both Client and Server

Both Consul v0.8.4

consul info for both Client and Server

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 2
build:
    prerelease = 
    revision = f436077
    version = 0.8.4
consul:
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr = 
    server = true
raft:
    applied_index = 144
    commit_index = 144
    fsm_pending = 0
    last_contact = 7m32.065037683s
    last_log_index = 146
    last_log_term = 3
    last_snapshot_index = 0
    last_snapshot_term = 0
    latest_configuration = [{Suffrage:Voter ID:f5ac46b4-299d-23f5-6a25-dbb1e0565a58 Address:172.30.50.96:8300} {Suffrage:Voter ID:000449be-6dbb-e34e-993f-a9f8334ea0f9 Address:172.30.31.217:8300} {Suffrage:Nonvoter ID:cb832411-bd03-bd2c-3b2e-40badd7c969e Address:172.30.2.254:8300}]
    latest_configuration_index = 145
    num_peers = 1
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Candidate
    term = 62
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 72
    max_procs = 1
    os = linux
    version = go1.8.3
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 3
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 3
    member_time = 33
    members = 7
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 3
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 30
    members = 7
    query_queue = 0
    query_time = 1

Operating system and Environment details

Official Docker image 37ffadd9b8a6

Description of the Issue (and unexpected/desired result)

Loss of quorum cannot be recovered from unless the replacement instances have the same IPs as the servers they replace.

Reproduction steps

  1. Start 5 AWS instances with -bootstrap-expect 5.
  2. Wait for consensus.
  3. Terminate any 3 of them and spin up 3 replacements with different IPs.
  4. The remaining servers log raft: Failed to make RequestVote RPC errors forever.

force-leave has no effect, and neither does setting CleanupDeadServers = true.
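For reference, a minimal sketch of the kind of server agent configuration this setup implies (values here are illustrative, not taken from the report; cleanup_dead_servers is the agent-config counterpart of the CleanupDeadServers setting mentioned above):

    {
      "server": true,
      "bootstrap_expect": 5,
      "datacenter": "dc1",
      "data_dir": "/consul/data",
      "autopilot": {
        "cleanup_dead_servers": true
      }
    }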

slackpad commented 7 years ago

Hi @aep, if you lose 3 out of 5 servers, quorum is lost and Consul can no longer make changes to the cluster on its own. At that point you will need to perform the steps in https://www.consul.io/docs/guides/outage.html#failure-of-multiple-servers-in-a-multi-server-cluster to recover the cluster manually and introduce the 3 new servers into the configuration.
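For Raft protocol version 3 (as shown in the consul info output above), that guide's manual recovery roughly amounts to stopping the remaining servers, writing a peers.json file into each server's raft directory (data-dir/raft/peers.json) listing the intended voting members, and restarting them. A sketch using the two surviving voters from the latest_configuration above; the replacement servers' IDs and addresses would be added alongside them:

    [
      {
        "id": "f5ac46b4-299d-23f5-6a25-dbb1e0565a58",
        "address": "172.30.50.96:8300",
        "non_voter": false
      },
      {
        "id": "000449be-6dbb-e34e-993f-a9f8334ea0f9",
        "address": "172.30.31.217:8300",
        "non_voter": false
      }
    ]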

slackpad commented 7 years ago

I should also point to https://www.consul.io/docs/guides/autopilot.html#dead-server-cleanup, which removes dead servers as new ones are added to replace them. For a typical ASG-style setup with automatic replacement, this should handle even unclean server failures automatically.
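For illustration, the Autopilot settings can be checked and changed at runtime with the operator CLI (available in 0.8+); note that these calls go through Raft, so they only work while the cluster still has a leader:

    # inspect the current Autopilot configuration
    consul operator autopilot get-config

    # make sure dead-server cleanup is enabled (it is on by default)
    consul operator autopilot set-config -cleanup-dead-servers=true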

aep commented 7 years ago

@slackpad that's exactly what I was reporting. That option is what I thought would cover my use case (ASG), but it has no effect. I don't understand why loss of quorum is a special case, or why the option is disabled under those conditions.

slackpad commented 7 years ago

With Raft, all configuration changes to the quorum (adding and removing servers) go through the Raft protocol itself, so if Raft can't make any changes because quorum has been lost, there's no good way for the servers to recover on their own. We have considered special operator APIs and CLIs that would allow the peers.json-style recovery without actually shutting down Consul and placing those files, but it was cumbersome to build and use given that the cluster is in an outage state. We may revisit that in the future, but it's not currently planned.

With the latest version of Consul with Autopilot and an ASG, Consul will keep things clean in the face of failures as long as you don't lose quorum, so you'd want enough servers to cover the number of simultaneous failures you need to handle (3 servers tolerate 1 failure, 5 tolerate 2, etc.).
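That sizing guidance follows directly from the Raft quorum arithmetic; a quick sketch:

    # quorum = floor(N / 2) + 1,  tolerated failures = N - quorum
    # N = 3  ->  quorum 2, tolerates 1 failure
    # N = 5  ->  quorum 3, tolerates 2 failures
    # N = 7  ->  quorum 4, tolerates 3 failures
    #
    # Losing 3 of 5 servers (as in this report) leaves 2 < 3, so Raft can no
    # longer commit the membership change that would remove the dead servers.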

aep commented 7 years ago

Thanks for the explanation. It would be helpful if the docs stated that right under the dead server cleanup option.

Something like: "Dead server cleanup only works while fewer than a quorum-breaking number of servers have failed; otherwise the cluster is in an outage state [link] and dead servers will not be removed."

I'm not sure, but for the outage state, wouldn't it be possible to just rediscover the memberlist from -retry-join-ec2? This would yield the same result on all nodes and should be fairly simple to implement.
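For context, the EC2 discovery referred to here is driven by the agent's retry-join flags (a sketch for Consul 0.8.x; the tag key and value are illustrative). Note that this only tells an agent which nodes to gossip-join; it does not change the Raft peer set, which is why it can't repair a lost quorum by itself:

    consul agent -server -bootstrap-expect=5 \
      -retry-join-ec2-tag-key=consul-server \
      -retry-join-ec2-tag-value=true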

slackpad commented 7 years ago

I'll keep this open to track updating the docs!

I'm not sure, but for the outage state, wouldn't it be possible to just rediscover the memberlist from -retry-join-ec2? This would yield the same result on all nodes and should be fairly simple to implement.

We could add some tooling for operators to do something like this, and we've looked at it, but it gets tricky: since you don't have a quorum, the tool needs to contact each server and perform the change at the same time, so it's hard to craft that into a robust experience.

aep commented 7 years ago

"since you don't have a quorum you need the tool to contact each server and"

Not sure why. The EC2 API guarantees consistency, so each server could simply assume this is the truth and proceed without asking anyone else.

slackpad commented 7 years ago

Not sure why. The EC2 API guarantees consistency, so each server could simply assume this is the truth and proceed without asking anyone else.

You'd still need to initiate this operation across all the servers (or have them somehow coordinate to kick it off after an outage). You wouldn't want, for example, some subset of servers running this on their own and splitting off into a separate cluster because they got partitioned, while some of the other servers were still waiting and still part of the old cluster.