hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Non voting servers get elected as leader in a mixed OSS/Enterprise version cluster #6979

Closed tionebsalocin closed 3 years ago

tionebsalocin commented 4 years ago

Overview of the Issue

In a cluster running the open source version of Consul, I added an enterprise version server with the `non_voting_server` option. After a few elections, this server ended up as leader of the cluster, which was obviously not expected. This happened in production while we were progressively introducing the enterprise version server; the non-voting option was set precisely to avoid any impact at the beginning of the migration. Note: the logs and info below come from a test platform on which we reproduced the issue.

Reproduction Steps

Steps to reproduce this issue:

  1. Create a cluster with 3 client nodes and 3 server nodes: 2 servers running version 1.5.1 OSS and 1 running version 1.5.1 enterprise
  2. Add the `non_voting_server` option to the enterprise server's config file (and restart Consul on this server)
  3. On one of the 2 servers running OSS, stop Consul to provoke an election
  4. Loop until the enterprise version server gets elected

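For step 2, the change on the enterprise server is a single flag. A minimal fragment of what we assume the server's config file looked like (field name as referenced in this issue and the Consul Enterprise docs for the 1.5 line; the rest of the config is omitted):

```json
{
  "server": true,
  "non_voting_server": true
}
```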
Consul info for both OSS and ENT Server

OSS server info:

```
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease = criteo8
	revision =
	version = 1.5.1
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 172.21.185.218:8300
	server = true
raft:
	applied_index = 179
	commit_index = 179
	fsm_pending = 0
	last_contact = 0
	last_log_index = 179
	last_log_term = 24
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Nonvoter ID:3199fa0a-f10b-c3af-37f0-2ad73553769a Address:172.21.180.196:8300} {Suffrage:Voter ID:a411cf76-a4f9-ae24-3afd-5e642763eed0 Address:172.21.185.218:8300} {Suffrage:Voter ID:4fa1d84f-dfed-f72b-e9e9-ce6ccb0cd64f Address:172.21.233.101:8300}]
	latest_configuration_index = 173
	num_peers = 1
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 24
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 100
	max_procs = 2
	os = linux
	version = go1.11.13
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 9
	failed = 0
	health_score = 2
	intent_queue = 0
	left = 0
	member_time = 12
	members = 6
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 4
	members = 3
	query_queue = 0
	query_time = 1
```
Enterprise server info:

```
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease = criteo9
	revision =
	version = 1.5.1+ent
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 172.21.180.196:8300
	server = true
license:
	customer = permanent
	expiration_time = 2049-12-15 13:31:19.868306147 +0000 UTC
	features = Automated Backups, Automated Upgrades, Enhanced Read Scalability, Network Segments, Redundancy Zone, Advanced Network Federation
	id = permanent
	install_id = *
	issue_time = 2019-12-23 13:31:19.868306147 +0000 UTC
	package = premium
	product = consul
	start_time = 2019-12-23 13:26:19.868306147 +0000 UTC
raft:
	applied_index = 429
	commit_index = 429
	fsm_pending = 0
	last_contact = 0
	last_log_index = 429
	last_log_term = 36
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:a411cf76-a4f9-ae24-3afd-5e642763eed0 Address:172.21.185.218:8300} {Suffrage:Voter ID:4fa1d84f-dfed-f72b-e9e9-ce6ccb0cd64f Address:172.21.233.101:8300} {Suffrage:Voter ID:3199fa0a-f10b-c3af-37f0-2ad73553769a Address:172.21.180.196:8300}]
	latest_configuration_index = 295
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 36
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 117
	max_procs = 2
	os = linux
	version = go1.13.4
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 14
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 21
	members = 6
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 5
	members = 3
	query_queue = 0
	query_time = 1
```

Operating system and Environment details

In this example, 3 VMs running Ubuntu: 2 with Consul 1.5.1 OSS and 1 with Consul 1.5.1 ENT.

OSS servers:
consul-server000-dc0 build=1.5.1criteo8 role=consul
consul-server001-dc0 build=1.5.1criteo8 role=consul

ENT server:
consul-server002-dc0 build=1.5.1+entcriteo9 nonvoter=1 role=consul

consul members + info fragment:

consul-server000-dc0  172.21.185.218:8301  alive   server  1.5.1criteo8      2         dc0  <all>
consul-server001-dc0  172.21.233.101:8301  alive   server  1.5.1criteo8      2         dc0  <all>
consul-server002-dc0  172.21.180.196:8301  alive   server  1.5.1+entcriteo9  2         dc0  <all>
    leader = false
    leader_addr = 172.21.180.196:8300

consul members detailed for these servers

Node                  Address              Status  Tags
consul-server000-dc0  172.21.185.218:8301  alive   acls=0,build=1.5.1criteo8:,criteo_grpc_enabled=true,dc=dc0,expect=3,grpc_enabled=true,id=a411cf76-a4f9-ae24-3afd-5e642763eed0,port=8300,raft_vsn=3,role=consul,segment=<all>,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
consul-server001-dc0  172.21.233.101:8301  alive   acls=0,build=1.5.1criteo8:,criteo_grpc_enabled=true,dc=dc0,expect=3,grpc_enabled=true,id=4fa1d84f-dfed-f72b-e9e9-ce6ccb0cd64f,port=8300,raft_vsn=3,role=consul,segment=<all>,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
consul-server002-dc0  172.21.180.196:8301  alive   acls=0,build=1.5.1+entcriteo9:,criteo_grpc_enabled=true,dc=dc0,expect=3,grpc_enabled=true,id=3199fa0a-f10b-c3af-37f0-2ad73553769a,nonvoter=1,port=8300,raft_vsn=3,role=consul,segment=<all>,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302

Non voting server is leader!

Log Fragments

Election log

2019/12/23 13:31:55 [WARN] raft: Heartbeat timeout from "172.21.185.218:8300" reached, starting election
2019/12/23 13:31:55 [INFO] raft: Node at 172.21.180.196:8300 [Candidate] entering Candidate state in term 35
2019/12/23 13:31:55 [ERR] raft: Failed to make RequestVote RPC to {Voter a411cf76-a4f9-ae24-3afd-5e642763eed0 172.21.185.218:8300}: EOF
2019/12/23 13:31:55 [DEBUG] raft: Votes needed: 2
2019/12/23 13:31:55 [DEBUG] raft: Vote granted from 3199fa0a-f10b-c3af-37f0-2ad73553769a in term 35. Tally: 1
2019/12/23 13:31:56 [WARN] raft: Election timeout reached, restarting election
2019/12/23 13:31:56 [INFO] raft: Node at 172.21.180.196:8300 [Candidate] entering Candidate state in term 36
2019/12/23 13:31:56 [ERR] raft: Failed to make RequestVote RPC to {Voter a411cf76-a4f9-ae24-3afd-5e642763eed0 172.21.185.218:8300}: dial tcp <nil>->172.21.185.218:8300: connect: connection refused
2019/12/23 13:31:56 [DEBUG] raft: Votes needed: 2
2019/12/23 13:31:56 [DEBUG] raft: Vote granted from 3199fa0a-f10b-c3af-37f0-2ad73553769a in term 36. Tally: 1
2019/12/23 13:31:56 [DEBUG] raft: Vote granted from 4fa1d84f-dfed-f72b-e9e9-ce6ccb0cd64f in term 36. Tally: 2
2019/12/23 13:31:56 [INFO] raft: Election won. Tally: 2
2019/12/23 13:31:56 [INFO] raft: Node at 172.21.180.196:8300 [Leader] entering Leader state
2019/12/23 13:31:56 [INFO] raft: Added peer a411cf76-a4f9-ae24-3afd-5e642763eed0, starting replication
2019/12/23 13:31:56 [INFO] raft: Added peer 4fa1d84f-dfed-f72b-e9e9-ce6ccb0cd64f, starting replication
2019/12/23 13:31:56 [INFO] consul: cluster leadership acquired
2019/12/23 13:31:56 [INFO] consul: New leader elected: consul-server002-dc0

Non voting server gets elected

hanshasselberg commented 4 years ago

There are two places in raft where a node enters Candidate state:

Does that seem likely to you?

hanshasselberg commented 4 years ago

https://github.com/hashicorp/raft/pull/398 makes pickServer return only voters which fixes the second issue I mentioned:

https://github.com/hashicorp/raft/blob/72a6ed385dac0c8ee1628d12a9f0bb99abf39f19/raft.go#L1785 is another problem I filed a ticket for: #7031 (because it was introduced with 1.5.2).

I checked the first issue again:

https://github.com/hashicorp/raft/blob/72a6ed385dac0c8ee1628d12a9f0bb99abf39f19/raft.go#L216: this is the problem, because I don't see how non-voters are prevented from getting there. There should be a guard in place that prevents non-voters from entering that state.

That said, I am not sure anymore that this is the problem. There is a check before that point which contains !hasVote, i.e. it checks whether the node is a voter. I am no longer sure what the problem is here.

pierresouchay commented 4 years ago

I saw the same issue today: a non-voter became visible as leader for a few seconds, then another server took over the leader role (Consul 1.6.5).

pierresouchay commented 4 years ago

We had the issue several times recently:

E.g. 5 voter servers + 6 non-voters. When we reinstall non-voters:

  1. The non-voter (node-id=machineID22), let's call it consul22, leaves the cluster with a consul leave, reboots, and its image is erased
  2. The machine is reinstalled with a new name, let's call it machine22, a consul-agent configuration and the same node-id: machineID22
  3. Once fully reinstalled, the machine is renamed "consul22", still with the same ID, and we also add the configuration { "non_voter": true }
  4. At that time, in the Consul cluster, consul members shows both machine22 (marked as dead) and consul22 (marked alive)
  5. Once started and caught up on replication, consul22 starts voting and triggers lots of elections; performance degrades
  6. At that point, consul operator raft list-peers shows consul22 as voting while it should not
  7. We do a consul leave on consul22
  8. We restart Consul on consul22; consul operator raft list-peers now shows it as non-voting (good!)
  9. Still, consul22 triggers elections and performance degrades
  10. The only way to reliably fix the problem is to do a consul leave on consul22, keep Consul shut down, and do a service consul restart on all voting servers one by one, so that those servers no longer see the old machine22 in their consul members; once none of them sees machine22 in the member list, the situation is back to normal

So, I suspect that the other Consul servers see machine22[nodeID=22] as dead without the tag non_voter=1, decide that consul22[nodeID=22], alive with the tag non_voter=true, is the same machine, and mix up the tags.

So, until machine22[nodeID=22] is removed from the member list, the other servers do not consider consul22 a non-voter.

pierresouchay commented 4 years ago

Example from real world (from misconfiguration at start):

consul28-par                                                   10.236.91.29:8301    left    acls=1,build=1.6.6+entcriteo1:,dc=par,expect=3,id=d67f58ed-f589-8f63-830a-2fe6b9f277b4,nonvoter=1,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
consul28-par.central.criteo.prod                               10.236.91.29:8301    alive   acls=1,build=1.6.6+entcriteo1:,dc=par,expect=3,id=d67f58ed-f589-8f63-830a-2fe6b9f277b4,nonvoter=1,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
48-df-37-42-78-40.central.criteo.prod                          10.236.91.29:8301    left    acls=1,build=1.6.6criteo1:,dc=par,id=d67f58ed-f589-8f63-830a-2fe6b9f277b4,role=node,segment=<default>,vsn=2,vsn_max=3,vsn_min=2

Of course, we can see that 48-df-37-42-78-40.central.criteo.prod does not have the non_voter flag, and I suspect that this is the root cause.

=> Until I restart all Consul voters (in order to remove consul28-par and 48-df-37-42-78-40.central.criteo.prod), I keep getting these messages:

2020/07/24 12:33:20 [WARN]  raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:23 [WARN]  raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:29 [WARN]  raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:34 [WARN]  raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:38 [WARN]  raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:43 [WARN]  raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:48 [WARN]  raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:52 [WARN]  raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
ChipV223 commented 3 years ago

Hi all!

I brought this issue up the other day with Consul Engineering and we determined that this is basically working as intended: OSS servers aren't aware of Enterprise-specific functionality, so the two should never be mixed. Given that, I'll close this one out. But do reach out if you have any other questions!