Closed: tionebsalocin closed this issue 3 years ago.
There are two places in raft where a node enters Candidate state; non-voters should be prevented from getting there. There should be a guard in place that prevents non-voters from entering that state. Does that seem likely to you?
https://github.com/hashicorp/raft/pull/398 makes pickServer return only voters, which fixes the second issue I mentioned. https://github.com/hashicorp/raft/blob/72a6ed385dac0c8ee1628d12a9f0bb99abf39f19/raft.go#L1785 is another problem I filed a ticket for: #7031 (it was introduced with 1.5.2).
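For illustration, a minimal, self-contained sketch of the post-#398 behaviour, assuming pickServer chooses the leadership-transfer target (the types and the simplified selection are mine, not the library's verbatim code):

```go
package main

import "fmt"

// ServerSuffrage mirrors hashicorp/raft's voter/non-voter distinction.
type ServerSuffrage int

const (
	Voter ServerSuffrage = iota
	Nonvoter
)

type Server struct {
	ID       string
	Suffrage ServerSuffrage
}

// pickVoter sketches what pickServer does after PR #398: only voters are
// eligible as leadership-transfer targets, because the chosen peer is told
// to campaign immediately and would otherwise become a candidate despite
// being a non-voter. (The real code also prefers the peer with the most
// up-to-date log; that part is omitted here.)
func pickVoter(servers []Server) (Server, bool) {
	for _, s := range servers {
		if s.Suffrage == Voter {
			return s, true
		}
	}
	return Server{}, false
}

func main() {
	cfg := []Server{
		{ID: "consul22", Suffrage: Nonvoter},
		{ID: "consul01", Suffrage: Voter},
	}
	if target, ok := pickVoter(cfg); ok {
		fmt.Println("leadership-transfer target:", target.ID) // consul01, never the non-voter
	}
}
```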
I checked the first issue again: https://github.com/hashicorp/raft/blob/72a6ed385dac0c8ee1628d12a9f0bb99abf39f19/raft.go#L216 is the problem, because I don't see how non-voters are prevented from getting there; there should be a guard in place that keeps non-voters out of that state. Then again, I am not sure anymore that this is the problem: there is a check just before it that contains `!hasVote`, which checks whether the node is a voter. I am not sure anymore what the problem is here.
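To make the guard under discussion concrete, here is a self-contained sketch of a `hasVote`-style check before entering Candidate state. It mirrors the idea, not hashicorp/raft's exact code; the configuration type is simplified:

```go
package main

import "fmt"

type ServerSuffrage int

const (
	Voter ServerSuffrage = iota
	Nonvoter
)

type Server struct {
	ID       string
	Suffrage ServerSuffrage
}

// hasVote reports whether the server with the given ID is a voter in the
// latest configuration, like the !hasVote check mentioned above.
func hasVote(config []Server, id string) bool {
	for _, s := range config {
		if s.ID == id {
			return s.Suffrage == Voter
		}
	}
	return false
}

func main() {
	latest := []Server{
		{ID: "voter-1", Suffrage: Voter},
		{ID: "nonvoter-22", Suffrage: Nonvoter},
	}
	for _, id := range []string{"voter-1", "nonvoter-22"} {
		// The guard: when the election timeout fires, a non-voter must
		// stay follower instead of transitioning to Candidate.
		if !hasVote(latest, id) {
			fmt.Printf("%s: not a voter, aborting election\n", id)
			continue
		}
		fmt.Printf("%s: entering Candidate state\n", id)
	}
}
```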
I saw the same issue today: a non-voter became visible as Leader for a few seconds, then another server took over the role (Consul 1.6.5).
We had the issue several times recently.
E.g.: 5 voter servers + 6 non-voters. When we reinstall non-voters with:
{ "non_voter": true }
- `consul operator raft list-peers` shows consul22 as voting while it should not.
- `consul leave` command on consul22, then `consul operator raft list-peers` => shows non-voting (good!)
- `consul leave` on consul22, keep consul shut down, and do a `service consul restart` on all voting servers one by one, so those servers do NOT see the old machine22 in their `consul members`. Once none of them see machine22 in the member list, the situation is back to normal.
So, I suspect that the other consul servers see machine22[nodeID=22] dead without the tag non_voter=1, decide that consul22[nodeID=22] alive with the tag non_voter=true is the same machine, and mismatch the tags.
So, as long as machine22[nodeID=22] is in the member list, the other servers do not consider consul22 a non-voter.
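A self-contained sketch of the suspected mismatch (the member type and the match-by-node-ID lookup are my assumptions about the mechanism, not Consul's actual code):

```go
package main

import "fmt"

// member is a simplified stand-in for a serf member-list entry.
type member struct {
	name   string
	nodeID string
	status string
	tags   map[string]string
}

// isNonVoter returns whether the first member matching the node ID carries
// the nonvoter tag. If a stale "left" entry without the tag is matched
// first, the live non-voter gets misclassified as a voter.
func isNonVoter(list []member, nodeID string) bool {
	for _, m := range list {
		if m.nodeID == nodeID {
			return m.tags["nonvoter"] == "1"
		}
	}
	return false
}

func main() {
	list := []member{
		// Stale entry for the reinstalled machine: same node ID, no tag.
		{name: "machine22", nodeID: "22", status: "left", tags: map[string]string{}},
		// Live entry, correctly tagged as a non-voter.
		{name: "consul22", nodeID: "22", status: "alive", tags: map[string]string{"nonvoter": "1"}},
	}
	// Prints false: the stale machine22 entry wins, so consul22 is treated as a voter.
	fmt.Println("treated as non-voter:", isNonVoter(list, "22"))
}
```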
A real-world example (from a misconfiguration at startup):
consul28-par 10.236.91.29:8301 left acls=1,build=1.6.6+entcriteo1:,dc=par,expect=3,id=d67f58ed-f589-8f63-830a-2fe6b9f277b4,nonvoter=1,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
consul28-par.central.criteo.prod 10.236.91.29:8301 alive acls=1,build=1.6.6+entcriteo1:,dc=par,expect=3,id=d67f58ed-f589-8f63-830a-2fe6b9f277b4,nonvoter=1,port=8300,raft_vsn=3,role=consul,segment=<all>,use_tls=1,vsn=2,vsn_max=3,vsn_min=2,wan_join_port=8302
48-df-37-42-78-40.central.criteo.prod 10.236.91.29:8301 left acls=1,build=1.6.6criteo1:,dc=par,id=d67f58ed-f589-8f63-830a-2fe6b9f277b4,role=node,segment=<default>,vsn=2,vsn_max=3,vsn_min=2
Of course, we can see that 48-df-37-42-78-40.central.criteo.prod does not have the non_voter flag, and I suspect that's the root cause.
=> Until I restart all consul voters (in order to remove consul28-par and 48-df-37-42-78-40.central.criteo.prod), I keep getting these messages:
2020/07/24 12:33:20 [WARN] raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:23 [WARN] raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:29 [WARN] raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:34 [WARN] raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:38 [WARN] raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:43 [WARN] raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:48 [WARN] raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
2020/07/24 12:33:52 [WARN] raft: Rejecting vote request from 10.236.91.29:8300 since we have a leader: 10.236.42.13:8300
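These warnings come from raft's RequestVote handling. A minimal sketch of the check that produces them, simplified from hashicorp/raft (the function shape here is illustrative, not the library's actual signature):

```go
package main

import "fmt"

// requestVote sketches the guard behind the "Rejecting vote request ...
// since we have a leader" warning: while a healthy leader is known, vote
// requests from any other peer are refused (term and log checks omitted).
func requestVote(currentLeader, candidate string) bool {
	if currentLeader != "" && currentLeader != candidate {
		fmt.Printf("[WARN] raft: Rejecting vote request from %s since we have a leader: %s\n",
			candidate, currentLeader)
		return false
	}
	return true
}

func main() {
	// The stale peer keeps campaigning, but the cluster already has a
	// leader, so every one of its requests is rejected.
	requestVote("10.236.42.13:8300", "10.236.91.29:8300")
}
```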
Hi all!
I brought this issue up the other day with Consul Engineering and we determined that this is basically working as intended: OSS servers aren't aware of Ent-specific functionality, so the two should never be mixed. Given that, I'll close this one out. But do reach out if you have any other questions!
Overview of the Issue
In a cluster running the open source version of consul, I added an enterprise version server with the option non_voting_server. After a few elections, this server happened to become leader of the cluster, which was obviously not expected. It happened in production while we were progressively introducing the enterprise version server; the non-voting option was set to avoid any impact at the beginning of the migration. Note: the logs and info below come from a test platform on which we reproduced the issue.
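For reference, a minimal sketch of the server configuration described above; only `non_voting_server` is the option in question, the other fields are illustrative placeholders:

```
{
  "server": true,
  "non_voting_server": true,
  "datacenter": "dc0",
  "retry_join": ["consul-server000-dc0", "consul-server001-dc0"]
}
```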
Reproduction Steps
Steps to reproduce this issue, e.g.:
Consul info for both OSS and ENT Server
OSS Server info
```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease = criteo8
    revision =
    version = 1.5.1
consul:
    acl = disabled
    bootstrap = false
    known_datacenters = 1
    leader = true
    leader_addr = 172.21.185.218:8300
    server = true
raft:
    applied_index = 179
    commit_index = 179
    fsm_pending = 0
    last_contact = 0
    last_log_index = 179
    last_log_term = 24
    last_snapshot_index = 0
    last_snapshot_term = 0
    latest_configuration = [{Suffrage:Nonvoter ID:3199fa0a-f10b-c3af-37f0-2ad73553769a Address:172.21.180.196:8300} {Suffrage:Voter ID:a411cf76-a4f9-ae24-3afd-5e642763eed0 Address:172.21.185.218:8300} {Suffrage:Voter ID:4fa1d84f-dfed-f72b-e9e9-ce6ccb0cd64f Address:172.21.233.101:8300}]
    latest_configuration_index = 173
    num_peers = 1
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 24
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 100
    max_procs = 2
    os = linux
    version = go1.11.13
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 9
    failed = 0
    health_score = 2
    intent_queue = 0
    left = 0
    member_time = 12
    members = 6
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 4
    members = 3
    query_queue = 0
    query_time = 1
```
Enterprise Server info
```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease = criteo9
    revision =
    version = 1.5.1+ent
consul:
    acl = disabled
    bootstrap = false
    known_datacenters = 1
    leader = true
    leader_addr = 172.21.180.196:8300
    server = true
license:
    customer = permanent
    expiration_time = 2049-12-15 13:31:19.868306147 +0000 UTC
    features = Automated Backups, Automated Upgrades, Enhanced Read Scalability, Network Segments, Redundancy Zone, Advanced Network Federation
    id = permanent
    install_id = *
    issue_time = 2019-12-23 13:31:19.868306147 +0000 UTC
    package = premium
    product = consul
    start_time = 2019-12-23 13:26:19.868306147 +0000 UTC
raft:
    applied_index = 429
    commit_index = 429
    fsm_pending = 0
    last_contact = 0
    last_log_index = 429
    last_log_term = 36
    last_snapshot_index = 0
    last_snapshot_term = 0
    latest_configuration = [{Suffrage:Voter ID:a411cf76-a4f9-ae24-3afd-5e642763eed0 Address:172.21.185.218:8300} {Suffrage:Voter ID:4fa1d84f-dfed-f72b-e9e9-ce6ccb0cd64f Address:172.21.233.101:8300} {Suffrage:Voter ID:3199fa0a-f10b-c3af-37f0-2ad73553769a Address:172.21.180.196:8300}]
    latest_configuration_index = 295
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 36
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 117
    max_procs = 2
    os = linux
    version = go1.13.4
serf_lan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 14
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 21
    members = 6
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = true
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 5
    members = 3
    query_queue = 0
    query_time = 1
```
Operating system and Environment details
In this example: 3 VMs running Ubuntu, with 2 consul 1.5.1 OSS servers and 1 consul 1.5.1 ENT server.
OSS servers are:
consul-server000-dc0 build=1.5.1criteo8 role=consul
consul-server001-dc0 build=1.5.1criteo8 role=consul
ENT server is:
consul-server002-dc0 build=1.5.1+entcriteo9 nonvoter=1 role=consul
consul members + info fragment:
Detailed consul members output for these servers
Non-voting server is leader!
Log Fragments
Election log
Non-voting server gets elected