hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Cannot elect leader when cluster nodes up from 1 to 2 #10516

Open MasonXon opened 3 years ago

MasonXon commented 3 years ago

Overview of the Issue

When I was testing Consul's automatic leader election, I found that when 2 of the 3 nodes go down, the cluster stops working, which I know is expected. However, when I then start one of the stopped nodes, so that 2 of the 3 nodes are running again, the log keeps reporting election timeouts and the cluster still cannot serve requests. With version 1.6.10 everything is fine, but starting from 1.7.0 it is not. I tested versions 1.6.10, 1.7.0, 1.9.6, and 1.10.0, and only 1.6.10 works normally.

Reproduction Steps

Steps to reproduce this issue (a shell sketch of the same sequence follows the list):

  1. Create a cluster with 3 server nodes
  2. Stop the first node (1/3 down)
  3. Stop a second node (2/3 down)
  4. Start one of the two stopped nodes again
  5. Now 2 of the 3 nodes are running
  6. The cluster does not work; there is no leader
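
For reference, a rough shell sketch of the sequence I ran. It assumes each server runs Consul as a systemd unit named `consul` (as in my setup) and uses the hostnames from this issue; treat it as an illustration rather than an exact transcript:

```
# stop the first server (1/3 down, cluster keeps working)
ssh consul-01 'sudo systemctl stop consul'

# stop a second server (2/3 down, quorum is lost as expected)
ssh consul-02 'sudo systemctl stop consul'

# bring one of the stopped servers back (2 of 3 servers running again)
ssh consul-02 'sudo systemctl start consul'

# expected: a leader is elected; actual: "No cluster leader"
consul operator raft list-peers
```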

Consul info for both Client and Server

Server info

```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 95fb95bf
    version = 1.7.0
consul:
    acl = disabled
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr =
    server = true
raft:
    applied_index = 0
    commit_index = 0
    fsm_pending = 0
    last_contact = never
    last_log_index = 68
    last_log_term = 6
    last_snapshot_index = 0
    last_snapshot_term = 0
    latest_configuration = [{Suffrage:Voter ID:bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc Address:10.6.0.21:8300} {Suffrage:Voter ID:d244e694-c619-cf0a-e3d6-701bd510b70d Address:10.6.0.22:8300} {Suffrage:Voter ID:5fc5e757-a1c5-e6f0-ed28-3149d68e44bf Address:10.6.0.23:8300}]
    latest_configuration_index = 0
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Candidate
    term = 108
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 73
    max_procs = 2
    os = linux
    version = go1.12.16
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 5
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 14
    members = 2
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 14
    members = 2
    query_queue = 0
    query_time = 1
```

Operating system and Environment details

```
$ cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)

$ uname -a
Linux consul-02 3.10.0-1127.el7.x86_64 #1 SMP Tue Mar 31 23:36:51 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
```

Virtual Machine

Log Fragments

```
$ consul members
Node       Address         Status  Type    Build   Protocol  DC       Segment
consul-02  10.6.0.22:8301  alive   server  1.7.0   2         my-dc-1
consul-03  10.6.0.23:8301  alive   server  1.7.0   2         my-dc-1

$ consul operator raft list-peers
Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader)
```

consul-01 now is stopped

consul-02:

```
Jun 25 14:40:27 consul-02 consul: 2021-06-25T14:40:27.168Z [WARN]  agent.server.raft: Election timeout reached, restarting election
Jun 25 14:40:27 consul-02 consul: 2021-06-25T14:40:27.168Z [INFO]  agent.server.raft: entering candidate state: node="Node at 10.6.0.22:8300 [Candidate]" term=122
Jun 25 14:40:27 consul-02 consul: 2021-06-25T14:40:27.169Z [WARN]  agent.server.raft: unable to get address for sever, using fallback address: id=bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc fallback=10.6.0.21:8300 error="Could not find address for server id bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc"
Jun 25 14:40:27 consul-02 consul: 2021-06-25T14:40:27.169Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc 10.6.0.21:8300}" error="dial tcp ->10.6.0.21:8300: connect: connection refused"
```

consul-03:

```
Jun 25 14:40:56 consul-03 consul: 2021-06-25T14:40:56.707Z [INFO]  agent.server.raft: entering follower state: follower="Node at 10.6.0.23:8300 [Follower]" leader=
Jun 25 14:41:00 consul-03 consul: 2021-06-25T14:41:00.100Z [ERROR] agent: Coordinate update error: error="No cluster leader"
Jun 25 14:41:01 consul-03 consul: 2021-06-25T14:41:01.921Z [WARN]  agent.server.raft: heartbeat timeout reached, starting election: last-leader=
Jun 25 14:41:01 consul-03 consul: 2021-06-25T14:41:01.921Z [INFO]  agent.server.raft: entering candidate state: node="Node at 10.6.0.23:8300 [Candidate]" term=128
Jun 25 14:41:01 consul-03 consul: 2021-06-25T14:41:01.922Z [WARN]  agent.server.raft: unable to get address for sever, using fallback address: id=bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc fallback=10.6.0.21:8300 error="Could not find address for server id bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc"
Jun 25 14:41:01 consul-03 consul: 2021-06-25T14:41:01.923Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc 10.6.0.21:8300}" error="dial tcp ->10.6.0.21:8300: connect: connection refused"
Jun 25 14:41:06 consul-03 consul: 2021-06-25T14:41:06.683Z [INFO]  agent.server.raft: duplicate requestVote for same term: term=128
Jun 25 14:41:07 consul-03 consul: 2021-06-25T14:41:07.135Z [WARN]  agent.server.raft: Election timeout reached, restarting election
Jun 25 14:41:07 consul-03 consul: 2021-06-25T14:41:07.135Z [INFO]  agent.server.raft: entering candidate state: node="Node at 10.6.0.23:8300 [Candidate]" term=129
Jun 25 14:41:07 consul-03 consul: 2021-06-25T14:41:07.136Z [WARN]  agent.server.raft: unable to get address for sever, using fallback address: id=bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc fallback=10.6.0.21:8300 error="Could not find address for server id bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc"
Jun 25 14:41:07 consul-03 consul: 2021-06-25T14:41:07.137Z [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter bfb6b7bc-3cbd-6c1a-b3b2-f22e0c705afc 10.6.0.21:8300}" error="dial tcp ->10.6.0.21:8300: connect: connection refused"
```

MasonXon commented 3 years ago

I found another difference between 1.6.10 and 1.7.0: when I stop a node, the stopped node does not disappear from the output of `consul operator raft list-peers`.

MasonXon commented 3 years ago

Hi, sorry to bother you. I actually found this problem while simulating a failure scenario. If 2 of 3 Consul nodes go down in production, does that mean the cluster can only be restored by repairing all of the nodes, rather than just enough of them to regain a majority? Strangely, on 1.6.10 everything works just fine.

On Jun 30, 2021, at 8:11 PM, idrennanvmware @.***> wrote:

You likely have a split-brain scenario here. You should go from 1 -> 3 Consul nodes (IMO), and this should resolve what you're seeing.


idrennanvmware commented 3 years ago

@xiangma0510 our experience has been that we needed to use the peers.json reset option in the scenario you found. We don't have notes going back as far as 1.6, so I'm not sure whether we were affected by the difference you've seen.

https://learn.hashicorp.com/tutorials/consul/recovery-outage#manual-recovery-using-peers-json

Those are the steps we have followed in the past.
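
For reference, a minimal sketch of what that manual recovery looks like with Raft protocol version 3. The server IDs and addresses are taken from the `consul info` output above, and the path assumes the `data_dir` used in this issue (`/opt/consul`); adjust both for your environment:

```
# stop Consul on every remaining server first
sudo systemctl stop consul

# on each remaining server, write peers.json into the raft directory
cat > /opt/consul/raft/peers.json <<'EOF'
[
  {
    "id": "d244e694-c619-cf0a-e3d6-701bd510b70d",
    "address": "10.6.0.22:8300",
    "non_voter": false
  },
  {
    "id": "5fc5e757-a1c5-e6f0-ed28-3149d68e44bf",
    "address": "10.6.0.23:8300",
    "non_voter": false
  }
]
EOF

# start the servers again; they load peers.json on boot and then delete it
sudo systemctl start consul
```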

MasonXon commented 3 years ago

What is the peers.json reset? Does it mean recreating the Consul server agents?

idrennanvmware commented 3 years ago

I edited my post and added a link; it describes a manual recovery of a cluster. Ideally you can try the earlier steps in the document (though we never really had any success with those once we got into a state where we could no longer get a leader).

MasonXon commented 3 years ago

I checked the link you posted, and I will test whether peers.json can restore the cluster. One difference, though: I stop Consul with `systemctl stop consul`, which is not the same as an outage. In my failed cluster, the two running nodes are both in the candidate state; according to the election timeout mechanism, it should not be impossible to elect a leader. Why does the cluster only become healthy after the third node is started?

MasonXon commented 3 years ago

Hi, I have tested the peers.json method and it can indeed be used to restore the cluster. However, it is only meant for cases like an abnormal power outage, and according to the official documentation it is an incomplete way to restore a cluster, so it is not really suitable for the situation we may encounter. I would still like an explanation of why the scenario above makes the cluster unusable after version 1.6.10. I have tested etcd and ZooKeeper before, and neither of them had this problem.

blake commented 3 years ago

Hi @xiangma0510,

Thank you for sharing the details of your issue. Based on the log output you provided, it's still not clear to me what is happening when consul-01 is started again, and why consul-03 fails to establish a quorum with the first server.

That said, the problem you described sounds similar to the scenario detailed in #8118. The reporter of that issue was also unable to recover the cluster after performing a similar shutdown / restart operation on Consul 1.7.0.

https://github.com/hashicorp/consul/issues/8118#issuecomment-645330040 offers a great explanation as to why the cluster was unrecoverable. The solution to that issue was to set the autopilot.min_quorum value equal to the desired cluster size (which in your case is 3 servers) so that autopilot does not remove hosts if the number of active servers falls below this value.

Could you also try setting min_quorum in your server configuration, and see if you are able to successfully recover the cluster after simulating node failures?

autopilot {
  min_quorum = 3
}
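
If it helps, once the servers are back up you can also confirm that the setting was applied and that all three servers rejoin as voters (a sketch; the exact fields shown by these commands vary by version):

```
# check the autopilot configuration the cluster is actually using
consul operator autopilot get-config

# after a leader is elected, confirm all three servers are listed as voters
consul operator raft list-peers
```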

Thanks.

MasonXon commented 3 years ago

Thanks for your reply, I will try it later.

MasonXon commented 3 years ago

I have tested the autopilot.min_quorum parameter, and it doesn't seem to help. Through issue #8118 I have basically figured out why this problem occurs, but this parameter only determines the minimum number of voters. When the first of my three nodes fails, the failed node is still marked as left, which means that if two of the three nodes fail and the node that failed first is brought back into the cluster first, the cluster still cannot work normally. If the configuration of the first failed node were kept in the Raft configuration, i.e. it stayed in the failed state instead of being removed, could it then be guaranteed that starting any one of the two failed nodes restores normal operation of the cluster?
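
In case it is relevant, one experiment I am considering (my own assumption, not a confirmed fix) is to also disable autopilot's dead-server cleanup so that a stopped server is not removed from the Raft configuration:

```
autopilot {
  min_quorum           = 3
  cleanup_dead_servers = false
}
```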

MasonXon commented 3 years ago

Here is my configuration:

```
datacenter     = "my-dc-1"
data_dir       = "/opt/consul"
client_addr    = "0.0.0.0"

ui_config {
  enabled = true
}

server           = true
bind_addr        = "10.6.0.13"
advertise_addr   = "10.6.0.13"
bootstrap_expect = 3
retry_join       = ["10.6.0.11", "10.6.0.12", "10.6.0.13"]

autopilot {
  min_quorum = 3
}

leave_on_terminate      = false
skip_leave_on_interrupt = true
```
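
Before restarting each server I check the file with `consul validate` (the path here is an assumption about where the config lives on my hosts):

```
consul validate /etc/consul.d/
```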
