tomer-ds opened 3 years ago
Since I was unable to reproduce this on any of my test environments, I attempted to reproduce it on the same production environment where the issue occurred. I simulated the outage by putting my second server (Con-2) behind an AWS security group with no inbound or outbound rules, effectively disconnecting it from the world without a graceful leave signal.
We saw many RPC errors, but the expected `Handling event: event=member-leave` log line never came... I ran the test for over 2 hours, and at no point did the cluster attempt to force the Con-2 server to leave the cluster.
It would seem that our assumption was incorrect, and that the expected behaviour is for the cluster to keep attempting to reconnect to the partitioned node until `reconnect_timeout` has been reached (72 hours by default).
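For reference, this is the knob I mean — a minimal sketch of an agent config entry in HCL (the 8h value is just an illustration; as far as I can tell it is the lowest value Consul will accept):

```
# Agent config sketch: shorten the window after which a failed node is
# reaped from the member list. The default is 72h; Consul appears to
# enforce a minimum of 8h, so this is about as aggressive as it gets.
reconnect_timeout = "8h"
```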
This then begs the question: why did we see a `member-leave` event at 19:16, less than an hour after the incident?
Our new assumption is that this behaviour is actually the unexpected one.
The question then becomes: why would the clients have an issue with the fact that a server node has been partitioned and is unreachable? Aside from errors in the logs, I would assume the cluster and clients continue about their day...
Overview of the Issue
We have a Consul cluster in the AWS Tokyo region made up of 3 server nodes spread across AZs A, C and D. Last Friday (19-02-2021) the whole of AZ-C went down, causing the cluster to lose connection to the Consul server in that AZ.
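For context, each server agent is configured roughly along these lines — a simplified sketch with placeholder values (the datacenter name and join tags are illustrative, not our actual config):

```
# Simplified server agent config sketch -- placeholder values only
server           = true
bootstrap_expect = 3
datacenter       = "tokyo"
# the three servers sit in AZs A, C and D and discover each other via cloud auto-join
retry_join       = ["provider=aws tag_key=consul-server tag_value=prod region=ap-northeast-1"]
```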
The 2 log lines that first show the issue starting:
However, the next log line, about server rebalancing, shows that the failed server is still considered part of the cluster even though there are already errors regarding connectivity:
And these errors persist:
Including the bogus rebalancing across all 3 servers, on both clients and servers (NOTE: at this point only 2 of the servers are reachable in any way):
And the errors get worse:
Then the clients started attempting to query the failed server for leadership, and due to the lack of response the client application lost its leadership election:
But the cluster still considered the failed node a member and didn't even attempt to remove it:
Until eventually, after quite some time (~55 min), the server was recognized as failed and forcefully removed...
https://gist.github.com/tomerMP/db3c03ccde246cd7ef9da1bd8e91c7f4
Judging by the testing I did when first introducing Consul, this behaviour is unexpected... In further testing and reproduction attempts after this issue occurred, Consul successfully removes the failed node from the cluster within ~30 seconds and continues to function with 2 nodes up (in our last test for longer than 2 hours), until we add the fixed node back, at which point the cluster returns to normal functionality with 3 nodes.
In this case, the fact that the cluster failed to remove the disconnected node for so long seems to have led to queries being sent to the failed node, and hence to the resulting errors and the loss of leadership in the client application.
Reproduction Steps
I have been unable to reproduce this. Initially I thought that setting `leave_on_terminate: true` might help, but testing both with and without it I can see that it only helps when a TERM signal is sent to the Consul service (service stop / server shutdown). So when the network is suddenly disconnected (or a similar failure is simulated), the cluster does not get a leave message from the failed node. However, in my tests the cluster consistently and successfully removes the failed node after a matter of seconds even without this setting present.
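For completeness, this is the setting I was testing — a minimal agent config sketch in HCL (the value shown is just what I tried, not a recommendation):

```
# Agent config sketch: on a TERM signal (service stop / clean shutdown) the
# agent gracefully leaves the cluster before exiting. It cannot help with a
# network partition or hard failure, since no signal is ever delivered.
leave_on_terminate = true
```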
Consul info for both Client and Server
Client info
```
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease =
	revision = 2cf0a3c8
	version = 1.7.1
consul:
	acl = disabled
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 16
	goroutines = 57
	max_procs = 16
	os = windows
	version = go1.13.7
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 10
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 90
	members = 5
	query_queue = 0
	query_time = 1
```
Server info
```
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease =
	revision = 2cf0a3c8
	version = 1.7.1
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = 172.20.100.161:8300
	server = true
raft:
	applied_index = 7397897
	commit_index = 7397897
	fsm_pending = 0
	last_contact = 24.411ms
	last_log_index = 7397897
	last_log_term = 47
	last_snapshot_index = 7391889
	last_snapshot_term = 47
	latest_configuration = [{Suffrage:Voter ID:b0e2e661-3cf3-7890-4ac6-56307d1e5ac5 Address:172.20.100.161:8300} {Suffrage:Voter ID:f7340294-6107-b667-6102-f27cec3890af Address:172.20.150.187:8300} {Suffrage:Voter ID:d7d0a6a5-989c-530f-42d5-75a191b61890 Address:172.20.200.36:8300}]
	latest_configuration_index = 0
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 47
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 82
	max_procs = 2
	os = windows
	version = go1.13.7
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 10
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 90
	members = 5
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 33
	members = 3
	query_queue = 0
	query_time = 1
```
Operating system and Environment details
Windows Server 2012 R2, on AWS m5.large instances