hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Cluster does not attempt to remove inaccessible server node for ~55min #9810

Open tomer-ds opened 3 years ago

tomer-ds commented 3 years ago

Overview of the Issue

We have a Consul cluster in the AWS Tokyo region made up of 3 server nodes spread across AZs A, C and D. Last Friday (19-02-2021) all of AZ-C went down, causing the cluster to lose its connection to the Consul server in that AZ.

The 2 log lines that first show the issue starting:

19 Feb 2021 20:22:39.025[HL-L-CON-2]: 2021-02-19T18:22:35.373Z [DEBUG] agent.server.raft: accepted connection: local-address=172.20.200.36:8300 remote-address=172.20.100.161:60196

However, the next server-rebalancing log shows that the node is still considered part of the cluster, even though there are already connectivity errors:

19 Feb 2021 20:22:44.576[HL-L-CON-1]: 2021-02-19T18:22:40.351Z [DEBUG] agent.server.router.manager: Rebalanced servers, new active server: number_of_servers=3 active_server="HL-L-CON-2.dc1 (Addr: tcp/172.20.200.36:8300) (DC: dc1)"
19 Feb 2021 20:22:49.375[HL-L-CON-1]: 2021-02-19T18:22:45.028Z [ERROR] agent.server.raft: failed to appendEntries to: peer="{Voter d7d0a6a5-989c-530f-42d5-75a191b61890 172.20.200.36:8300}" error="read tcp 172.20.100.161:60053->172.20.200.36:8300: i/o timeout"
19 Feb 2021 20:22:49.576[HL-L-CON-1]: 2021-02-19T18:22:47.139Z [WARN] agent.server.raft: failed to contact: server-id=d7d0a6a5-989c-530f-42d5-75a191b61890 time=2.5000003s

And these errors persist:

19 Feb 2021 20:22:59.383[HL-L-CON-1]: 2021-02-19T18:22:54.535Z [DEBUG] agent.server.raft: failed to contact: server-id=d7d0a6a5-989c-530f-42d5-75a191b61890 time=9.8955255s
19 Feb 2021 20:22:59.576[HL-L-CON-1]: 2021-02-19T18:22:55.038Z [ERROR] agent.server.raft: failed to appendEntries to: peer="{Voter d7d0a6a5-989c-530f-42d5-75a191b61890 172.20.200.36:8300}" error="read tcp 172.20.100.161:60196->172.20.200.36:8300: i/o timeout"
19 Feb 2021 20:22:59.576[HL-L-CON-1]: 2021-02-19T18:22:55.552Z [ERROR] agent.server.raft: failed to heartbeat to: peer=172.20.200.36:8300 error="read tcp 172.20.100.161:60201->172.20.200.36:8300: i/o timeout"
19 Feb 2021 20:22:59.576[HL-L-CON-1]: 2021-02-19T18:22:56.992Z [DEBUG] agent.server.raft: failed to contact: server-id=d7d0a6a5-989c-530f-42d5-75a191b61890 time=12.3525651s

This includes the spurious rebalancing across all 3 servers, on both clients and servers. NOTE: at this point only 2 servers are reachable at all.

19 Feb 2021 20:23:24.205[RTCM-LIVE-SCM1]: 2021-02-19T18:23:21.889Z [DEBUG] agent.client.manager: Rebalanced servers, new active server: number_of_servers=3 active_server="HL-L-CON-2 (Addr: tcp/172.20.200.36:8300) (DC: dc1)"
19 Feb 2021 20:23:26.524[HL-L-CON-3]: 2021-02-19T18:23:22.679Z [DEBUG] agent.server.router.manager: Rebalanced servers, new active server: number_of_servers=3 active_server="HL-L-CON-3.dc1 (Addr: tcp/172.20.150.187:8300) (DC: dc1)"

And the errors get worse:

19 Feb 2021 20:27:14.377[HL-L-CON-1]: 2021-02-19T18:27:10.125Z [ERROR] agent.server.raft: failed to heartbeat to: peer=172.20.200.36:8300 error="read tcp 172.20.100.161:60351->172.20.200.36:8300: i/o timeout"
19 Feb 2021 20:27:14.577[HL-L-CON-1]: 2021-02-19T18:27:10.156Z [ERROR] agent.server.memberlist.lan: memberlist: Push/Pull with HL-L-CON-2 failed: read tcp 172.20.100.161:60352->172.20.200.36:8301: i/o timeout
19 Feb 2021 20:27:14.577[HL-L-CON-1]: 2021-02-19T18:27:10.335Z [DEBUG] agent.server.raft: failed to contact: server-id=d7d0a6a5-989c-530f-42d5-75a191b61890 time=4m25.6957898s
19 Feb 2021 20:27:14.577[HL-L-CON-1]: 2021-02-19T18:27:10.676Z [DEBUG] agent.server.memberlist.lan: memberlist: Stream connection from=172.20.100.129:52036
19 Feb 2021 20:27:14.577[HL-L-CON-1]: 2021-02-19T18:27:12.834Z [DEBUG] agent.server.raft: failed to contact: server-id=d7d0a6a5-989c-530f-42d5-75a191b61890 time=4m28.1948126s
19 Feb 2021 20:27:19.377[HL-L-CON-1]: 2021-02-19T18:27:15.307Z [DEBUG] agent.server.raft: failed to contact: server-id=d7d0a6a5-989c-530f-42d5-75a191b61890 time=4m30.6684523s
19 Feb 2021 20:27:19.577[HL-L-CON-1]: 2021-02-19T18:27:17.723Z [DEBUG] agent.server.raft: failed to contact: server-id=d7d0a6a5-989c-530f-42d5-75a191b61890 time=4m33.0844736s

Then the clients started attempting to query the failed server for leadership, and due to the lack of response the client application lost its leadership election:

19 Feb 2021 20:42:33.511[RTCM-LIVE-SCM2]: 2021-02-19T18:42:30.854Z [ERROR] agent.client.memberlist.lan: memberlist: Push/Pull with HL-L-CON-2 failed: read tcp 172.20.200.210:64656->172.20.200.36:8301: i/o timeout
19 Feb 2021 20:42:33.712[RTCM-LIVE-SCM2]: 2021-02-19T18:42:33.090Z [DEBUG] agent.http: Request finished: method=PUT url=/v1/session/create from=127.0.0.1:64672 latency=12.6932ms
19 Feb 2021 20:42:33.712[RTCM-LIVE-SCM2]: 2021-02-19T18:42:33.093Z [DEBUG] agent.http: Request finished: method=GET url=/v1/kv/service/scm/leader?wait=2s from=127.0.0.1:64672 latency=1.9385ms
...
19 Feb 2021 20:42:59.031[RTCM-LIVE-SCM1]: 2021/02/19 18:42:57 [ERR] yamux: keepalive failed: i/o deadline reached
19 Feb 2021 20:42:59.031[RTCM-LIVE-SCM1]: 2021-02-19T18:42:57.048Z [ERROR] agent.client: RPC failed to server: method=Coordinate.Update server=172.20.200.36:8300 error="rpc error making call: EOF"
19 Feb 2021 20:42:59.031[RTCM-LIVE-SCM1]: 2021-02-19T18:42:57.048Z [ERROR] agent.client: RPC failed to server: method=Session.Renew server=172.20.200.36:8300 error="rpc error making call: EOF"
19 Feb 2021 20:42:59.031[RTCM-LIVE-SCM1]: 2021-02-19T18:42:57.048Z [DEBUG] agent.client.manager: cycled away from server: server="HL-L-CON-2 (Addr: tcp/172.20.200.36:8300) (DC: dc1)"
19 Feb 2021 20:42:59.031[RTCM-LIVE-SCM1]: 2021-02-19T18:42:57.048Z [DEBUG] agent.client.manager: pinging server failed: server="HL-L-CON-2 (Addr: tcp/172.20.200.36:8300) (DC: dc1)" error="rpc error making call: EOF"
19 Feb 2021 20:42:59.031[RTCM-LIVE-SCM1]: 2021-02-19T18:42:57.048Z [ERROR] agent: Coordinate update error: error="rpc error making call: EOF"
19 Feb 2021 20:42:59.031[RTCM-LIVE-SCM1]: 2021-02-19T18:42:57.048Z [DEBUG] agent.tlsutil: OutgoingRPCWrapper: version=1
19 Feb 2021 20:42:59.031[RTCM-LIVE-SCM1]: 2021-02-19T18:42:57.048Z [ERROR] agent.client: RPC failed to server: method=KVS.Get server=172.20.200.36:8300 error="rpc error making call: EOF"
19 Feb 2021 20:42:59.031[RTCM-LIVE-SCM1]: 2021-02-19T18:42:57.048Z [ERROR] agent.http: Request error: method=PUT url=/v1/session/renew/65870a4b-3e1d-7ddd-fd36-e1bf1d6c3f2c from=127.0.0.1:52930 error="rpc error making call: EOF"
19 Feb 2021 20:42:59.031[RTCM-LIVE-SCM1]: 2021-02-19T18:42:57.048Z [ERROR] agent.http: Request error: method=GET url=/v1/kv/service/scm/leader?consistent&index=7189585 from=127.0.0.1:62906 error="rpc error making call: EOF"

But the cluster still considered the failed node a member, and did not even attempt to remove it:

19 Feb 2021 20:42:59.031[RTCM-LIVE-SCM1]: 2021-02-19T18:42:57.048Z [DEBUG] agent.client.manager: Rebalanced servers, new active server: number_of_servers=3 active_server="HL-L-CON-1 (Addr: tcp/172.20.100.161:8300) (DC: dc1)"

Until eventually, after quite some time (~55 min), the server was recognized as failed and forcefully removed...

https://gist.github.com/tomerMP/db3c03ccde246cd7ef9da1bd8e91c7f4

Judging by the testing I did when first introducing Consul, this behavior is unexpected... In further testing and reproduction attempts after this issue occurred, Consul successfully removes the failed node from the cluster within ~30 seconds and continues to function with 2 nodes up (for longer than 2 hours in our last test), until we add the repaired node back, at which point the cluster returns to normal functionality with 3 nodes.

In this case, the cluster's failure to remove the disconnected node for so long seems to have led to queries being sent to the failed node, and hence to the resulting errors and the loss of leadership in the client application.
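
For reference, this is roughly what I run on one of the healthy servers when I want to check on, and if necessary manually evict, a dead peer (a sketch only; node name taken from the logs above, default ports and no ACL tokens assumed):

```
# Serf view of the LAN pool: the dead node should show up as "failed"
consul members

# Raft view: is the dead node still listed as a voter?
consul operator raft list-peers

# Evict the unreachable node immediately instead of waiting for the
# reconnect timeout to expire
consul force-leave HL-L-CON-2
```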

Reproduction Steps

I have been unable to reproduce this. Initially I thought the 'leave_on_terminate: true' setting might help, but testing both with and without it shows it only helps when a TERM signal is delivered to the Consul service (service stop / server shutdown). So when the network is suddenly disconnected (or a similar failure is simulated), the cluster never receives a leave message from the failed node. However, in my tests the cluster consistently and successfully removes the failed node after a matter of seconds even without this setting present.
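
For completeness, the leave-related fragment of our server agent configuration looks roughly like this (a minimal sketch; values other than leave_on_terminate are assumptions for illustration):

```
# Sketch of the parts of the agent config relevant to leave behaviour
server     = true
datacenter = "dc1"

# Triggers a graceful leave only when the agent itself receives a TERM
# signal (service stop / clean shutdown); it does nothing when the
# network is cut abruptly, which matches what we observed
leave_on_terminate = true
```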

Consul info for both Client and Server

Client info

```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 2cf0a3c8
    version = 1.7.1
consul:
    acl = disabled
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 16
    goroutines = 57
    max_procs = 16
    os = windows
    version = go1.13.7
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 10
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 90
    members = 5
    query_queue = 0
    query_time = 1
```
Server info

```
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 2cf0a3c8
    version = 1.7.1
consul:
    acl = disabled
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr = 172.20.100.161:8300
    server = true
raft:
    applied_index = 7397897
    commit_index = 7397897
    fsm_pending = 0
    last_contact = 24.411ms
    last_log_index = 7397897
    last_log_term = 47
    last_snapshot_index = 7391889
    last_snapshot_term = 47
    latest_configuration = [{Suffrage:Voter ID:b0e2e661-3cf3-7890-4ac6-56307d1e5ac5 Address:172.20.100.161:8300} {Suffrage:Voter ID:f7340294-6107-b667-6102-f27cec3890af Address:172.20.150.187:8300} {Suffrage:Voter ID:d7d0a6a5-989c-530f-42d5-75a191b61890 Address:172.20.200.36:8300}]
    latest_configuration_index = 0
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 47
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 82
    max_procs = 2
    os = windows
    version = go1.13.7
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 10
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 90
    members = 5
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 33
    members = 3
    query_queue = 0
    query_time = 1
```

Operating system and Environment details

Windows Server 2012 R2, on AWS m5.large instances

tomer-ds commented 3 years ago

Since I was unable to reproduce this on any of my test environments, I attempted to reproduce it on the same production environment where the issue occurred. I simulated the outage by putting my second server (Con-2) behind an AWS security group with no inbound or outbound rules, effectively disconnecting it from the world without a graceful leave signal.
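
If anyone wants to repeat this, the isolation was done by swapping the instance's security groups for an empty "blackhole" group with no rules; roughly the following with the AWS CLI (instance and security-group IDs are hypothetical):

```
# Replace the instance's security groups with an empty group that has
# no inbound or outbound rules (IDs below are placeholders)
aws ec2 modify-instance-attribute \
  --instance-id i-0123456789abcdef0 \
  --groups sg-0123456789abcdef0
```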

We saw many RPC errors, but the expected `Handling event: event=member-leave` never came... I ran the test for over 2 hours, and at no point did the cluster attempt to force the Con-2 server to leave the cluster.

tomer-ds commented 3 years ago

Update

It would seem that our assumption was incorrect, and that the expected behaviour is for the cluster to keep attempting to reconnect to the partitioned node until `reconnect_timeout` is reached (72 hours by default).
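
For anyone else hitting this, that window can apparently be tuned via reconnect_timeout in the agent configuration; a sketch (my understanding is that Consul enforces a minimum of 8 hours here, so check the documentation before relying on it):

```
# How long a failed member may linger before being reaped from the
# member list. Default is 72h; the documented minimum is 8h.
reconnect_timeout = "8h"
```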

This then raises the question: why did we see a member-leave event at 19:16 (<1 hour after the incident)?

Our new assumption is that this behaviour is actually the unexpected one.

The question then becomes: why would the client have an issue with a server node being partitioned and unreachable? Aside from errors in the logs, I would assume the cluster and the clients continue about their day...