hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Docker Pause is causing consul remote site failure #17403

Open Prasanth131093 opened 1 year ago

Prasanth131093 commented 1 year ago

Overview of the Issue

We faced a failure of Consul remote (cross-datacenter) queries during Consul WAN connectivity testing.

Reproduction Steps

Below are the steps to reproduce the issue.

1. Ungraceful stop of 2 Consul containers on DC1 (using the docker pause command / sending a SIGSTOP signal to freeze the Consul process).
2. After some time, we forcefully make the remaining node the leader. After that, the DC1 cluster is fine with one leader node.

[root@*** bin]# consul operator raft list-peers
Node ID Address State Voter RaftProtocol
ConsulPri3 **** ****:8300 leader true 3
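For reference, the ungraceful stop in step 1 corresponds roughly to the commands below; the container names and the PID lookup are placeholders, not taken from the actual environment.

# Freeze two DC1 Consul containers without letting them leave gracefully
docker pause consul-dc1-server2
docker pause consul-dc1-server3
# Equivalent when Consul runs as a plain process: suspend it with SIGSTOP
kill -STOP "$(pidof consul)"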

The DC2 cluster is running fine as a 3-node cluster.

During this scenario, any request from the geo (secondary) site to the primary is not working; it times out because it reaches out to a paused DC1 Consul container.

Request from DC2 to DC1:

curl http://127.0.0.1:8500/v1/operator/autopilot/health?dc=1 --max-time 20
curl: (28) Operation timed out after 20001 milliseconds with 0 out of -1 bytes received

The same query works fine locally from DC1. During this scenario, requests fail only when sent from DC2 to DC1.
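For comparison, the local check is the same request run on a DC1 server itself (a sketch; assuming the agent there also listens on 127.0.0.1:8500), and it returns promptly:

curl http://127.0.0.1:8500/v1/operator/autopilot/health?dc=1 --max-time 20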

Consul WAN list details

Below is the information I collected after reproducing the issue.

DC1 WAN details:

[***@ DC1 ]# consul members -wan -detailed
Node Address Status Tags
DC1server1 (DC1server1IP) :8302 alive acls=1,ap=default,bootstrap=1,build=1.12.0:09a8cdb4,dc=1,ft_fs=1,ft_si=1,id=52019630-880a-6f66-827f-c3afabeb5b13,port=8300,raft_vsn=3,role=consul,segment=,use_tls=1,vsn=2,vsn_max=3,vsn_min=2
DC2server1 (DC2server1IP):8302 alive acls=1,ap=default,bootstrap=1,build=1.12.0:09a8cdb4,dc=2,ft_fs=1,ft_si=1,id=b7d8ee37-c0c4-098e-7531-d23da4a6b704,port=8300,raft_vsn=3,role=consul,segment=,use_tls=1,vsn=2,vsn_max=3,vsn_min=2
DC2server2 (DC2server2IP) :8302 alive acls=1,ap=default,build=1.12.0:09a8cdb4,dc=2,ft_fs=1,ft_si=1,id=f44ed1f2-2347-868d-599d-240e3a26f6f8,port=8300,raft_vsn=3,role=consul,segment=,use_tls=1,vsn=2,vsn_max=3,vsn_min=2
DC2server3 (DC2server3IP) :8302 alive acls=1,ap=default,build=1.12.0:09a8cdb4,dc=2,ft_fs=1,ft_si=1,id=6666389a-3a9e-c4fc-3c9f-207f69182b04,port=8300,raft_vsn=3,role=consul,segment=,use_tls=1,vsn=2,vsn_max=3,vsn_min=2

DC2 WAN details (it seems that, from DC2's perspective, the stopped nodes are not cleaned up):

[ **@ DC2 ]# consul members -wan -detailed
Node Address Status Tags
DC1server1 (DC1server1IP) :8302 alive acls=1,ap=default,bootstrap=1,build=1.12.0:09a8cdb4,dc=1,ft_fs=1,ft_si=1,id=52019630-880a-6f66-827f-c3afabeb5b13,port=8300,raft_vsn=3,role=consul,segment=,use_tls=1,vsn=2,vsn_max=3,vsn_min=2
DC1server2 (DC1server2IP) :8302 failed acls=1,ap=default,build=1.12.0:09a8cdb4,dc=1,ft_fs=1,ft_si=1,id=33033aae-ff2a-cd67-0e47-e50d3704c6bd,port=8300,raft_vsn=3,role=consul,segment=,use_tls=1,vsn=2,vsn_max=3,vsn_min=2
DC1server3 (DC1server3IP) :8302 failed acls=1,ap=default,build=1.12.0:09a8cdb4,dc=1,ft_fs=1,ft_si=1,id=6cf087c7-af10-b1e0-0d3b-f825222d1d2d,port=8300,raft_vsn=3,role=consul,segment=,use_tls=1,vsn=2,vsn_max=3,vsn_min=2
DC2server1 (DC2server1IP) :8302 alive acls=1,ap=default,bootstrap=1,build=1.12.0:09a8cdb4,dc=2,ft_fs=1,ft_si=1,id=b7d8ee37-c0c4-098e-7531-d23da4a6b704,port=8300,raft_vsn=3,role=consul,segment=,use_tls=1,vsn=2,vsn_max=3,vsn_min=2
DC2server2 (DC2server2IP) :8302 alive acls=1,ap=default,build=1.12.0:09a8cdb4,dc=2,ft_fs=1,ft_si=1,id=f44ed1f2-2347-868d-599d-240e3a26f6f8,port=8300,raft_vsn=3,role=consul,segment=,use_tls=1,vsn=2,vsn_max=3,vsn_min=2
DC2server3 (DC2server3IP) :8302 alive acls=1,ap=default,build=1.12.0:09a8cdb4,dc=2,ft_fs=1,ft_si=1,id=6666389a-3a9e-c4fc-3c9f-207f69182b04,port=8300,raft_vsn=3,role=consul,segment=,use_tls=1,vsn=2,vsn_max=3,vsn_min=2
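As a possible workaround to test (a sketch, not something from the original report): the failed DC1 entries could be removed from DC2's WAN pool by hand so that DC2 stops routing to them. The -wan and -prune flags should be verified against the force-leave documentation for the Consul version in use.

consul force-leave -wan -prune DC1server2
consul force-leave -wan -prune DC1server3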

Logs from DC2

2023-05-18T15:47:50.731+0530 [INFO] agent.server.serf.wan: serf: attempting reconnect to DC1server2 DC1server2IP :8302
2023-05-18T15:48:50.734+0530 [INFO] agent.server.serf.wan: serf: attempting reconnect to DC1server2 DC1server2IP :8302
2023-05-18T15:50:50.738+0530 [INFO] agent.server.serf.wan: serf: attempting reconnect to DC1server3 DC1server3IP :8302
2023-05-18T15:51:50.741+0530 [INFO] agent.server.serf.wan: serf: attempting reconnect to DC1server2 DC1server2IP :8302
2023-05-18T15:52:50.744+0530 [INFO] agent.server.serf.wan: serf: attempting reconnect to DC1server3 DC1server3IP :8302
2023-05-18T15:54:20.748+0530 [INFO] agent.server.serf.wan: serf: attempting reconnect to DC1server3 DC1server3IP :8302
2023-05-18T15:55:20.750+0530 [INFO] agent.server.serf.wan: serf: attempting reconnect to DC1server3 DC1server3IP :8302
2023-05-18T15:56:50.752+0530 [INFO] agent.server.serf.wan: serf: attempting reconnect to DC1server3 DC1server3IP :8302
2023-05-18T15:57:50.758+0530 [INFO] agent.server.serf.wan: serf: attempting reconnect to DC1server2 DC1server2IP :8302

No suspicious logs from DC1 related to WAN join.

This issue is also discussed in the Consul forum: https://discuss.hashicorp.com/t/docker-pause-is-causing-consul-remote-site-failure/53856/2

maxb commented 1 year ago

Having been involved in the conversation over in the Discuss thread, I was curious whether I could reproduce the issue.

I did my testing with the latest Consul 1.15.2 since that's what I already had installed.

I started up two 3-node Consul datacenters with all instances running on my laptop. I froze two of the dc1 nodes using Ctrl-Z. I shut down the remaining dc1 node, and performed peers.json Raft recovery to make it the sole active node. I attempted to query dc1 via dc2, using consul catalog services -datacenter=dc1 directed to a dc2 server throughout.
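(For reference, peers.json recovery here means Consul's documented manual outage recovery procedure: with Raft protocol 3, a file roughly like the sketch below is written into the stopped server's raft directory before restarting it. The ID, address, and data-dir path are placeholders.)

cat > /path/to/consul-data/raft/peers.json <<'EOF'
[
  {
    "id": "00000000-0000-0000-0000-000000000000",
    "address": "127.0.0.1:8300",
    "non_voter": false
  }
]
EOF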

I had mixed results reproducing @Prasanth131093's results.

The most relevant log lines regarding which other Consul servers Consul is talking to appear to be the [DEBUG] agent.router.manager: ones.
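If the servers are not already logging at DEBUG level, one way to surface those lines without restarting anything is to stream them from a running agent, for example (the grep filter is only an illustration):

consul monitor -log-level=debug | grep agent.router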

@Prasanth131093: Since I was unable to reproduce the issue based on your description, could you please re-run your tests using Consul 1.15.2 to see if it is reproducible for you? Please pay special attention to any agent.router log lines.

Prasanth131093 commented 1 year ago

Hi,

Before that, can you please use the docker pause command / the SIGSTOP signal to freeze the Consul process? I could reproduce the issue only with this step; other forceful kill methods do not reproduce it.

Please try with docker pause / the SIGSTOP signal.