Open Prasanth131093 opened 1 year ago
Having been involved in the conversation over in the Discuss thread, I was curious whether I could reproduce the issue.
I did my testing with the latest Consul 1.15.2 since that's what I already had installed.
I started up two 3-node Consul datacenters with all instances running on my laptop.
I froze two of the dc1 nodes using Ctrl-Z.
I shut down the remaining dc1 node, and performed peers.json Raft recovery to make it the sole active node.
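For anyone repeating the peers.json step, here is a minimal sketch of generating the file for a single surviving server. It assumes the Raft protocol version 3 layout (a JSON array of objects with `id`, `address`, and `non_voter`). The node ID and address below are hypothetical placeholders, and `build_peers_json` is an illustrative helper, not part of Consul.

```python
import json


def build_peers_json(node_id, server_rpc_addr):
    """Return peers.json content listing one voting server (Raft protocol 3 layout)."""
    peers = [
        {
            "id": node_id,               # assumed to be the contents of <data-dir>/node-id
            "address": server_rpc_addr,  # server RPC address (port 8300 by default)
            "non_voter": False,
        }
    ]
    return json.dumps(peers, indent=2)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(build_peers_json("00000000-0000-0000-0000-000000000001", "127.0.0.1:8300"))
```

The output would be written to raft/peers.json under the server's data directory while the server is stopped, and the server then started again.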
I attempted to query dc1 via dc2, using consul catalog services -datacenter=dc1 directed to a dc2 server throughout.
I had mixed results reproducing @Prasanth131093's results:
consul members -wan
The most relevant log lines regarding which other Consul servers Consul is talking to appear to be the [DEBUG] agent.router.manager: ones.
@Prasanth131093: Since I was unable to reproduce the issue based on your description, could you please re-run your tests using Consul 1.15.2 to see whether it is reproducible for you? Please pay special attention to any agent.router log lines.
Hi,
Before that, could you please freeze the Consul process using the docker pause command or the SIGSTOP signal? I could reproduce the issue only with this step; force-killing the process does not reproduce it.
Please try again with docker pause or SIGSTOP.
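To make the difference between freezing and killing concrete, here is a minimal sketch that freezes a child process with SIGSTOP and confirms it is stopped rather than exited. A frozen process stays alive and keeps its sockets open, so peers may see timeouts instead of immediate connection errors. The helper name `freeze_check` and the placeholder sleep command are assumptions for illustration; this does not touch Consul or Docker.

```python
import os
import signal
import subprocess
import sys


def freeze_check(cmd):
    """Start cmd, freeze it with SIGSTOP, and report whether it is stopped (not exited)."""
    child = subprocess.Popen(cmd)
    try:
        child.send_signal(signal.SIGSTOP)
        # WUNTRACED makes waitpid also report children that are stopped, not only exited ones.
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        return os.WIFSTOPPED(status)
    finally:
        child.send_signal(signal.SIGCONT)  # resume so the process can handle termination
        child.terminate()
        child.wait()


if __name__ == "__main__":
    # Placeholder long-running command standing in for a Consul agent.
    print(freeze_check([sys.executable, "-c", "import time; time.sleep(30)"]))
```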
Overview of the Issue
We encountered failures of Consul remote (cross-datacenter) queries during Consul WAN connectivity testing.
Reproduction Steps
Below are the steps to reproduce the issue.
1. Ungracefully stop 2 Consul containers in DC1 (using the docker pause command or by sending the SIGSTOP signal to the Consul process).
2. After some time, forcefully make the remaining node the leader. After that, the DC1 cluster is fine with a single leader node.
The DC2 cluster is running fine with 3 nodes.
In this scenario, any request from the geo site (DC2) to the primary (DC1) fails: it times out while trying to reach the paused DC1 Consul containers.
Request from DC2 to DC1:
The same query works fine locally from DC1. In this scenario, only requests sent from DC2 to DC1 fail.
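A minimal sketch of issuing the same cross-datacenter query over the HTTP API with an explicit client-side timeout, so a hang shows up as a clear failure. It assumes a DC2 agent with its HTTP API on 127.0.0.1:8500 (a hypothetical address) and uses the catalog services endpoint with the `dc` query parameter. The function names are illustrative.

```python
import json
import urllib.parse
import urllib.request


def catalog_services_url(agent_addr, datacenter):
    """Build the catalog-services URL that asks the local agent to forward to `datacenter`."""
    query = urllib.parse.urlencode({"dc": datacenter})
    return f"http://{agent_addr}/v1/catalog/services?{query}"


def query_remote_services(agent_addr, datacenter, timeout_s=5.0):
    """Return the remote datacenter's services map, or None on timeout or error."""
    url = catalog_services_url(agent_addr, datacenter)
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return json.load(resp)
    except OSError as exc:  # URLError, HTTPError and socket timeouts are OSError subclasses
        print(f"query to {datacenter} via {agent_addr} failed: {exc}")
        return None


if __name__ == "__main__":
    # Hypothetical DC2 agent address; adjust to your environment.
    print(query_remote_services("127.0.0.1:8500", "dc1"))
```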
Consul WAN list details
Below is the information I collected after reproducing the issue.
DC1 WAN details:
DC2 WAN details: (it seems the stopped nodes are not cleaned up from DC2's view)
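To compare the two WAN views, a small sketch that groups members by the Status column of consul members -wan text output can help spot nodes one datacenter still treats as alive. The sample text below is made up for illustration and is not output from this issue. The column layout is assumed to have a header row containing `Node` and `Status`.

```python
def members_by_status(members_output):
    """Group node names by the Status column of `consul members` text output."""
    lines = [ln for ln in members_output.splitlines() if ln.strip()]
    header = lines[0].split()
    name_col, status_col = header.index("Node"), header.index("Status")
    grouped = {}
    for line in lines[1:]:
        fields = line.split()
        grouped.setdefault(fields[status_col], []).append(fields[name_col])
    return grouped


# Hypothetical sample, not taken from the reporter's environment.
SAMPLE = """\
Node         Address        Status  Type    Build   Protocol  DC
server1.dc1  10.0.0.1:8302  alive   server  1.15.2  2         dc1
server2.dc1  10.0.0.2:8302  failed  server  1.15.2  2         dc1
"""

if __name__ == "__main__":
    print(members_by_status(SAMPLE))
```

Running this against the output captured on a DC1 server and on a DC2 server, then comparing the two results, would show whether DC2 still lists the frozen DC1 servers as alive.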
Logs from DC2
No suspicious logs from DC1 related to WAN join.
This issue is also discussed in the Consul forum: https://discuss.hashicorp.com/t/docker-pause-is-causing-consul-remote-site-failure/53856/2