Open tifling85 opened 3 years ago
Hi @tifling85,
Can you share your Consul server agent configuration (with any sensitive parts removed) and the command you used to start the Consul server agents? In particular, I'm interested in the `-bootstrap-expect` value, which determines how many servers are needed before the initial leader election is triggered.
Can you also share the output of `consul operator raft list-peers` in steps 2 and 3?
Also, is there any log output indicating that a leader election takes place after restarting the service in step 3?
The stale read mode docs do mention that stale reads work while a cluster is unavailable / there is no leader. However, if the cluster was never bootstrapped to begin with / never had an initial leader election, we're not sure stale reads work in that case. The information requested above should help us understand whether a leader was ever elected.
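For reference, a stale read can also be issued directly against the HTTP API, where the response headers report whether the answering server knows of a leader. Below is a minimal Python sketch, assuming the agent listens on the default `http://127.0.0.1:8500`. The helper names are hypothetical; the `stale` and `raw` query parameters and the `X-Consul-KnownLeader` / `X-Consul-LastContact` headers come from Consul's HTTP API.

```python
import urllib.parse
import urllib.request

# Assumption: default local agent address; adjust for your environment.
DEFAULT_BASE = "http://127.0.0.1:8500"


def stale_kv_url(key, base=DEFAULT_BASE):
    """Build a KV read URL that asks for stale consistency and the raw value."""
    return f"{base}/v1/kv/{urllib.parse.quote(key)}?stale&raw"


def leader_status(headers):
    """Summarize the consistency headers Consul attaches to a read response."""
    return {
        "known_leader": headers.get("X-Consul-KnownLeader", "").lower() == "true",
        "last_contact_ms": int(headers.get("X-Consul-LastContact", "0")),
    }


def stale_get(key, base=DEFAULT_BASE):
    """Perform a stale read; urllib raises HTTPError if the agent answers 500."""
    with urllib.request.urlopen(stale_kv_url(key, base), timeout=5) as resp:
        return resp.read().decode(), leader_status(resp.headers)


if __name__ == "__main__":
    # Requires a running agent; the key name is only an example.
    print(stale_get("test_key"))
```

If the read succeeds with `known_leader` false, the value was served from the server's local state without a leader, which is the case we are trying to distinguish from a cluster that never elected one.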
Okay, I'll try again. Config file (/etc/consul.d/init.json):
{
"server": true,
"ui": true,
"advertise_addr": "10.179.37.248",
"bind_addr": "10.179.37.248",
"bootstrap_expect": 2,
"retry_join": ["10.179.37.214"],
"enable_local_script_checks": true,
"log_level": "trace"
}
Second server:
{
"server": true,
"ui": true,
"advertise_addr": "10.179.37.214",
"bind_addr": "10.179.37.214",
"bootstrap_expect": 2,
"retry_join": ["10.179.37.248"],
"enable_local_script_checks": true,
"log_level": "trace"
}
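With two voting servers, Raft needs both of them to form a majority, so losing either one leaves the cluster without a leader. A small sketch of the usual majority arithmetic (the function names are illustrative, not part of Consul):

```python
def quorum(servers):
    """Smallest number of voters that forms a strict majority."""
    return servers // 2 + 1


def failure_tolerance(servers):
    """How many servers can fail while a majority is still reachable."""
    return servers - quorum(servers)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(f"servers={n} quorum={quorum(n)} tolerance={failure_tolerance(n)}")
```

For `servers=2` this gives a quorum of 2 and a tolerance of 0, which is what the `bootstrap_expect = 2` warning in the agent log refers to.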
1. First start; the cluster was initialized:
[centos@tifweb-1 ~]$ consul members
Node Address Status Type Build Protocol DC Segment
tifweb-1.novalocal 10.179.37.248:8301 alive server 1.10.1 2 dc1 <all>
tifweb-2.novalocal 10.179.37.214:8301 alive server 1.10.1 2 dc1 <all>
[centos@tifweb-1 ~]$ consul operator raft list-peers
Node ID Address State Voter RaftProtocol
tifweb-2.novalocal 5e103976-5b77-b620-5e55-123bbcbb5884 10.179.37.214:8300 leader true 3
tifweb-1.novalocal bdc2ffbf-44d1-5eb4-4f0c-517a8368d983 10.179.37.248:8300 follower true 3
Add a test key:
[centos@tifweb-1 ~]$ consul kv put test_key test_value
Success! Data written to: test_key
[centos@tifweb-1 ~]$ consul kv get test_key
test_value
2. Turn off the server tifweb-2:
[centos@tifweb-1 ~]$ consul operator raft list-peers
Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader)
[centos@tifweb-1 ~]$ consul members
Node Address Status Type Build Protocol DC Segment
tifweb-1.novalocal 10.179.37.248:8301 alive server 1.10.1 2 dc1 <all>
tifweb-2.novalocal 10.179.37.214:8301 failed server 1.10.1 2 dc1 <all>
Check the availability of stale data:
[centos@tifweb-1 ~]$ consul kv get -stale test_key
test_value
3. Restart the service on the current server (tifweb-1):
[centos@tifweb-1 ~]$ sudo systemctl restart consul
The key is no longer available via a stale read:
[centos@tifweb-1 ~]$ consul members
Node Address Status Type Build Protocol DC Segment
tifweb-1.novalocal 10.179.37.248:8301 alive server 1.10.1 2 dc1 <all>
[centos@tifweb-1 ~]$ consul kv get -stale test_key
Error querying Consul agent: Unexpected response code: 500
[centos@tifweb-1 ~]$ consul operator raft list-peers
Error getting peers: Failed to retrieve raft configuration: Unexpected response code: 500 (No cluster leader)
Consul restart logs:
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.859+0700 [WARN] agent: bootstrap_expect = 2: A cluster with 2 servers will provide no failure tolerance. See https://www.consul.io/docs/internals/consensus.html#deployment-table
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.859+0700 [WARN] agent: bootstrap_expect > 0: expecting 2 servers
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.859+0700 [TRACE] agent.tlsutil: Update: version=1
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.859+0700 [TRACE] agent.tlsutil: OutgoingRPCWrapper: version=1
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.859+0700 [TRACE] agent: parsed scheme: "consul"
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.859+0700 [TRACE] agent: ccResolverWrapper: sending update to cc: { [] }
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.862+0700 [WARN] agent.auto_config: skipping file /etc/consul.d/.init.json.swp, extension must be .hcl or .json, or config format must be set
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.862+0700 [WARN] agent.auto_config: skipping file /etc/consul.d/consul.env, extension must be .hcl or .json, or config format must be set
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.862+0700 [WARN] agent.auto_config: The 'ui' field is deprecated. Use the 'ui_config.enabled' field instead.
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.862+0700 [WARN] agent.auto_config: Node name "tifweb-1.novalocal" will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.862+0700 [WARN] agent.auto_config: bootstrap_expect = 2: A cluster with 2 servers will provide no failure tolerance. See https://www.consul.io/docs/internals/consensus.html#deployment-table
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.862+0700 [WARN] agent.auto_config: bootstrap_expect > 0: expecting 2 servers
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.862+0700 [TRACE] agent.tlsutil: Update: version=2
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.863+0700 [TRACE] agent.tlsutil: OutgoingRPCWrapper: version=2
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.994+0700 [INFO] agent.server.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:bdc2ffbf-44d1-5eb4-4f0c-517a8368d983 Address:10.179.37.248:8300} {Suffrage:Voter ID:5e103976-5b77-b620-5e55-123bbcbb5884 Address:10.179.37.214:8300}]"
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.994+0700 [INFO] agent.server.raft: entering follower state: follower="Node at 10.179.37.248:8300 [Follower]" leader=
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.994+0700 [INFO] agent.server.serf.wan: serf: EventMemberJoin: tifweb-1.novalocal.dc1 10.179.37.248
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.995+0700 [WARN] agent.server.serf.wan: serf: Failed to re-join any previously known node
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.995+0700 [INFO] agent.server.serf.lan: serf: EventMemberJoin: tifweb-1.novalocal 10.179.37.248
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.995+0700 [INFO] agent.router: Initializing LAN area manager
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.995+0700 [WARN] agent.server.serf.lan: serf: Failed to re-join any previously known node
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.995+0700 [TRACE] agent: ccResolverWrapper: sending update to cc: {[{10.179.37.248:8300 0 tifweb-1.novalocal }] }
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.995+0700 [TRACE] agent: ccResolverWrapper: sending update to cc: {[{10.179.37.248:8300 0 tifweb-1.novalocal }] }
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.995+0700 [TRACE] agent: addrConn: tryUpdateAddrs curAddr: { 0 }, addrs: [{10.179.37.248:8300 0 tifweb-1.novalocal }]
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.995+0700 [TRACE] agent: ccResolverWrapper: sending update to cc: {[{10.179.37.248:8300 0 tifweb-1.novalocal.dc1 }] }
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.995+0700 [TRACE] agent: addrConn: tryUpdateAddrs curAddr: { 0 }, addrs: [{10.179.37.248:8300 0 tifweb-1.novalocal.dc1 }]
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.995+0700 [TRACE] agent: ccResolverWrapper: sending update to cc: {[{10.179.37.248:8300 0 tifweb-1.novalocal.dc1 }] }
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.995+0700 [TRACE] agent: addrConn: tryUpdateAddrs curAddr: { 0 }, addrs: [{10.179.37.248:8300 0 tifweb-1.novalocal.dc1 }]
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.995+0700 [INFO] agent.server: Adding LAN server: server="tifweb-1.novalocal (Addr: tcp/10.179.37.248:8300) (DC: dc1)"
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.995+0700 [INFO] agent.server: Raft data found, disabling bootstrap mode
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.995+0700 [INFO] agent: Started DNS server: address=127.0.0.1:8600 network=udp
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.995+0700 [INFO] agent: Started DNS server: address=10.179.37.248:8600 network=tcp
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.995+0700 [INFO] agent: Started DNS server: address=127.0.0.1:8600 network=tcp
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.995+0700 [TRACE] agent: ccResolverWrapper: sending update to cc: {[{10.179.37.248:8300 0 tifweb-1.novalocal.dc1 }] }
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.995+0700 [TRACE] agent: addrConn: tryUpdateAddrs curAddr: { 0 }, addrs: [{10.179.37.248:8300 0 tifweb-1.novalocal.dc1 }]
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.995+0700 [INFO] agent: Started DNS server: address=10.179.37.248:8600 network=udp
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.995+0700 [WARN] agent: grpc: addrConn.createTransport failed to connect to {10.179.37.248:8300 0 tifweb-1.novalocal.dc1 }. Err :connection error: desc = "transport: Error while dialing dial tcp 10.179.37.248:8300: operation was canceled". Reconnecting...
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.995+0700 [TRACE] agent: ccResolverWrapper: sending update to cc: {[{10.179.37.248:8300 0 tifweb-1.novalocal.dc1 }] }
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.995+0700 [TRACE] agent: addrConn: tryUpdateAddrs curAddr: { 0 }, addrs: [{10.179.37.248:8300 0 tifweb-1.novalocal.dc1 }]
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.995+0700 [INFO] agent.server: Handled event for server in area: event=member-join server=tifweb-1.novalocal.dc1 area=wan
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.996+0700 [INFO] agent: Starting server: address=10.179.37.248:8500 network=tcp protocol=http
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.996+0700 [INFO] agent: Starting server: address=127.0.0.1:8500 network=tcp protocol=http
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.996+0700 [WARN] agent: DEPRECATED Backwards compatibility with pre-1.9 metrics enabled. These metrics will be removed in a future version of Consul. Set telemetry { disable_compat_1.9 = true } to disable them.
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.996+0700 [INFO] agent: Retry join is supported for the following discovery methods: cluster=LAN discovery_methods="aliyun aws azure digitalocean gce k8s linode mdns os packet scaleway softlayer tencentcloud triton vsphere"
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.996+0700 [INFO] agent: Joining cluster...: cluster=LAN
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.996+0700 [INFO] agent: (LAN) joining: lan_addresses=[10.179.37.214]
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.996+0700 [INFO] agent: started state syncer
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.996+0700 [INFO] agent: Consul agent running!
sep 17 17:36:37 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:37.162+0700 [DEBUG] agent.http: Request finished: method=GET url=/v1/agent/members?segment=_all from=127.0.0.1:48472 latency=87.135µs
sep 17 17:36:39 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:39.418+0700 [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
sep 17 17:36:40 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:40.994+0700 [WARN] agent.server.raft: heartbeat timeout reached, starting election: last-leader=
sep 17 17:36:40 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:40.994+0700 [INFO] agent.server.raft: entering candidate state: node="Node at 10.179.37.248:8300 [Candidate]" term=144
sep 17 17:36:41 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:41.340+0700 [DEBUG] agent.server.raft: votes: needed=2
sep 17 17:36:41 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:41.340+0700 [DEBUG] agent.server.raft: vote granted: from=bdc2ffbf-44d1-5eb4-4f0c-517a8368d983 term=144 tally=1
sep 17 17:36:41 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:41.340+0700 [WARN] agent.server.raft: unable to get address for server, using fallback address: id=5e103976-5b77-b620-5e55-123bbcbb5884 fallback=10.179.37.214:8300 error="Could not find address for server id 5e103976-5b77-b620-5e55-123bbcbb5884"
sep 17 17:36:42 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:42.000+0700 [DEBUG] agent.server.memberlist.lan: memberlist: Failed to join 10.179.37.214: dial tcp 10.179.37.214:8301: i/o timeout
sep 17 17:36:42 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:42.000+0700 [WARN] agent: (LAN) couldn't join: number_of_nodes=0 error="1 error occurred:
sep 17 17:36:42 tifweb-1.novalocal consul[70258]: * Failed to join 10.179.37.214: dial tcp 10.179.37.214:8301: i/o timeout
sep 17 17:36:42 tifweb-1.novalocal consul[70258]: "
sep 17 17:36:42 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:42.000+0700 [WARN] agent: Join cluster failed, will retry: cluster=LAN retry_interval=30s error=
sep 17 17:36:50 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:50.560+0700 [WARN] agent.server.raft: Election timeout reached, restarting election
sep 17 17:36:50 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:50.560+0700 [INFO] agent.server.raft: entering candidate state: node="Node at 10.179.37.248:8300 [Candidate]" term=145
sep 17 17:36:50 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:50.865+0700 [DEBUG] agent.server.raft: votes: needed=2
sep 17 17:36:50 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:50.865+0700 [DEBUG] agent.server.raft: vote granted: from=bdc2ffbf-44d1-5eb4-4f0c-517a8368d983 term=145 tally=1
sep 17 17:36:50 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:50.865+0700 [WARN] agent.server.raft: unable to get address for server, using fallback address: id=5e103976-5b77-b620-5e55-123bbcbb5884 fallback=10.179.37.214:8300 error="Could not find address for server id 5e103976-5b77-b620-5e55-123bbcbb5884"
sep 17 17:36:51 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:51.341+0700 [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 5e103976-5b77-b620-5e55-123bbcbb5884 10.179.37.214:8300}" error="dial tcp 10.179.37.248:0->10.179.37.214:8300: i/o timeout"
sep 17 17:37:00 tifweb-1.novalocal consul[70258]: 2021-09-17T17:37:00.214+0700 [ERROR] agent: Coordinate update error: error="No cluster leader"
sep 17 17:37:00 tifweb-1.novalocal consul[70258]: 2021-09-17T17:37:00.335+0700 [WARN] agent.server.raft: Election timeout reached, restarting election
sep 17 17:37:00 tifweb-1.novalocal consul[70258]: 2021-09-17T17:37:00.335+0700 [INFO] agent.server.raft: entering candidate state: node="Node at 10.179.37.248:8300 [Candidate]" term=146
sep 17 17:37:00 tifweb-1.novalocal consul[70258]: 2021-09-17T17:37:00.458+0700 [INFO] agent: Newer Consul version available: new_version=1.10.2 current_version=1.10.1
sep 17 17:37:00 tifweb-1.novalocal consul[70258]: 2021-09-17T17:37:00.469+0700 [DEBUG] agent.server.raft: votes: needed=2
sep 17 17:37:00 tifweb-1.novalocal consul[70258]: 2021-09-17T17:37:00.469+0700 [DEBUG] agent.server.raft: vote granted: from=bdc2ffbf-44d1-5eb4-4f0c-517a8368d983 term=146 tally=1
sep 17 17:37:00 tifweb-1.novalocal consul[70258]: 2021-09-17T17:37:00.469+0700 [WARN] agent.server.raft: unable to get address for server, using fallback address: id=5e103976-5b77-b620-5e55-123bbcbb5884 fallback=10.179.37.214:8300 error="Could not find address for server id 5e103976-5b77-b620-5e55-123bbcbb5884"
sep 17 17:37:00 tifweb-1.novalocal consul[70258]: 2021-09-17T17:37:00.866+0700 [ERROR] agent.server.raft: failed to make requestVote RPC: target="{Voter 5e103976-5b77-b620-5e55-123bbcbb5884 10.179.37.214:8300}" error="dial tcp 10.179.37.248:0->10.179.37.214:8300: i/o timeout"
sep 17 17:37:02 tifweb-1.novalocal consul[70258]: 2021-09-17T17:37:02.042+0700 [ERROR] agent.anti_entropy: failed to sync remote state: error="No cluster leader"
sep 17 17:37:02 tifweb-1.novalocal consul[70258]: 2021-09-17T17:37:02.114+0700 [ERROR] agent.http: Request error: method=GET url=/v1/operator/raft/configuration from=127.0.0.1:48474 error="No cluster leader"
sep 17 17:37:02 tifweb-1.novalocal consul[70258]: 2021-09-17T17:37:02.114+0700 [DEBUG] agent.http: Request finished: method=GET url=/v1/operator/raft/configuration from=127.0.0.1:48474 latency=7.147320577s
sep 17 17:37:07 tifweb-1.novalocal consul[70258]: 2021-09-17T17:37:07.385+0700 [WARN] agent.server.raft: Election timeout reached, restarting election
I see this line in the log:
sep 17 17:36:31 tifweb-1.novalocal consul[70258]: 2021-09-17T17:36:31.995+0700 [INFO] agent.server: Raft data found, disabling bootstrap mode
I guess Consul will not initialize the cluster again.
Thanks!
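The raft DEBUG lines in the restart log show each election stalling at `tally=1` while `needed=2`. As a rough aid for reading such logs, here is a Python sketch that collects the best tally per term. The function name and regular expressions are my own and only assume the message text seen in the log above, not any stable Consul log format.

```python
import re

# Patterns taken from the message text in the restart log above (an assumption,
# not a documented format).
VOTES_NEEDED = re.compile(r"votes: needed=(\d+)")
VOTE_GRANTED = re.compile(r"vote granted: .*term=(\d+) tally=(\d+)")


def election_summary(lines):
    """Return {term: (highest_tally, votes_needed)} for elections in the log."""
    needed = None
    summary = {}
    for line in lines:
        match = VOTES_NEEDED.search(line)
        if match:
            needed = int(match.group(1))
            continue
        match = VOTE_GRANTED.search(line)
        if match:
            term, tally = int(match.group(1)), int(match.group(2))
            previous = summary.get(term, (0, needed))[0]
            summary[term] = (max(previous, tally), needed)
    return summary


if __name__ == "__main__":
    sample = [
        "[DEBUG] agent.server.raft: votes: needed=2",
        "[DEBUG] agent.server.raft: vote granted: "
        "from=bdc2ffbf-44d1-5eb4-4f0c-517a8368d983 term=144 tally=1",
    ]
    for term, (tally, needed) in election_summary(sample).items():
        print(f"term={term} tally={tally} needed={needed}")
```

A term whose tally stays below the needed count never produced a leader, which matches the repeated "Election timeout reached, restarting election" warnings.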
Overview of the Issue
Hello. I have a broken Consul cluster with no leader. Stale reads succeed at first, but after restarting the remaining Consul server the data becomes inaccessible. Is it possible to read stale (cached) data from a broken Consul cluster after a restart? Thanks.
Reproduction Steps
Create a broken cluster:
Check the availability of stale data:
Restart the service:
Trying to get data (unsuccessfully):
Consul info for both Client and Server
Client/Server info
```
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 1
	services = 1
build:
	prerelease =
	revision = db839f18
	version = 1.10.1
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr =
	server = true
raft:
	applied_index = 49162
	commit_index = 0
	fsm_pending = 0
	last_contact = never
	last_log_index = 49757
	last_log_term = 17401
	last_snapshot_index = 49162
	last_snapshot_term = 17278
	latest_configuration = [{Suffrage:Voter ID:4e373bb7-602a-c849-f6bf-270e1c990ac4 Address:10.179.37.210:8300} {Suffrage:Voter ID:96a36e0a-3f87-639d-c9cd-c8e75e829484 Address:10.179.37.156:8300}]
	latest_configuration_index = 0
	num_peers = 1
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Candidate
	term = 17426
runtime:
	arch = amd64
	cpu_count = 3
	goroutines = 104
	max_procs = 3
	os = linux
	version = go1.16.6
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 21
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 2958
	members = 1
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1494
	members = 1
	query_queue = 0
	query_time = 1
```
Operating system and Environment details
Log Fragments