Open JacobCalmes opened 4 years ago
The duplicate nodes went away after about 48 hours (our reconnect timeout is the default 72 hours) once we specified the node name in the configuration file. This seems to have fixed the issue, but the fact that it took a few days even after attempting `force-leave -prune` seems off. Combined with the blank node name on Windows causing undesired behavior, this leads me to believe node name is not an optional configuration on Windows.
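For anyone hitting the same thing, pinning the node name explicitly in the agent configuration is the workaround that helped us. A minimal sketch of the relevant config fragment — the name and `data_dir` values here are illustrative placeholders, not our actual environment:

```json
{
  "node_name": "ilp-mt14-sql01",
  "data_dir": "C:\\consul\\data"
}
```

We set the name in one consistent case so the agent never falls back to whatever casing Windows reports for the hostname.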
After a Consul server restart, this issue causes snapshot restores to fail with the following message:

```
Feb 26 11:48:34 mnp-dist-app10 consul[168876]: 2020/02/26 11:48:34 [ERROR] raft: Failed to restore snapshot: failed to restore snapshot 375-26025211-1582739313972: check node "ILP-MT14-SQL01" does not match node "ilp-mt14-sql01"
```

This prevents the Consul node from fully starting. Waiting for the error message and then deleting the duplicate client with `force-leave -prune` just moves on to the next duplicated client.
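For reference, the cleanup step we repeat for each duplicate looks like the sketch below. The node name is a placeholder, and the command is only echoed here so it can be reviewed before running (`-prune` requires a recent Consul; our servers are on 1.6.2):

```shell
#!/bin/sh
# Placeholder: substitute the duplicate entry shown by `consul members`.
NODE="ilp-mt14-sql01"

# Force-leave removes the stale member; -prune also drops it from the
# member list entirely. Echoed rather than executed for review.
echo "consul force-leave -prune ${NODE}"
```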
Overview of the Issue
Some of our agents running Windows are experiencing random leave and join events, causing a service to completely de-register and come back after about 30 seconds. The odd part is that a single host will appear with both an uppercase node name and a lowercase node name; the node IDs are always different. For example, this is seen in a detailed member list:
The uppercase node is the correct one. The Serf Health Status always passes, but our wmi-exporter service disappears and reappears along with its health check.
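To spot affected hosts quickly, a short script can flag member names that differ only by case. A minimal sketch, assuming `consul members` output is piped in on stdin (the script name and sample names are illustrative):

```python
import sys
from collections import defaultdict


def find_case_collisions(names):
    """Group node names that are identical ignoring case; return only the
    groups with more than one distinct spelling (likely duplicate entries)."""
    groups = defaultdict(set)
    for name in names:
        groups[name.lower()].add(name)
    return {key: sorted(variants) for key, variants in groups.items()
            if len(variants) > 1}


if __name__ == "__main__":
    # The first column of `consul members` output is the node name;
    # skip the header line.
    lines = [line for line in sys.stdin.readlines()[1:] if line.strip()]
    names = [line.split()[0] for line in lines]
    for key, variants in find_case_collisions(names).items():
        print(key, "->", ", ".join(variants))
```

Usage would be something like `consul members | python find_dupes.py`, which prints one line per host that has duplicate entries differing only in case.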
We are running 3 Consul servers on CentOS 7 with agents running varying distros of Linux and versions of Windows.
Reproduction Steps
Reproduction seems random but only happens on Windows servers. Setting `node_name` in the configuration seems to help sometimes, but not always. We have also tried completely wiping the Consul installation, running `force-leave` with `-prune`, and restarting Consul on the node, with some success, but in most cases it happens again some time later.
Consul info for both Client and Server
Client info
```
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 1
	services = 1
build:
	prerelease = 
	revision = 9be6dfc3
	version = 1.6.1
consul:
	acl = enabled
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 4
	goroutines = 47
	max_procs = 4
	os = windows
	version = go1.12.1
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 36
	failed = 53
	health_score = 0
	intent_queue = 0
	left = 24
	member_time = 163054
	members = 2635
	query_queue = 0
	query_time = 918
```
Server info
```
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 2
build:
	prerelease = 
	revision = 1200f25e
	version = 1.6.2
consul:
	acl = enabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 172.16.34.161:8300
	server = true
raft:
	applied_index = 23259753
	commit_index = 23259753
	fsm_pending = 0
	last_contact = 0
	last_log_index = 23259753
	last_log_term = 375
	last_snapshot_index = 23258826
	last_snapshot_term = 375
	latest_configuration = [{Suffrage:Voter ID:dc94ced3-c768-b8ed-4af4-78d71392433f Address:172.16.34.161:8300} {Suffrage:Voter ID:08ec87b0-60f5-0d2d-26f1-ed07c1d8a782 Address:172.16.34.162:8300} {Suffrage:Voter ID:908ee8fc-721d-1938-1378-40c3080714a8 Address:172.16.34.160:8300}]
	latest_configuration_index = 20229906
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 375
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 6479
	max_procs = 2
	os = linux
	version = go1.12.13
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 36
	failed = 181
	health_score = 0
	intent_queue = 0
	left = 19
	member_time = 162973
	members = 2759
	query_queue = 0
	query_time = 918
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 95
	members = 3
	query_queue = 0
	query_time = 1
```
Operating system and Environment details
Server:
Agent:
Log Fragments
Server leader: