hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Repeated join and leave events for Consul agent on Windows #7293

Open · JacobCalmes opened this issue 4 years ago

JacobCalmes commented 4 years ago

Overview of the Issue

Some of our agents running Windows are experiencing random leave and join events, causing a service to completely de-register and come back after about 30 seconds. The odd part is that a host will appear with both an uppercase node name and a lowercase node name, and the node IDs are always different. For example, this is seen in a detailed member list:

MNP-OSW-APP02                172.16.113.247:8301   alive    acls=1,build=1.6.1:9be6dfc3,dc=tcf-dc1-prd,id=96c6c19f-706c-0040-9ac3-6a941bd9194b,role=node,segment=<default>,vsn=2,vsn_max=3,vsn_min=2
MNP-OSW-WEB01                172.16.34.25:8301     alive    acls=1,build=1.6.1:9be6dfc3,dc=tcf-dc1-prd,id=67ce4930-25a1-2372-6ca0-e6e9b1f3af93,role=node,segment=<default>,vsn=2,vsn_max=3,vsn_min=2
mnp-osw-app02                172.16.113.247:8301   left     acls=1,build=1.6.1:9be6dfc3,dc=tcf-dc1-prd,id=77a3b884-d38b-597a-4054-fcb3d77097cc,role=node,segment=<default>,vsn=2,vsn_max=3,vsn_min=2
mnp-osw-web01                172.16.34.25:8301     left     acls=1,build=1.6.1:9be6dfc3,dc=tcf-dc1-prd,id=b745b322-aa18-2f04-5cc2-fcb497903cc1,role=node,segment=<default>,vsn=2,vsn_max=3,vsn_min=2

The uppercase node is the correct one. The Serf Health Status check always passes, but our wmi-exporter service will disappear and reappear along with its health check.
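The case-only duplicates can be surfaced mechanically from the member list. A minimal sketch (the `find_case_duplicates` helper and the abbreviated sample lines are illustrative, not part of Consul) that groups `consul members -detailed` output by lowercased node name:

```python
from collections import defaultdict

# Abbreviated sample of the member list shown above (name, address, status).
members_output = """\
MNP-OSW-APP02  172.16.113.247:8301  alive
MNP-OSW-WEB01  172.16.34.25:8301    alive
mnp-osw-app02  172.16.113.247:8301  left
mnp-osw-web01  172.16.34.25:8301    left
"""

def find_case_duplicates(output: str) -> dict:
    """Group member lines by lowercased node name; keep case-only duplicates."""
    by_name = defaultdict(list)
    for line in output.splitlines():
        name, addr, status = line.split()[:3]
        by_name[name.lower()].append((name, status))
    return {k: v for k, v in by_name.items() if len(v) > 1}

print(find_case_duplicates(members_output))
```

Running this against a full member dump makes it easy to confirm that every duplicate pair is one `alive` uppercase entry and one `left` lowercase entry.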

We are running 3 Consul servers on CentOS 7 with agents running varying distros of Linux and versions of Windows.

Reproduction Steps

Reproduction seems random and only happens on Windows servers. Setting node_name in the configuration sometimes helps, but not always. We have also had some success with completely wiping the Consul installation, running force-leave with prune, and restarting Consul on the node, but in most cases the issue recurs some time later.

Consul info for both Client and Server

Client info:

```
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 1
	services = 1
build:
	prerelease =
	revision = 9be6dfc3
	version = 1.6.1
consul:
	acl = enabled
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 4
	goroutines = 47
	max_procs = 4
	os = windows
	version = go1.12.1
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 36
	failed = 53
	health_score = 0
	intent_queue = 0
	left = 24
	member_time = 163054
	members = 2635
	query_queue = 0
	query_time = 918
```
Server info:

```
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 2
build:
	prerelease =
	revision = 1200f25e
	version = 1.6.2
consul:
	acl = enabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 172.16.34.161:8300
	server = true
raft:
	applied_index = 23259753
	commit_index = 23259753
	fsm_pending = 0
	last_contact = 0
	last_log_index = 23259753
	last_log_term = 375
	last_snapshot_index = 23258826
	last_snapshot_term = 375
	latest_configuration = [{Suffrage:Voter ID:dc94ced3-c768-b8ed-4af4-78d71392433f Address:172.16.34.161:8300} {Suffrage:Voter ID:08ec87b0-60f5-0d2d-26f1-ed07c1d8a782 Address:172.16.34.162:8300} {Suffrage:Voter ID:908ee8fc-721d-1938-1378-40c3080714a8 Address:172.16.34.160:8300}]
	latest_configuration_index = 20229906
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 375
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 6479
	max_procs = 2
	os = linux
	version = go1.12.13
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 36
	failed = 181
	health_score = 0
	intent_queue = 0
	left = 19
	member_time = 162973
	members = 2759
	query_queue = 0
	query_time = 918
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 95
	members = 3
	query_queue = 0
	query_time = 1
```

Operating system and Environment details

Server:

{
  "data_dir": "/var/lib/consul",
  "client_addr": "0.0.0.0",
  "bind_addr": "0.0.0.0",
  "node_name": "mnp-dist-app11.corp.tcf.biz",
  "advertise_addr": "172.16.34.161",
  "datacenter": "tcf-dc1-prd",
  "primary_datacenter": "tcf-dc1-prd",
  "telemetry": {
    "prometheus_retention_time": "30s",
    "disable_hostname": true
  },
  "node_meta": {
    "team": "system-insights",
    "environment": "prod"
  },
  "performance": {
    "raft_multiplier": 5
  },
  "dns_config": {
    "allow_stale": true
  },
  "discovery_max_stale": "5s",
  "retry_join": [
    "mnp-dist-app12.corp.tcf.biz","mnp-dist-app11.corp.tcf.biz","mnp-dist-app10.corp.tcf.biz"
  ],
  "encrypt": "...",
  "acl": {
    "enabled": true,
    "default_policy": "deny",
    "down_policy": "extend-cache",
    "tokens": {
      "default": "...",
      "master": "...",
      "agent": ".."
    }
  }
}

Agent:

{
  "advertise_addr": "{{ GetDefaultInterfaces | exclude \"type\" \"IPv6\" | limit 1 | attr \"address\" }}",
  "node_name": "MNP-OSW-APP02",
  "data_dir": "C:/insights-client/consul/data",
  "log_file": "C:/insights-client/consul/logs/consul.log",
  "log_level": "err",
  "log_rotate_duration": "24h",
  "log_rotate_max_files": 7,
  "datacenter": "tcf-dc1-prd",
  "primary_datacenter": "tcf-dc1-prd",
  "client_addr": "127.0.0.1",
  "bind_addr": "0.0.0.0",
  "disable_update_check": true,
  "rejoin_after_leave": true,
  "ports": {
    "http": 8500,
    "dns": 8600,
    "serf_lan": 8301,
    "serf_wan": 8302
  },
  "retry_join": [
    "mnp-dist-app12.corp.tcf.biz","mnp-dist-app11.corp.tcf.biz","mnp-dist-app10.corp.tcf.biz"
  ],
  "encrypt": "...",
  "acl": {
    "enabled": true,
    "default_policy": "deny",
    "down_policy": "extend-cache",
    "tokens": {
      "default": "...",
      "agent": "..."
    }
  }
}

Log Fragments

Server leader:

2020/02/13 11:22:29 [INFO] consul: member 'MNP-OSW-APP02' joined, marking health alive
2020/02/13 12:00:42 [INFO] serf: EventMemberJoin: mnp-osw-app02 172.16.113.247
2020/02/13 12:00:56 [INFO] memberlist: Marking mnp-osw-app02 as failed, suspect timeout reached (2 peer confirmations)
2020/02/13 12:00:56 [INFO] serf: EventMemberFailed: mnp-osw-app02 172.16.113.247
2020/02/13 12:00:56 [INFO] consul: member 'mnp-osw-app02' failed, marking health critical
2020/02/13 12:01:31 [INFO] consul: member 'MNP-OSW-APP02' joined, marking health alive
2020/02/13 12:01:31 [INFO] consul: member 'mnp-osw-app02' failed, marking health critical
2020/02/13 12:02:29 [INFO] consul: member 'MNP-OSW-APP02' joined, marking health alive
2020/02/13 12:02:29 [INFO] consul: member 'mnp-osw-app02' failed, marking health critical
2020/02/13 12:03:30 [INFO] consul: member 'MNP-OSW-APP02' joined, marking health alive
2020/02/13 12:03:30 [INFO] consul: member 'mnp-osw-app02' failed, marking health critical
2020/02/13 12:04:31 [INFO] consul: member 'MNP-OSW-APP02' joined, marking health alive
2020/02/13 12:04:31 [INFO] consul: member 'mnp-osw-app02' failed, marking health critical
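The leader log shows a roughly once-per-minute joined/failed cycle between the two casings of the same host. A small sketch (the regex and abbreviated log sample are illustrative) that tallies those transitions per lowercased node name:

```python
import re
from collections import Counter

# Abbreviated sample of the leader log above.
log = """\
2020/02/13 12:01:31 [INFO] consul: member 'MNP-OSW-APP02' joined, marking health alive
2020/02/13 12:01:31 [INFO] consul: member 'mnp-osw-app02' failed, marking health critical
2020/02/13 12:02:29 [INFO] consul: member 'MNP-OSW-APP02' joined, marking health alive
2020/02/13 12:02:29 [INFO] consul: member 'mnp-osw-app02' failed, marking health critical
"""

# Count (node, event) pairs, folding both casings onto one key so the
# flapping between 'MNP-OSW-APP02' and 'mnp-osw-app02' shows up together.
transitions = Counter(
    (name.lower(), event)
    for name, event in re.findall(r"member '([^']+)' (joined|failed)", log)
)
print(transitions)
```

A balanced joined/failed count per node, repeating at a fixed interval, is the signature of the duplicate-registration flapping rather than real network failures.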
JacobCalmes commented 4 years ago

The duplicate nodes went away after about 48 hours (our reconnect timeout is the default 72 hours) once the node name was specified in the configuration file. This seems to have fixed the issue, but the fact that it took a few days even after attempting force-leave -prune seems off. Combined with the undesired behavior when the node name is left blank on Windows, this leads me to believe node_name is not an optional configuration on Windows.
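When node_name is unset, the agent falls back to the OS hostname, and the casing reported for a Windows hostname is not guaranteed to be stable across sources. A hypothetical normalization sketch (the `canonical_node_name` helper is an assumption, not Consul code) for deriving a casing-stable name before templating the agent config:

```python
import socket

def canonical_node_name() -> str:
    # Hypothetical normalization: take the short hostname and force
    # lowercase, so repeated registrations cannot differ only by case.
    return socket.gethostname().split(".")[0].lower()

print(canonical_node_name())
```

Writing the result into the config's `node_name` field on provisioning would sidestep any dependence on how the OS reports its hostname at agent start.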

JacobCalmes commented 4 years ago

After a Consul server restart, this issue is causing snapshot restores to fail with the following message:

Feb 26 11:48:34 mnp-dist-app10 consul[168876]: 2020/02/26 11:48:34 [ERROR] raft: Failed to restore snapshot: failed to restore snapshot 375-26025211-1582739313972: check node "ILP-MT14-SQL01" does not match node "ilp-mt14-sql01"

This prevents the Consul node from fully starting. Waiting for the error message and then deleting the duplicate client with force-leave -prune seems to just move on to the next duplicated client.
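Rather than chasing duplicates one error at a time, the cleanup commands could be generated up front from a member dump. A sketch (the sample member list is illustrative; `consul force-leave -prune` is the command already used above) that emits one prune command per lowercase duplicate stuck in the "left" state:

```python
from collections import defaultdict

# (name, status) pairs as reported by `consul members`.
members = [
    ("MNP-OSW-APP02", "alive"),
    ("mnp-osw-app02", "left"),
    ("MNP-OSW-WEB01", "alive"),
    ("mnp-osw-web01", "left"),
]

by_name = defaultdict(list)
for name, status in members:
    by_name[name.lower()].append((name, status))

# For every name present under more than one casing, prune the entry
# that is in the "left" state, keeping the live registration.
commands = [
    f"consul force-leave -prune {name}"
    for entries in by_name.values() if len(entries) > 1
    for name, status in entries if status == "left"
]
print(commands)
```

This only targets case-duplicated entries that have already left, so a healthy node that merely shares an address is never touched.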