hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Cluster join i/o failure leading to slow client joins (over time) #11010

Open pln-rec opened 3 years ago

pln-rec commented 3 years ago

Overview of the Issue

As part of our infrastructure, we are using a Consul cluster with three servers and a number of client agents that frequently join and leave the cluster. After (re-)starting the cluster (i.e. starting all servers), client agents join quickly as usual. But after several hours of joining and leaving, most and eventually all join operations become slower (taking around 10s or more), although they ultimately succeed. This hinders fast scale-out of our services, because the main service needs to wait for the Consul client to join the cluster before it can use service discovery.
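
For illustration only (this is not our actual service code, just a minimal sketch using the standard Go API client github.com/hashicorp/consul/api; the address and certificate path are placeholders taken from the sidecar config further below), the wait before using service discovery looks roughly like this:

```go
// Minimal sketch (placeholder address/path, not our real service code):
// block until the local sidecar agent has joined the LAN gossip pool before
// starting to use service discovery.
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func waitForJoin(client *api.Client, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		// LAN members as seen by this agent; "more than just ourselves"
		// is used here as a simple heuristic for "joined".
		members, err := client.Agent().Members(false)
		if err == nil && len(members) > 1 {
			return nil
		}
		time.Sleep(500 * time.Millisecond)
	}
	return fmt.Errorf("agent did not join the cluster within %s", timeout)
}

func main() {
	cfg := api.DefaultConfig()
	cfg.Address = "127.0.0.1:8500" // HTTPS port from the sidecar config below
	cfg.Scheme = "https"
	cfg.TLSConfig = api.TLSConfig{CAFile: "/consul/extra/consulAgent.ca.pem"} // placeholder path
	client, err := api.NewClient(cfg)
	if err != nil {
		log.Fatal(err)
	}
	if err := waitForJoin(client, 30*time.Second); err != nil {
		log.Fatal(err)
	}
	fmt.Println("agent joined; service discovery can be used now")
}
```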

We are essentially looking for advice on how to narrow down and further investigate the problem. What could be the cause of this slowly "degrading" behavior?

Reproduction Steps

Unfortunately, we have not yet been able to reproduce the issue in a smaller environment.

Consul info for both Client and Server

The clients showing this problem run as containers on ECS Fargate (see below), and we lack a practical way to execute consul info there. This is the configuration of the ECS sidecar Consul clients:

{
  "bind_addr": "{{GetPrivateIP}}",
  "ca_file": "/consul/extra/consulAgent.ca.pem",
  "cert_file": "/consul/extra/consulAgent.crt",
  "datacenter": "dcname",
  "disable_update_check": true,
  "encrypt": "*******************",
  "key_file": "/consul/extra/consulAgent.key",
  "leave_on_terminate": true,
  "ports": {
    "dns": -1,
    "http": -1,
    "https": 8500
  },
  "rejoin_after_leave": true,
  "retry_join": [
    "provider=\"aws\" region=\"us-east-1\" addr_type=\"private_v4\" tag_key=\"aws:cloudformation:stack-name\" tag_value=\"PODNAMECC001\""
  ],
  "server": false,
  "ui": false,
  "verify_outgoing": true,
  "log_level": "trace"
}
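
Not something we currently do, but as a possible substitute for consul info on Fargate: the Stats section of the agent's self endpoint exposes roughly the same counters and could be read over the task's HTTPS API from another host in the VPC. A rough sketch with the Go API client (task IP and certificate path are placeholders):

```go
// Rough sketch (task IP and certificate path are placeholders): read the
// sidecar agent's stats, which roughly mirror the `consul info` output, over
// its HTTPS API from another host instead of exec'ing into the Fargate task.
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	cfg := api.DefaultConfig()
	cfg.Address = "TASK_PRIVATE_IP:8500" // the sidecar's HTTPS port from the config above
	cfg.Scheme = "https"
	cfg.TLSConfig = api.TLSConfig{CAFile: "consulAgent.ca.pem"} // placeholder

	client, err := api.NewClient(cfg)
	if err != nil {
		log.Fatal(err)
	}
	self, err := client.Agent().Self()
	if err != nil {
		log.Fatal(err)
	}
	// The "Stats" section contains the agent/build/consul/serf_lan blocks
	// that `consul info` prints.
	for section, values := range self["Stats"] {
		fmt.Println(section, "=", values)
	}
}
```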
Server info

```
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease =
	revision = db839f18
	version = 1.10.1
consul:
	acl = disabled
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = ***.**.109.107:8300
	server = true
raft:
	applied_index = 4349
	commit_index = 4349
	fsm_pending = 0
	last_contact = 0
	last_log_index = 4349
	last_log_term = 2
	last_snapshot_index = 0
	last_snapshot_term = 0
	latest_configuration = [{Suffrage:Voter ID:134c79d3-e2f9-d4e0-a547-fec5616cbdc9 Address:***.**.109.107:8300} {Suffrage:Voter ID:6cdd57c1-ca31-f0b7-2164-0cd5d623d183 Address:***.**.108.218:8300} {Suffrage:Voter ID:3eee12ed-9eae-b664-b545-90c8658d91dd Address:***.**.109.12:8300}]
	latest_configuration_index = 0
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 2
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 151
	max_procs = 2
	os = linux
	version = go1.16.6
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 2
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 132
	member_time = 1652
	members = 140
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 4
	members = 3
	query_queue = 0
	query_time = 1
```

Operating system and Environment details

Consul version: 1.10.1
Environment: AWS, Consul runs inside a VPC (in the example below: IPv4 CIDR ***.**.108.0/23).
Servers: 3 Consul servers running on 3 Linux (Ubuntu 18.04.5 LTS) EC2 machines (ASG, but configured stable at 3 instances).
Clients: Clients are running on other EC2 instances (Windows, static) or as a sidecar container (Linux) in ECS Fargate (based on the Consul Docker image). Because of up/down scaling of our services running in ECS, the sidecar client agents are joining and leaving frequently (around 30-250 times an hour).
Security Groups: The relevant security groups allow for TCP and UDP access (ingress and egress) on ports 8300, 8301 and 8302.
Note: Since this VPC has a rather small address space, IP addresses for the ECS client agents are re-used eventually.
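
A small diagnostic we could run against an agent (a hedged sketch assuming a reachable agent with default API settings, not something taken from the logs above) to see whether a reused IP address still has a lingering left/failed member entry in the LAN gossip pool:

```go
// Hedged diagnostic sketch: list LAN members grouped by IP address to spot
// addresses that were reused by new tasks while an older left/failed member
// entry still lingers in the gossip pool.
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

// serf member status codes as exposed by the agent API
var statusName = map[int]string{1: "alive", 2: "leaving", 3: "left", 4: "failed"}

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	members, err := client.Agent().Members(false)
	if err != nil {
		log.Fatal(err)
	}
	byAddr := map[string][]*api.AgentMember{}
	for _, m := range members {
		byAddr[m.Addr] = append(byAddr[m.Addr], m)
	}
	for addr, ms := range byAddr {
		if len(ms) > 1 {
			fmt.Printf("address %s appears %d times:\n", addr, len(ms))
			for _, m := range ms {
				fmt.Printf("  %s (%s)\n", m.Name, statusName[m.Status])
			}
		}
	}
}
```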

Log Fragments

In this example, the servers have the IP addresses ***.**.109.107, ***.**.109.12, and ***.**.108.218.

Client joining (slow)

==> Starting Consul agent...
           Version: '1.10.1'
           Node ID: 'e7b61cf6-3a71-4a6a-e499-b46793d1affa'
         Node name: 'ip-***-**-109-93.ec2.internal'
        Datacenter: 'dcname' (Segment: '')
            Server: false (Bootstrap: false)
       Client Addr: [0.0.0.0] (HTTP: -1, HTTPS: 8500, gRPC: 8502, DNS: -1)
      Cluster Addr: ***.**.109.93 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: true, TLS-Outgoing: true, TLS-Incoming: false, Auto-Encrypt-TLS: false
==> Log data will now stream in as it occurs:

2021-09-09T13:55:22.332Z [INFO]  agent: Joining cluster...: cluster=LAN
...
2021-09-09T13:55:22.674Z [INFO]  agent: (LAN) joining: lan_addresses=[***.**.109.107, ***.**.109.12, ***.**.108.218]
2021-09-09T13:55:32.674Z [DEBUG] agent.client.memberlist.lan: memberlist: Failed to join ***.**.109.107: dial tcp ***.**.109.107:8301: i/o timeout
2021-09-09T13:55:32.676Z [DEBUG] agent.client.memberlist.lan: memberlist: Initiating push/pull sync with:  ***.**.109.12:8301
...
2021-09-09T13:55:32.693Z [INFO]  agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=2

There is a 10-second gap between the "Joining cluster" message and the following error message (logged as DEBUG) about the i/o timeout. After that, the join completes virtually immediately. Note that the problem can be observed with all three servers, depending on which server the client agent tries to join first.
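
To check whether the raw TCP handshake to the serf port itself hangs for those 10 seconds (which would point at dropped packets rather than something inside Consul; memberlist's default TCP join timeout appears to be 10 seconds, matching the gap), a trivial probe like the following could be run from the same network. This is a hypothetical helper, not part of Consul:

```go
// Hypothetical connectivity probe (not part of Consul): dial the serf LAN port
// directly to see whether the TCP handshake itself hangs for ~10s (dropped SYN)
// or completes quickly (in which case the delay happens higher up the stack).
package main

import (
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	addr := os.Args[1] // e.g. "SERVER_IP:8301" (placeholder)
	start := time.Now()
	conn, err := net.DialTimeout("tcp", addr, 10*time.Second)
	if err != nil {
		fmt.Printf("dial failed after %s: %v\n", time.Since(start), err)
		return
	}
	conn.Close()
	fmt.Printf("connected in %s\n", time.Since(start))
}
```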

After this, the client leaves again gracefully as intended (left in the memberlist).

Here is the full log.

Server log for the same timeframe

This is a log excerpt of Consul server ***.**.109.107 (no messages about client ***.**.109.93 before this).

2021-09-09T13:55:31.458Z [INFO]  agent.server.serf.lan: serf: EventMemberJoin: ip-***-**-109-93.ec2.internal ***.**.109.93
2021-09-09T13:55:31.458Z [DEBUG] agent.server.serf.lan: serf: messageJoinType: ip-***-**-109-93.ec2.internal
2021-09-09T13:55:31.458Z [INFO]  agent.server: member joined, marking health alive: member=ip-***-**-109-93.ec2.internal
2021-09-09T13:55:31.584Z [DEBUG] agent.server.serf.lan: serf: messageJoinType: ip-***-**-109-93.ec2.internal
2021-09-09T13:55:31.637Z [DEBUG] agent.server.serf.lan: serf: messageJoinType: ip-***-**-109-93.ec2.internal
2021-09-09T13:55:31.784Z [DEBUG] agent.server.serf.lan: serf: messageJoinType: ip-***-**-109-93.ec2.internal
2021-09-09T13:55:31.784Z [DEBUG] agent.server.serf.lan: serf: messageJoinType: ip-***-**-109-93.ec2.internal
2021-09-09T13:55:31.830Z [DEBUG] agent.server.serf.lan: serf: messageJoinType: ip-***-**-109-93.ec2.internal
2021-09-09T13:55:31.831Z [DEBUG] agent.server.serf.lan: serf: messageJoinType: ip-***-**-109-93.ec2.internal
2021-09-09T13:55:31.837Z [DEBUG] agent.server.serf.lan: serf: messageJoinType: ip-***-**-109-93.ec2.internal
2021-09-09T13:55:32.031Z [DEBUG] agent.server.serf.lan: serf: messageJoinType: ip-***-**-109-93.ec2.internal
2021-09-09T13:55:32.037Z [DEBUG] agent.server.serf.lan: serf: messageJoinType: ip-***-**-109-93.ec2.internal
2021-09-09T13:55:32.039Z [DEBUG] agent.server.serf.lan: serf: messageJoinType: ip-***-**-109-93.ec2.internal
2021-09-09T13:55:32.039Z [DEBUG] agent.server.serf.lan: serf: messageJoinType: ip-***-**-109-93.ec2.internal
2021-09-09T13:55:33.252Z [TRACE] agent.tlsutil: IncomingRPCConfig: version=2
2021-09-09T13:55:33.252Z [TRACE] agent.tlsutil: IncomingRPCConfig: version=2
2021-09-09T13:55:33.307Z [TRACE] agent.tlsutil: IncomingRPCConfig: version=2
2021-09-09T13:55:33.307Z [TRACE] agent.tlsutil: IncomingRPCConfig: version=2

Here is a bigger log excerpt. What's surprising about this log is that there are several server rebalancings during the relatively short timeframe of around 8 minutes, and the number_of_servers=1 seems odd as well because the cluster has 3 servers. Is this normal behavior?
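
For comparison (a hedged sketch, again assuming a reachable agent with default API settings), the raft view on the server side can be cross-checked against the number_of_servers value from the rebalancing lines:

```go
// Hedged sketch: Status().Leader() and Status().Peers() report the raft leader
// and peer set as the servers see it, which can be compared against the
// number_of_servers value in the client's rebalancing log lines.
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	leader, err := client.Status().Leader()
	if err != nil {
		log.Fatal(err)
	}
	peers, err := client.Status().Peers()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("raft leader:", leader)
	fmt.Println("raft peers :", peers) // should list all three server addresses
}
```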

Example of a fast join

Shortly after the restart of the Consul servers, joins are fast, like this one:

==> Starting Consul agent...
           Version: '1.10.1'
           Node ID: '46f36f54-467b-0dfd-3c72-270471e12c26'
         Node name: 'ip-***-**-109-44.ec2.internal'
        Datacenter: 'dcname' (Segment: '')
            Server: false (Bootstrap: false)
       Client Addr: [0.0.0.0] (HTTP: -1, HTTPS: 8500, gRPC: 8502, DNS: -1)
      Cluster Addr: ***.**.109.44 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: true, TLS-Outgoing: true, TLS-Incoming: false, Auto-Encrypt-TLS: false
==> Log data will now stream in as it occurs:
2021-09-09T10:02:12.689Z [INFO]  agent: Consul agent running!
...
2021-09-09T10:02:13.058Z [INFO]  agent: Discovered servers: cluster=LAN cluster=LAN servers="***.**.109.107 ***.**.109.12 ***.**.108.218"
2021-09-09T10:02:13.058Z [INFO]  agent: (LAN) joining: lan_addresses=[***.**.109.107, ***.**.109.12, ***.**.108.218]
2021-09-09T10:02:13.059Z [DEBUG] agent.client.memberlist.lan: memberlist: Initiating push/pull sync with:  ***.**.109.107:8301
2021-09-09T10:02:13.061Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: ip-***-**-109-107 ***.**.109.107
2021-09-09T10:02:13.061Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: ip-***-**-109-12 ***.**.109.12
2021-09-09T10:02:13.061Z [INFO]  agent.client: adding server: server="ip-***-**-109-107 (Addr: tcp/***.**.109.107:8300) (DC: dcname)"
2021-09-09T10:02:13.061Z [INFO]  agent.client: adding server: server="ip-***-**-109-12 (Addr: tcp/***.**.109.12:8300) (DC: dcname)"
2021-09-09T10:02:13.061Z [INFO]  agent.client.serf.lan: serf: EventMemberJoin: ip-***-**-108-218 ***.**.108.218
2021-09-09T10:02:13.062Z [INFO]  agent.client: adding server: server="ip-***-**-108-218 (Addr: tcp/***.**.108.218:8300) (DC: dcname)"
2021-09-09T10:02:13.062Z [DEBUG] agent.client.memberlist.lan: memberlist: Initiating push/pull sync with:  ***.**.109.12:8301
2021-09-09T10:02:13.062Z [WARN]  agent: grpc: addrConn.createTransport failed to connect to {***.**.109.12:8300 0 ip-***-**-109-12 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp ***.**.109.12:8300: operation was canceled". Reconnecting...
2021-09-09T10:02:13.065Z [DEBUG] agent.client.memberlist.lan: memberlist: Initiating push/pull sync with:  ***.**.108.218:8301
2021-09-09T10:02:13.078Z [INFO]  agent: (LAN) joined: number_of_nodes=3
2021-09-09T10:02:13.078Z [DEBUG] agent: systemd notify failed: error="No socket"
2021-09-09T10:02:13.078Z [INFO]  agent: Join cluster completed. Synced with initial agents: cluster=LAN num_agents=3

Thank you very much in advance!

erichaberkorn commented 2 years ago

I've been performance testing Consul on ECS Fargate over the past couple of months and have observed similar behaviors. Last week we merged a PR that solved the gossip issues I observed when Consul clients continuously join and leave. We plan to include this change in the next 1.11 patch release, and I really hope it helps resolve these issues.

pln-rec commented 2 years ago

@erichaberkorn Thank you very much, this is great news. We'll be happy to test this again once it's released.