Periodically consul stops accepting connections at port 8300 for short period of time

Overview of the Issue

I'm having periodic (everyday!) problem when Consul instances became unavailable for short period of time ~1-2min and I'm failed to investigate what is the cause. This downtime causes dependent services crash or restart (by Nomad) so it's kinda big deal for us.

Topology: 12 bare metal machines (with consul agents) + 3 cloud instances for consul servers.

At least once a day connection between consul nodes got lost for short period of time (its not a network issue, I'm sure!). See the logs below.

What can cause this?

Reproduction Steps

Unknown at this moment.

Consul info for both Client and Server

Client info

``` agent: check_monitors = 0 check_ttls = 0 checks = 3 services = 3 build: prerelease = revision = 6530cf37 version = 1.9.2 consul: acl = enabled known_servers = 3 server = false runtime: arch = amd64 cpu_count = 16 goroutines = 72 max_procs = 16 os = linux version = go1.15.6 serf_lan: coordinate_resets = 0 encrypted = true event_queue = 0 event_time = 12 failed = 0 health_score = 0 intent_queue = 0 left = 0 member_time = 106 members = 15 query_queue = 0 query_time = 1 ```

Server info

``` agent: check_monitors = 0 check_ttls = 1 checks = 7 services = 7 build: prerelease = revision = 6530cf37 version = 1.9.2 consul: acl = enabled bootstrap = false known_datacenters = 1 leader = false leader_addr = redacted:8300 server = true raft: applied_index = 257845 commit_index = 257845 fsm_pending = 0 last_contact = 30.443882ms last_log_index = 257845 last_log_term = 30 last_snapshot_index = 245903 last_snapshot_term = 30 latest_configuration = [{Suffrage:Voter ID:2525b2d0-e6f2-59cd-83aa-a4dcfd708667 Address:redacted:8300} {Suffrage:Voter ID:d70579fd-e1f1-5359-b6fe-17a69897fe16 Address:redacted:8300} {Suffrage:Voter ID:32b006a1-fc82-51b1-86bd-c91b93d3cd49 Address:redacted:8300}] latest_configuration_index = 0 num_peers = 2 protocol_version = 3 protocol_version_max = 3 protocol_version_min = 0 snapshot_version_max = 1 snapshot_version_min = 0 state = Follower term = 30 runtime: arch = amd64 cpu_count = 2 goroutines = 157 max_procs = 2 os = linux version = go1.15.6 serf_lan: coordinate_resets = 0 encrypted = true event_queue = 0 event_time = 12 failed = 0 health_score = 0 intent_queue = 0 left = 0 member_time = 106 members = 15 query_queue = 0 query_time = 1 serf_wan: coordinate_resets = 0 encrypted = true event_queue = 0 event_time = 1 failed = 0 health_score = 0 intent_queue = 0 left = 0 member_time = 9 members = 3 query_queue = 0 query_time = 1 ```

Operating system and Environment details

Ubuntu 20.04.1 LTS Linux devops-1 5.4.0-62-generic #70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

$ ulimit -a
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8192
-c: core file size (blocks)         0
-m: resident set size (kbytes)      unlimited
-u: processes                       65536
-n: file descriptors                65536
-l: locked-in-memory size (kbytes)  unlimited
-v: address space (kbytes)          unlimited
-x: file locks                      unlimited
-i: pending signals                 514741
-q: bytes in POSIX msg queues       819200
-e: max nice                        0
-r: max rt priority                 0
-N 15:                              unlimited

Log Fragments

Cumulative log from all nodes, latest incident lasted ~1 min:

Time    @message    hostname    ip_address  error
    Jan 27, 2021 @ 09:19:32.939 RPC failed to server    mbz-db-1    xxx.xxx.xxx.183 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:19:32.939 Request error   mbz-db-1    xxx.xxx.xxx.183 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:19:32.939 RPC failed to server    mbz-db-1    xxx.xxx.xxx.183 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:19:32.939 failed to sync remote state mbz-db-1    xxx.xxx.xxx.183 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:37.422 multiplex conn accept failed    devops-2    xxx.xxx.xxx.197 keepalive timeout
    Jan 27, 2021 @ 09:18:37.422 multiplex conn accept failed    devops-2    xxx.xxx.xxx.197 keepalive timeout
    Jan 27, 2021 @ 09:18:34.490 multiplex conn accept failed    devops-1    xxx.xxx.xxx.126 keepalive timeout
    Jan 27, 2021 @ 09:18:34.437 RPC failed to server    mbz-es-1    xxx.xxx.xxx.123 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:34.437 RPC failed to server    mbz-es-1    xxx.xxx.xxx.123 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:34.437 rpc error   mbz-es-1    xxx.xxx.xxx.123 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:34.437 Coordinate update error mbz-es-1    xxx.xxx.xxx.123 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:34.437 RPC failed to server    mbz-es-1    xxx.xxx.xxx.123 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:34.437 rpc error   mbz-es-1    xxx.xxx.xxx.123 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:34.437 RPC failed to server    mbz-es-1    xxx.xxx.xxx.123 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:34.437 RPC failed to server    mbz-es-1    xxx.xxx.xxx.123 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:34.437 Request error   mbz-es-1    xxx.xxx.xxx.123 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:34.437 RPC failed to server    mbz-es-1    xxx.xxx.xxx.123 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:34.437 Request error   mbz-es-1    xxx.xxx.xxx.123 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:34.437 RPC failed to server    mbz-es-1    xxx.xxx.xxx.123 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:32.736 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:32.736 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:32.736 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:32.736 rpc error   mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:32.736 rpc error   mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:32.736 rpc error   mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:32.736 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:32.736 Request error   mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:32.736 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:32.736 rpc error   mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:32.736 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:32.736 Request error   mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:29.266 RPC failed to server    mbz-es-3    xxx.xxx.xxx.118 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:29.266 RPC failed to server    mbz-es-3    xxx.xxx.xxx.118 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:29.266 Request error   mbz-es-3    xxx.xxx.xxx.118 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:29.266 RPC failed to server    mbz-es-3    xxx.xxx.xxx.118 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:29.266 Coordinate update error mbz-es-3    xxx.xxx.xxx.118 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:29.266 RPC failed to server    mbz-es-3    xxx.xxx.xxx.118 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:29.266 RPC failed to server    mbz-es-3    xxx.xxx.xxx.118 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:29.266 Request error   mbz-es-3    xxx.xxx.xxx.118 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:26.515 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: dial tcp xxx.xxx.xxx.246:0->xxx.xxx.xxx.126:8300: connect: connection timed out
    Jan 27, 2021 @ 09:18:26.515 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:26.515 rpc error   mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:26.515 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:26.515 Request error   mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:26.515 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:26.515 rpc error   mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:26.515 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:26.515 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:26.515 Coordinate update error mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:26.515 rpc error   mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:26.515 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:26.515 rpc error   mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:23.846 RPC failed to server    mbz-es-2    xxx.xxx.xxx.119 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:23.846 Request error   mbz-es-2    xxx.xxx.xxx.119 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:23.846 RPC failed to server    mbz-es-2    xxx.xxx.xxx.119 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:23.846 Coordinate update error mbz-es-2    xxx.xxx.xxx.119 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:23.846 RPC failed to server    mbz-es-2    xxx.xxx.xxx.119 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:23.846 Request error   mbz-es-2    xxx.xxx.xxx.119 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:23.846 RPC failed to server    mbz-es-2    xxx.xxx.xxx.119 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.857 multiplex conn accept failed    devops-3    xxx.xxx.xxx.123 keepalive timeout
    Jan 27, 2021 @ 09:18:18.857 multiplex conn accept failed    devops-3    xxx.xxx.xxx.123 keepalive timeout
    Jan 27, 2021 @ 09:18:18.213 Syncing check failed.   mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.213 failed to sync changes  mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.213 rpc error   mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.213 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.213 rpc error   mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.213 Request error   mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.212 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.212 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.212 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.212 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.212 rpc error   mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.212 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.212 rpc error   mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.212 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF

hashicorp / consul