hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io
Other
28.41k stars 4.43k forks source link

Periodically consul stops accepting connections at port 8300 for short period of time #9649

Open pySilver opened 3 years ago

pySilver commented 3 years ago

Overview of the Issue

I'm having periodic (everyday!) problem when Consul instances became unavailable for short period of time ~1-2min and I'm failed to investigate what is the cause. This downtime causes dependent services crash or restart (by Nomad) so it's kinda big deal for us.

Topology: 12 bare metal machines (with consul agents) + 3 cloud instances for consul servers.

At least once a day connection between consul nodes got lost for short period of time (its not a network issue, I'm sure!). See the logs below.

What can cause this?

Reproduction Steps

Unknown at this moment.

Consul info for both Client and Server

Client info ``` agent: check_monitors = 0 check_ttls = 0 checks = 3 services = 3 build: prerelease = revision = 6530cf37 version = 1.9.2 consul: acl = enabled known_servers = 3 server = false runtime: arch = amd64 cpu_count = 16 goroutines = 72 max_procs = 16 os = linux version = go1.15.6 serf_lan: coordinate_resets = 0 encrypted = true event_queue = 0 event_time = 12 failed = 0 health_score = 0 intent_queue = 0 left = 0 member_time = 106 members = 15 query_queue = 0 query_time = 1 ```
Server info ``` agent: check_monitors = 0 check_ttls = 1 checks = 7 services = 7 build: prerelease = revision = 6530cf37 version = 1.9.2 consul: acl = enabled bootstrap = false known_datacenters = 1 leader = false leader_addr = redacted:8300 server = true raft: applied_index = 257845 commit_index = 257845 fsm_pending = 0 last_contact = 30.443882ms last_log_index = 257845 last_log_term = 30 last_snapshot_index = 245903 last_snapshot_term = 30 latest_configuration = [{Suffrage:Voter ID:2525b2d0-e6f2-59cd-83aa-a4dcfd708667 Address:redacted:8300} {Suffrage:Voter ID:d70579fd-e1f1-5359-b6fe-17a69897fe16 Address:redacted:8300} {Suffrage:Voter ID:32b006a1-fc82-51b1-86bd-c91b93d3cd49 Address:redacted:8300}] latest_configuration_index = 0 num_peers = 2 protocol_version = 3 protocol_version_max = 3 protocol_version_min = 0 snapshot_version_max = 1 snapshot_version_min = 0 state = Follower term = 30 runtime: arch = amd64 cpu_count = 2 goroutines = 157 max_procs = 2 os = linux version = go1.15.6 serf_lan: coordinate_resets = 0 encrypted = true event_queue = 0 event_time = 12 failed = 0 health_score = 0 intent_queue = 0 left = 0 member_time = 106 members = 15 query_queue = 0 query_time = 1 serf_wan: coordinate_resets = 0 encrypted = true event_queue = 0 event_time = 1 failed = 0 health_score = 0 intent_queue = 0 left = 0 member_time = 9 members = 3 query_queue = 0 query_time = 1 ```

Operating system and Environment details

Ubuntu 20.04.1 LTS Linux devops-1 5.4.0-62-generic #70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

$ ulimit -a
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8192
-c: core file size (blocks)         0
-m: resident set size (kbytes)      unlimited
-u: processes                       65536
-n: file descriptors                65536
-l: locked-in-memory size (kbytes)  unlimited
-v: address space (kbytes)          unlimited
-x: file locks                      unlimited
-i: pending signals                 514741
-q: bytes in POSIX msg queues       819200
-e: max nice                        0
-r: max rt priority                 0
-N 15:                              unlimited

Log Fragments

Cumulative log from all nodes, latest incident lasted ~1 min:

Time    @message    hostname    ip_address  error
    Jan 27, 2021 @ 09:19:32.939 RPC failed to server    mbz-db-1    xxx.xxx.xxx.183 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:19:32.939 Request error   mbz-db-1    xxx.xxx.xxx.183 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:19:32.939 RPC failed to server    mbz-db-1    xxx.xxx.xxx.183 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:19:32.939 failed to sync remote state mbz-db-1    xxx.xxx.xxx.183 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:37.422 multiplex conn accept failed    devops-2    xxx.xxx.xxx.197 keepalive timeout
    Jan 27, 2021 @ 09:18:37.422 multiplex conn accept failed    devops-2    xxx.xxx.xxx.197 keepalive timeout
    Jan 27, 2021 @ 09:18:34.490 multiplex conn accept failed    devops-1    xxx.xxx.xxx.126 keepalive timeout
    Jan 27, 2021 @ 09:18:34.437 RPC failed to server    mbz-es-1    xxx.xxx.xxx.123 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:34.437 RPC failed to server    mbz-es-1    xxx.xxx.xxx.123 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:34.437 rpc error   mbz-es-1    xxx.xxx.xxx.123 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:34.437 Coordinate update error mbz-es-1    xxx.xxx.xxx.123 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:34.437 RPC failed to server    mbz-es-1    xxx.xxx.xxx.123 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:34.437 rpc error   mbz-es-1    xxx.xxx.xxx.123 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:34.437 RPC failed to server    mbz-es-1    xxx.xxx.xxx.123 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:34.437 RPC failed to server    mbz-es-1    xxx.xxx.xxx.123 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:34.437 Request error   mbz-es-1    xxx.xxx.xxx.123 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:34.437 RPC failed to server    mbz-es-1    xxx.xxx.xxx.123 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:34.437 Request error   mbz-es-1    xxx.xxx.xxx.123 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:34.437 RPC failed to server    mbz-es-1    xxx.xxx.xxx.123 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:32.736 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:32.736 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:32.736 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:32.736 rpc error   mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:32.736 rpc error   mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:32.736 rpc error   mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:32.736 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:32.736 Request error   mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:32.736 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:32.736 rpc error   mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:32.736 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:32.736 Request error   mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:29.266 RPC failed to server    mbz-es-3    xxx.xxx.xxx.118 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:29.266 RPC failed to server    mbz-es-3    xxx.xxx.xxx.118 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:29.266 Request error   mbz-es-3    xxx.xxx.xxx.118 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:29.266 RPC failed to server    mbz-es-3    xxx.xxx.xxx.118 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:29.266 Coordinate update error mbz-es-3    xxx.xxx.xxx.118 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:29.266 RPC failed to server    mbz-es-3    xxx.xxx.xxx.118 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:29.266 RPC failed to server    mbz-es-3    xxx.xxx.xxx.118 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:29.266 Request error   mbz-es-3    xxx.xxx.xxx.118 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:26.515 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: dial tcp xxx.xxx.xxx.246:0->xxx.xxx.xxx.126:8300: connect: connection timed out
    Jan 27, 2021 @ 09:18:26.515 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:26.515 rpc error   mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:26.515 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:26.515 Request error   mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:26.515 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:26.515 rpc error   mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:26.515 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:26.515 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:26.515 Coordinate update error mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:26.515 rpc error   mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:26.515 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:26.515 rpc error   mbz-web-1   xxx.xxx.xxx.246 rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
    Jan 27, 2021 @ 09:18:23.846 RPC failed to server    mbz-es-2    xxx.xxx.xxx.119 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:23.846 Request error   mbz-es-2    xxx.xxx.xxx.119 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:23.846 RPC failed to server    mbz-es-2    xxx.xxx.xxx.119 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:23.846 Coordinate update error mbz-es-2    xxx.xxx.xxx.119 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:23.846 RPC failed to server    mbz-es-2    xxx.xxx.xxx.119 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:23.846 Request error   mbz-es-2    xxx.xxx.xxx.119 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:23.846 RPC failed to server    mbz-es-2    xxx.xxx.xxx.119 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.857 multiplex conn accept failed    devops-3    xxx.xxx.xxx.123 keepalive timeout
    Jan 27, 2021 @ 09:18:18.857 multiplex conn accept failed    devops-3    xxx.xxx.xxx.123 keepalive timeout
    Jan 27, 2021 @ 09:18:18.213 Syncing check failed.   mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.213 failed to sync changes  mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.213 rpc error   mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.213 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.213 rpc error   mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.213 Request error   mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.212 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.212 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.212 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.212 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.212 rpc error   mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.212 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.212 rpc error   mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
    Jan 27, 2021 @ 09:18:18.212 RPC failed to server    mbz-web-1   xxx.xxx.xxx.246 rpc error making call: EOF
pySilver commented 3 years ago

In a meantime I've adjusted some settings like: limits & dns_config & set performance.raft_multiplier from 1 to 5; resulting master configs looking the following way:

{
    "acl": {
        "default_policy": "allow",
        "down_policy": "extend-cache",
        "enable_token_persistence": true,
        "enabled": true,
        "token_ttl": "30s",
        "tokens": {
            "agent": "redacted",
            "default": "redacted",
            "master": "redacted",
            "replication": "redacted"
        }
    },
    "addresses": {
        "dns": "0.0.0.0",
        "grpc": "0.0.0.0",
        "http": "0.0.0.0",
        "https": "0.0.0.0"
    },
    "advertise_addr": "redacted",
    "advertise_addr_wan": "redacted",
    "autopilot": {
        "cleanup_dead_servers": true,
        "last_contact_threshold": "200ms",
        "max_trailing_logs": 250,
        "server_stabilization_time": "10s"
    },
    "bind_addr": "redacted",
    "bootstrap": false,
    "bootstrap_expect": 3,
    "client_addr": "0.0.0.0",
    "data_dir": "/var/consul",
    "datacenter": "dc1",
    "disable_host_node_id": true,
    "disable_update_check": false,
    "dns_config": {
        "allow_stale": true,
        "cache_max_age": "5m",
        "max_stale": "87600h",
        "node_ttl": "30s",
        "service_ttl": {
            "*": "10s"
        },
        "use_cache": true
    },
    "domain": "consul",
    "enable_local_script_checks": true,
    "enable_script_checks": true,
    "encrypt": "redacted",
    "encrypt_verify_incoming": true,
    "encrypt_verify_outgoing": true,
    "limits": {
        "http_max_conns_per_client": 400,
        "rpc_max_conns_per_client": 200
    },
    "log_file": "/var/log/consul/consul.log",
    "log_json": true,
    "log_level": "INFO",
    "log_rotate_bytes": 0,
    "log_rotate_duration": "24h",
    "log_rotate_max_files": 100,
    "node_id": "redacted",
    "node_name": "devops-1",
    "performance": {
        "leave_drain_time": "5s",
        "raft_multiplier": 5,
        "rpc_hold_timeout": "7s"
    },
    "ports": {
        "dns": 8600,
        "grpc": -1,
        "http": 8500,
        "https": -1,
        "serf_lan": 8301,
        "serf_wan": 8302,
        "server": 8300
    },
    "primary_datacenter": "dc1",
    "raft_protocol": 3,
    "retry_interval": "30s",
    "retry_interval_wan": "30s",
    "retry_join": [
        "redacted",
        "redacted",
        "redacted"
    ],
    "retry_max": 0,
    "retry_max_wan": 0,
    "server": true,
    "translate_wan_addrs": false,
    "ui": true
}

This morning dependent services did not die, but consul logs shows few problems:

devops-1: aborting pipeline replication
devops-1: pipelining replication
devops-2: aborting pipeline replication
devops-3: pipelining replication
failed to flush response (write tcp хххх.197:8300-> хххх.126:40433: write: broken pipe)

it all happened at the time when master nodes had higher than typical load: ~5 and higher iowait ~20