In the meantime I've adjusted some settings: limits, dns_config, and performance.raft_multiplier (from 1 to 5). The resulting config on the master (server) nodes looks like this:
```
{
  "acl": {
    "default_policy": "allow",
    "down_policy": "extend-cache",
    "enable_token_persistence": true,
    "enabled": true,
    "token_ttl": "30s",
    "tokens": {
      "agent": "redacted",
      "default": "redacted",
      "master": "redacted",
      "replication": "redacted"
    }
  },
  "addresses": {
    "dns": "0.0.0.0",
    "grpc": "0.0.0.0",
    "http": "0.0.0.0",
    "https": "0.0.0.0"
  },
  "advertise_addr": "redacted",
  "advertise_addr_wan": "redacted",
  "autopilot": {
    "cleanup_dead_servers": true,
    "last_contact_threshold": "200ms",
    "max_trailing_logs": 250,
    "server_stabilization_time": "10s"
  },
  "bind_addr": "redacted",
  "bootstrap": false,
  "bootstrap_expect": 3,
  "client_addr": "0.0.0.0",
  "data_dir": "/var/consul",
  "datacenter": "dc1",
  "disable_host_node_id": true,
  "disable_update_check": false,
  "dns_config": {
    "allow_stale": true,
    "cache_max_age": "5m",
    "max_stale": "87600h",
    "node_ttl": "30s",
    "service_ttl": {
      "*": "10s"
    },
    "use_cache": true
  },
  "domain": "consul",
  "enable_local_script_checks": true,
  "enable_script_checks": true,
  "encrypt": "redacted",
  "encrypt_verify_incoming": true,
  "encrypt_verify_outgoing": true,
  "limits": {
    "http_max_conns_per_client": 400,
    "rpc_max_conns_per_client": 200
  },
  "log_file": "/var/log/consul/consul.log",
  "log_json": true,
  "log_level": "INFO",
  "log_rotate_bytes": 0,
  "log_rotate_duration": "24h",
  "log_rotate_max_files": 100,
  "node_id": "redacted",
  "node_name": "devops-1",
  "performance": {
    "leave_drain_time": "5s",
    "raft_multiplier": 5,
    "rpc_hold_timeout": "7s"
  },
  "ports": {
    "dns": 8600,
    "grpc": -1,
    "http": 8500,
    "https": -1,
    "serf_lan": 8301,
    "serf_wan": 8302,
    "server": 8300
  },
  "primary_datacenter": "dc1",
  "raft_protocol": 3,
  "retry_interval": "30s",
  "retry_interval_wan": "30s",
  "retry_join": [
    "redacted",
    "redacted",
    "redacted"
  ],
  "retry_max": 0,
  "retry_max_wan": 0,
  "server": true,
  "translate_wan_addrs": false,
  "ui": true
}
```
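For context on why I went from 1 to 5: as far as I understand (this is my reading of the docs, not something I've verified in the source), raft_multiplier just scales the base Raft timeouts, so a multiplier of 5 should tolerate roughly 5x longer pauses before a follower calls an election. A minimal sketch of that arithmetic, assuming the stock hashicorp/raft base values of 1000ms heartbeat, 1000ms election, 500ms leader lease (those base numbers are my assumption):

```python
# Rough sketch (my understanding, not authoritative): Consul multiplies the
# base Raft timings by performance.raft_multiplier, so 1 -> 5 mainly makes
# leader elections less trigger-happy when the servers are slow or busy.
BASE_MS = {
    "heartbeat_timeout": 1000,    # assumed hashicorp/raft default
    "election_timeout": 1000,     # assumed hashicorp/raft default
    "leader_lease_timeout": 500,  # assumed hashicorp/raft default
}

def effective_timeouts(raft_multiplier: int) -> dict:
    """Approximate Raft timings for a given raft_multiplier."""
    return {name: ms * raft_multiplier for name, ms in BASE_MS.items()}

for m in (1, 5):
    print(m, effective_timeouts(m))
# raft_multiplier=1 -> ~1000ms heartbeat/election; raft_multiplier=5 -> ~5000ms
```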
This morning the dependent services did not die, but the Consul logs show a few problems:
```
devops-1: aborting pipeline replication
devops-1: pipelining replication
devops-2: aborting pipeline replication
devops-3: pipelining replication
failed to flush response (write tcp хххх.197:8300-> хххх.126:40433: write: broken pipe)
```
It all happened while the master nodes were under higher-than-typical load: load average ~5 and up, iowait ~20.
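Since log_json is enabled and everything goes to /var/log/consul/consul.log (see the config above), I've been lining up the incident window against election/replication messages roughly like this. A minimal sketch, assuming Consul's hclog JSON fields (@timestamp, @module, @message); the keyword list is just what I happen to grep for, not an exhaustive set:

```python
import json

# Minimal sketch: pull Raft/replication-related lines out of Consul's JSON log
# so they can be compared against the load/iowait spikes. The field names
# (@timestamp, @module, @message) are what hclog's JSON output uses as far as
# I can tell; adjust if your log lines look different.
KEYWORDS = ("election", "heartbeat", "pipelining replication",
            "aborting pipeline replication", "broken pipe")

def raft_events(path="/var/log/consul/consul.log"):
    with open(path) as f:
        for line in f:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON lines (e.g. from before log_json was on)
            if not isinstance(rec, dict):
                continue
            msg = rec.get("@message", "")
            if any(k in msg for k in KEYWORDS):
                yield rec.get("@timestamp"), rec.get("@module"), msg

if __name__ == "__main__":
    for ts, module, msg in raft_events():
        print(ts, module, msg, sep="\t")
```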
#### Overview of the Issue
I'm having a recurring (daily!) problem where Consul instances become unavailable for a short period of time (~1-2 min), and I have failed to work out the cause. This downtime causes dependent services to crash or be restarted (by Nomad), so it's a pretty big deal for us.
Topology: 12 bare-metal machines (running Consul agents) + 3 cloud instances for the Consul servers.
At least once a day the connection between Consul nodes is lost for a short period of time (it's not a network issue, I'm sure!). See the logs below.
What can cause this?
#### Reproduction Steps
Unknown at this moment.
#### Consul info for both Client and Server

Client info

```
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 3
	services = 3
build:
	prerelease =
	revision = 6530cf37
	version = 1.9.2
consul:
	acl = enabled
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 16
	goroutines = 72
	max_procs = 16
	os = linux
	version = go1.15.6
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 12
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 106
	members = 15
	query_queue = 0
	query_time = 1
```

Server info

```
agent:
	check_monitors = 0
	check_ttls = 1
	checks = 7
	services = 7
build:
	prerelease =
	revision = 6530cf37
	version = 1.9.2
consul:
	acl = enabled
	bootstrap = false
	known_datacenters = 1
	leader = false
	leader_addr = redacted:8300
	server = true
raft:
	applied_index = 257845
	commit_index = 257845
	fsm_pending = 0
	last_contact = 30.443882ms
	last_log_index = 257845
	last_log_term = 30
	last_snapshot_index = 245903
	last_snapshot_term = 30
	latest_configuration = [{Suffrage:Voter ID:2525b2d0-e6f2-59cd-83aa-a4dcfd708667 Address:redacted:8300} {Suffrage:Voter ID:d70579fd-e1f1-5359-b6fe-17a69897fe16 Address:redacted:8300} {Suffrage:Voter ID:32b006a1-fc82-51b1-86bd-c91b93d3cd49 Address:redacted:8300}]
	latest_configuration_index = 0
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 30
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 157
	max_procs = 2
	os = linux
	version = go1.15.6
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 12
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 106
	members = 15
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 9
	members = 3
	query_queue = 0
	query_time = 1
```

#### Operating system and Environment details
Ubuntu 20.04.1 LTS Linux devops-1 5.4.0-62-generic #70-Ubuntu SMP Tue Jan 12 12:45:47 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
#### Log Fragments
Cumulative log from all nodes; the latest incident lasted ~1 min: