hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Force restart of consul during startup results in zero byte node-id file #3489

Open mtimm opened 7 years ago

mtimm commented 7 years ago

consul version for both Client and Server

Client: 0.7.5 Server: 0.7.5

consul info for both Client and Server

Client:

agent:
    check_monitors = 2
    check_ttls = 0
    checks = 4
    services = 3
build:
    prerelease =
    revision = '21f2d5a
    version = 0.7.5
consul:
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 2
    goroutines = 43
    max_procs = 2
    os = linux
    version = go1.7.5
serf_lan:
    encrypted = false
    event_queue = 0
    event_time = 8
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 13068
    members = 62
    query_queue = 0
    query_time = 1

Server:

agent:
    check_monitors = 1
    check_ttls = 1
    checks = 5
    services = 6
build:
    prerelease =
    revision = '21f2d5a
    version = 0.7.5
consul:
    bootstrap = false
    known_datacenters = 1
    leader = true
    leader_addr = 1.1.1.252:8300
    server = true
raft:
    applied_index = 9365
    commit_index = 9365
    fsm_pending = 0
    last_contact = never
    last_log_index = 9365
    last_log_term = 2
    last_snapshot_index = 8194
    last_snapshot_term = 2
    latest_configuration = [{Suffrage:Voter ID:1.1.1.252:8300 Address:1.1.1.252:8300} {Suffrage:Voter ID:1.1.1.253:8300 Address:1.1.1.253:8300} {Suffrage:Voter ID:1.1.1.254:8300 Address:1.1.1.254:8300}]
    latest_configuration_index = 1
    num_peers = 2
    protocol_version = 1
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 2
runtime:
    arch = amd64
    cpu_count = 4
    goroutines = 286
    max_procs = 4
    os = linux
    version = go1.7.5
serf_lan:
    encrypted = false
    event_queue = 0
    event_time = 8
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 13068
    members = 62
    query_queue = 0
    query_time = 1
serf_wan:
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 1
    members = 1
    query_queue = 0
    query_time = 1

Operating system and Environment details

Server: CentOS 7.2. Clients: CentOS 7.2, CentOS 6.5, Ubuntu 14.

Description of the Issue (and unexpected/desired result)

From time to time consul fails to start with:

2017-09-21_17:56:34.26259 ==> Starting Consul agent...
2017-09-21_17:56:34.26373 ==> Error starting agent: Failed to setup node ID: uuid string is wrong length

Looking at the node-id file, it is zero bytes. When we remove the node-id file, consul starts correctly.

Expected behavior: when consul starts, it should create the node-id file if the file doesn't exist or is empty, instead of failing on an empty file.

Reproduction steps

This appears to occur when consul is force-restarted while it is still starting up. The failure is fairly rare and is likely a consequence of how we deploy consul (we generate the config files from templates and then force-restart consul).

Log Fragments

We run consul under runit, so most of this output is from the run script. Looking at some other deploy logs, I see the same pattern: runit force-restarts consul, after which consul fails to start.

2017-09-21_17:13:35.52594 + '[' -f /etc/sysconfig/consul ']'
2017-09-21_17:13:35.52599 + . /etc/sysconfig/consul
2017-09-21_17:13:35.52604 ++ INTERFACE=eth0
2017-09-21_17:13:35.52611 + '[' -f /etc/default/consul ']'
2017-09-21_17:13:35.52617 + CONSUL_FLAGS=
2017-09-21_17:13:35.52622 + INTERFACE=eth0
2017-09-21_17:13:35.52632 + for i in '$INTERFACE'
2017-09-21_17:13:35.52665 + ip -f inet addr show dev eth0 primary
2017-09-21_17:13:35.53182 + grep -Pq 'inet [0-9.]+'
2017-09-21_17:13:35.53280 ++ grep -Po 'inet \K[0-9.]+'
2017-09-21_17:13:35.53345 ++ ip -f inet addr show dev eth0 primary
2017-09-21_17:13:35.53421 ++ head -n 1
2017-09-21_17:13:35.53478 ++ grep -v /32
2017-09-21_17:13:35.53584 + BIND=1.1.1.26
2017-09-21_17:13:35.53585 + break
2017-09-21_17:13:35.53585 + '[' -z 1.1.1.26 ']'
2017-09-21_17:13:35.53999 ++ nproc
2017-09-21_17:13:35.73817 + export GOMAXPROCS=2
2017-09-21_17:13:35.73818 + GOMAXPROCS=2
2017-09-21_17:13:35.73819 + ulimit -n 65535
2017-09-21_17:13:35.73819 + exec nice -n -10 ionice -c 1 /usr/bin/consul agent -config-dir=/etc/tetration/consul.d/configs -bind=1.1.1.26 -advertise=1.1.1.26
2017-09-21_17:13:35.73819 ==> Starting Consul agent...
2017-09-21_17:13:45.74136 + '[' -f /etc/sysconfig/consul ']'
2017-09-21_17:13:45.74140 + . /etc/sysconfig/consul
2017-09-21_17:13:45.74144 ++ INTERFACE=eth0
2017-09-21_17:13:45.74148 + '[' -f /etc/default/consul ']'
2017-09-21_17:13:45.74152 + CONSUL_FLAGS=
2017-09-21_17:13:45.74155 + INTERFACE=eth0
2017-09-21_17:13:45.74162 + for i in '$INTERFACE'
2017-09-21_17:13:45.74182 + ip -f inet addr show dev eth0 primary
2017-09-21_17:13:45.74194 + grep -Pq 'inet [0-9.]+'
2017-09-21_17:13:45.74296 ++ ip -f inet addr show dev eth0 primary
2017-09-21_17:13:45.74335 ++ head -n 1
2017-09-21_17:13:45.74380 ++ grep -Po 'inet \K[0-9.]+'
2017-09-21_17:13:45.74392 ++ grep -v /32
2017-09-21_17:13:45.74489 + BIND=1.1.1.26
2017-09-21_17:13:45.74491 + break
2017-09-21_17:13:45.74492 + '[' -z 1.1.1.26 ']'
2017-09-21_17:13:45.74504 ++ nproc
2017-09-21_17:13:45.74557 + export GOMAXPROCS=2
2017-09-21_17:13:45.74558 + GOMAXPROCS=2
2017-09-21_17:13:45.74558 + ulimit -n 65535
2017-09-21_17:13:45.74559 + exec nice -n -10 ionice -c 1 /usr/bin/consul agent -config-dir=/etc/tetration/consul.d/configs -bind=1.1.1.26 -advertise=1.1.1.26
2017-09-21_17:13:45.76214 ==> Starting Consul agent...
2017-09-21_17:13:45.76303 ==> Error starting agent: Failed to setup node ID: uuid string is wrong length

slackpad commented 6 years ago

Seems like we need to swap that file after it has been written, similar to what we do with other files generated by Consul (seems like it's worth a helper lib).