hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

TTL checks don't fulfill their guarantees when consul is restarted #3569

Open nh2 opened 7 years ago

nh2 commented 7 years ago

consul version for both Client and Server

Client: 0.9.3
Server: 0.9.3

consul info for both Client and Server

Client:

agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease = 
    revision = 
    version = 0.9.3
consul:
    known_servers = 3
    server = false
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 40
    max_procs = 1
    os = linux
    version = go1.9
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 12
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 40
    members = 5
    query_queue = 0
    query_time = 1

Server:

agent:
    check_monitors = 0
    check_ttls = 1
    checks = 1
    services = 1
build:
    prerelease = 
    revision = 
    version = 0.9.3
consul:
    bootstrap = false
    known_datacenters = 1
    leader = true
    leader_addr = 10.0.0.1:8300
    server = true
raft:
    applied_index = 76153
    commit_index = 76153
    fsm_pending = 0
    last_contact = 0
    last_log_index = 76153
    last_log_term = 65
    last_snapshot_index = 74139
    last_snapshot_term = 65
    latest_configuration = [{Suffrage:Voter ID:10.0.0.2:8300 Address:10.0.0.2:8300} {Suffrage:Voter ID:10.0.0.3:8300 Address:10.0.0.3:8300} {Suffrage:Voter ID:10.0.0.1:8300 Address:10.0.0.1:8300}]
    latest_configuration_index = 1
    num_peers = 2
    protocol_version = 2
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Leader
    term = 65
runtime:
    arch = amd64
    cpu_count = 1
    goroutines = 95
    max_procs = 1
    os = linux
    version = go1.9
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 12
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 40
    members = 5
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 22
    members = 3
    query_queue = 0
    query_time = 1

Operating system and Environment details

NixOS

Description of the Issue (and unexpected/desired result)

The Checks documentation page says the following about Time to Live (TTL) checks:

TTL checks also persist their last known status to disk. This allows the Consul agent to restore the last known status of the check across restarts.

And, importantly:

Persisted check status is valid through the end of the TTL from the time of the last check.

This suggests that if consul restarts and comes back up before the TTL has expired, the check should be restored as passing, and that once the TTL has expired, the persisted status should no longer be treated as valid, i.e. the check should turn critical.

And in fact this is the only reasonable way I can imagine this feature being useful.
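In other words, the documented guarantee can be read as: a persisted status is trustworthy only until lastUpdate + TTL. A minimal sketch of that rule (my own illustration, not Consul's code), using the timestamps from the logs below:

```go
package main

import (
	"fmt"
	"time"
)

// persistedStatusValidAt reports whether a persisted TTL check status
// may still be trusted at time `now`, per the documented guarantee:
// "valid through the end of the TTL from the time of the last check".
func persistedStatusValidAt(lastUpdate time.Time, ttl time.Duration, now time.Time) bool {
	return !now.After(lastUpdate.Add(ttl))
}

func main() {
	lastUpdate := time.Date(2017, 10, 11, 13, 17, 1, 0, time.UTC) // last successful check
	ttl := 2 * time.Second

	// Shortly after the reboot the TTL has long expired, so the
	// persisted "passing" status should no longer be served.
	queryTime := time.Date(2017, 10, 11, 13, 17, 46, 0, time.UTC)
	fmt.Println(persistedStatusValidAt(lastUpdate, ttl, queryTime)) // false
}
```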

But after a restart that takes longer than the TTL, the check comes back as passing!

Extract from my journalctl logs on a consul server:

Oct 11 13:17:00.984571 gluster-cluster-1 glusterVolumeRunningCheckWatchdog-start[866]: Online               : Y
Oct 11 13:17:01.027239 gluster-cluster-1 consul[753]:     2017/10/11 13:17:01 [INFO] agent: Synced check 'service:gluster-volume-distfs'
# Reboot is issued
Oct 11 13:17:01.718544 gluster-cluster-1 systemd[1]: Received SIGINT.
Oct 11 13:17:01.797869 gluster-cluster-1 consul[753]:     2017/10/11 13:17:01 [INFO] Caught signal:  terminated
Oct 11 13:17:01.797869 gluster-cluster-1 consul[753]:     2017/10/11 13:17:01 [INFO] Graceful shutdown disabled. Exiting
Oct 11 13:17:01.797869 gluster-cluster-1 consul[753]:     2017/10/11 13:17:01 [INFO] agent: Requesting shutdown
Oct 11 13:17:01.797869 gluster-cluster-1 consul[753]:     2017/10/11 13:17:01 [INFO] consul: shutting down server
Oct 11 13:17:01.797869 gluster-cluster-1 consul[753]:     2017/10/11 13:17:01 [WARN] serf: Shutdown without a Leave
Oct 11 13:17:01.797869 gluster-cluster-1 consul[753]:     2017/10/11 13:17:01 [INFO] manager: shutting down
Oct 11 13:17:01.797869 gluster-cluster-1 consul[753]:     2017/10/11 13:17:01 [INFO] raft: aborting pipeline replication to peer {Voter 10.0.0.3:8300 10.0.0.3:8300}
Oct 11 13:17:01.797869 gluster-cluster-1 consul[753]:     2017/10/11 13:17:01 [INFO] raft: aborting pipeline replication to peer {Voter 10.0.0.2:8300 10.0.0.2:8300}
Oct 11 13:17:01.797869 gluster-cluster-1 consul[753]:     2017/10/11 13:17:01 [INFO] agent: consul server down
Oct 11 13:17:01.797869 gluster-cluster-1 consul[753]:     2017/10/11 13:17:01 [INFO] agent: shutdown complete
Oct 11 13:17:01.797869 gluster-cluster-1 consul[753]:     2017/10/11 13:17:01 [INFO] agent: Stopping DNS server 127.0.0.1:8600 (tcp)
Oct 11 13:17:01.797869 gluster-cluster-1 consul[753]:     2017/10/11 13:17:01 [INFO] agent: Stopping DNS server 127.0.0.1:8600 (udp)
Oct 11 13:17:01.797869 gluster-cluster-1 consul[753]:     2017/10/11 13:17:01 [INFO] agent: Stopping HTTP server 127.0.0.1:8500
Oct 11 13:17:01.797869 gluster-cluster-1 consul[753]:     2017/10/11 13:17:01 [INFO] agent: Waiting for endpoints to shut down
Oct 11 13:17:01.797869 gluster-cluster-1 consul[753]:     2017/10/11 13:17:01 [INFO] agent: Endpoints down
Oct 11 13:17:01.797869 gluster-cluster-1 consul[753]:     2017/10/11 13:17:01 [INFO] Exit code:  1
Oct 11 13:17:01.799358 gluster-cluster-1 systemd[1]: Stopping consul.service...
...
Oct 11 13:17:07.127014 gluster-cluster-1 systemd-journald[465]: Journal stopped
-- Reboot --
Oct 11 13:17:36.257998 gluster-cluster-1 kernel: Linux version 4.9.52 (nixbld@localhost) (gcc version 6.4.0 (GCC) ) #1-NixOS SMP Wed Sep 27 12:40:05 UTC 2017
...
# consul boots up successfully
...
Oct 11 13:16:36.148334 gluster-cluster-1 consul[753]:     2017/10/11 13:16:36 [INFO] consul: cluster leadership acquired
...
Oct 11 13:17:39.917474 gluster-cluster-1 consul[754]:     2017/10/11 13:17:39 [WARN] agent: Check 'service:gluster-volume-distfs' missed TTL, is now critical

From journalctl on a different machine, a consul client, a /health query is made to see whether the check is passing:

Oct 11 13:17:46 gluster-client-1 glusterReadyForClientMount-pre-start[859]: [{'Node': {'ID': '945c37d3-2c94-ed91-2999-3f54b82b30cc', 'Node': 'gluster-cluster-1', 'Address': '10.0.0.1', 'Datacenter': 'dc1', 'TaggedAddresses': {'lan': '10.0.0.1', 'wan': '10.0.0.1'}, 'Meta': {'consul-network-segment': ''}, 'CreateIndex': 57669, 'ModifyIndex': 59501}, 'Service': {'ID': 'gluster-volume-distfs', 'Service': 'gluster-volume-distfs', 'Tags': [], 'Address': '10.0.0.1', 'Port': 0, 'EnableTagOverride': False, 'CreateIndex': 57669, 'ModifyIndex': 57669}, 'Checks': [{'Node': 'gluster-cluster-1', 'CheckID': 'serfHealth', 'Name': 'Serf Health Status', 'Status': 'passing', 'Notes': '', 'Output': 'Agent alive and reachable', 'ServiceID': '', 'ServiceName': '', 'ServiceTags': [], 'CreateIndex': 57669, 'ModifyIndex': 57669}, {'Node': 'gluster-cluster-1', 'CheckID': 'service:gluster-volume-distfs', 'Name': "Service 'gluster-volume-distfs' check", 'Status': 'passing', 'Notes': '', 'Output': 'Command exited with exit code 0 on 2017-10-11 13:17:01.022336:\nset -euo pipefail; /nix/store/0c5ha9n08hw57d612gp7gilg0hh8rqzj-glusterfs-3.10.2/bin/gluster volume status distfs 10.0.0.1:/data/glusterfs/distfs/brick1/brick detail | grep "^Online.*Y"', 'ServiceID': 'gluster-volume-distfs', 'ServiceName': 'gluster-volume-distfs', 'ServiceTags': [], 'CreateIndex': 57669, 'ModifyIndex': 59855}]}]

The only important thing in this long output is the timestamp. Less important, but also useful, is the Output field: it shows that the last successful check on the server side happened at 13:17:01, which is in line with what the server's journalctl above says. (Also note that I set check_update_interval = "1ns" just to make sure the output is not outdated; but as I said, the output is only informational and the bug would be clear without it.)
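For reference, the query above corresponds to Consul's /v1/health/service endpoint. A minimal sketch of the same query with the official Go API client (github.com/hashicorp/consul/api), assuming a local agent at the default address and the service name from the logs:

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// passingOnly=false returns all instances regardless of check
	// status, so we can inspect the reported status ourselves.
	entries, _, err := client.Health().Service("gluster-volume-distfs", "", false, nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, entry := range entries {
		for _, check := range entry.Checks {
			fmt.Printf("%s on %s: %s\n", check.CheckID, entry.Node.Node, check.Status)
		}
	}
}
```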


Summary

From the two logs above we can see that, for a check with TTL = 2 seconds:

- the last successful TTL update for service:gluster-volume-distfs was at 13:17:01, immediately before the reboot,
- at 13:17:46, long after the 2-second TTL had expired, the /health query on the client still returned 'Status': 'passing' for that check.

This means consul answered a TTL check query with "passing" 43 seconds after the TTL expired.

Reproduction steps

1. Create a check with a 2-second TTL.
2. Set up a daemon that starts at boot and marks the check as passing after 60 seconds.
3. Wait for the check to go green.
4. Reboot.
5. Look at consul's logs.
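A sketch of steps 1 and 2 with the Go API client (the service name ttl-demo is made up for illustration; the reboot and log inspection stay manual):

```go
package main

import (
	"log"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	agent := client.Agent()

	// Step 1: register a service with a 2-second TTL check.
	err = agent.ServiceRegister(&api.AgentServiceRegistration{
		Name:  "ttl-demo",
		Check: &api.AgentServiceCheck{TTL: "2s"},
	})
	if err != nil {
		log.Fatal(err)
	}

	// Step 2: pretend to be a daemon that only becomes healthy
	// 60 seconds after boot, then keeps the check green.
	time.Sleep(60 * time.Second)
	for {
		if err := agent.PassTTL("service:ttl-demo", "still alive"); err != nil {
			log.Print(err)
		}
		time.Sleep(time.Second)
	}
}
```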

Possible solution

It seems that consul should check whether the TTL has expired when answering a query about a TTL check.

slackpad commented 7 years ago

Hi @nh2, when state saving was implemented for health checks, it was originally designed to fix kind of the opposite problem: a check with a very long TTL shouldn't come back up failed after a restart and have to wait out the TTL again. It should be a pretty simple addition to look at the persisted timestamp, determine that the agent has been down longer than the TTL (for a short-ish TTL), and know to set the initial state to failed when the agent comes back up.
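A hypothetical sketch of that addition (types and names invented for illustration; the real fix would live in the agent's local state handling):

```go
package main

import (
	"fmt"
	"time"
)

// persistedCheck stands in for whatever the agent writes to disk for a
// TTL check: its last status and when that status was last updated.
type persistedCheck struct {
	CheckID string
	Status  string
	Updated time.Time
	TTL     time.Duration
}

// restoredStatus decides the initial state of a restored check: if the
// agent was down longer than the TTL, start out critical instead of
// blindly restoring the stale "passing" status.
func restoredStatus(p persistedCheck, now time.Time) string {
	if now.Sub(p.Updated) > p.TTL {
		return "critical"
	}
	return p.Status
}

func main() {
	p := persistedCheck{
		CheckID: "service:gluster-volume-distfs",
		Status:  "passing",
		Updated: time.Date(2017, 10, 11, 13, 17, 1, 0, time.UTC),
		TTL:     2 * time.Second,
	}
	now := time.Date(2017, 10, 11, 13, 17, 40, 0, time.UTC) // agent back after reboot
	fmt.Println(restoredStatus(p, now))                      // "critical"
}
```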

I wouldn't want to do this until https://github.com/hashicorp/consul/pull/3391 is done, since that's refactoring the local state handling where this would live.