Open nh2 opened 7 years ago
Hi @nh2, when saving of state was implemented for health checks, it was designed to fix roughly the opposite problem: a check with a very long TTL shouldn't come back up as failed and then have to wait for its next update. It should be a fairly simple addition to look at the saved timestamp, determine that the agent has been down longer than the TTL (for a short-ish TTL), and set the initial state to failed when it comes back up.
I wouldn't want to do this until https://github.com/hashicorp/consul/pull/3391 is done, since that's refactoring the local state handling where this would live.
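The restore-time logic described in this comment could look roughly like the following. This is a minimal sketch, not Consul's actual code: `PersistedCheck`, `initial_status`, and the field names are hypothetical, and it assumes the agent persists the last status together with the time it was saved. "Failed" is represented here by the `critical` status.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PersistedCheck:
    """Hypothetical snapshot of a TTL check that the agent saved to disk."""
    status: str          # "passing", "warning" or "critical"
    saved_at: datetime   # when the snapshot was written
    ttl: timedelta       # the check's configured TTL


def initial_status(saved: PersistedCheck, now: datetime) -> str:
    """Pick the status a restored TTL check should start with."""
    # If more time than the TTL has passed since the snapshot, no update
    # can have arrived in time, so the saved status is stale.
    if now - saved.saved_at > saved.ttl:
        return "critical"
    # Otherwise keep the saved status, so that checks with a very long
    # TTL do not come back up as failed.
    return saved.status
```

With a 2 second TTL and a reboot that takes longer than that, `initial_status` returns `critical`; with a one hour TTL the saved `passing` status survives the same reboot.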
consul version for both Client and Server
Client: 0.9.3
Server: 0.9.3
consul info for both Client and Server
Client:
Server:
Operating system and Environment details
NixOS
Description of the Issue (and unexpected/desired result)
The Checks page says about Time to Live (TTL) checks:
And importantly
This suggests that
And in fact this is the only reasonable way I can imagine this feature to be useful.
But the check comes back as passing!
Extract from my journalctl logs on a consul server:
From journalctl on a different machine, a consul client, a /health check is done to see if the check is passing:
Important in this long output is only the time stamp; not important but also useful is the Output field, which shows that on the server side the last successful check happened at 13:17:01, which is in line with what the journalctl of the server where that happened says. (Also note I set check_update_interval = "1ns"; just to make sure that the output is not outdated; but as I said, the output is only informational and the bug would still be clear without it.)
Summary
From the two logs above we can see that for a check with TTL = 2 seconds:
13:17:01 the check was last marked as passing
13:16:36 consul was up and working again
13:17:46 the consul server tells the client that the TTL check is passing
13:17:39 the check was marked as TTL missed
This means consul answered a TTL check query with "passing" 43 seconds after the TTL expired.
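The 43 second figure follows from the timestamps above and the 2 second TTL. A small sketch of the arithmetic (the variable names are only for illustration):

```python
from datetime import datetime, timedelta

FMT = "%H:%M:%S"
ttl = timedelta(seconds=2)

last_pass = datetime.strptime("13:17:01", FMT)  # check last marked as passing
answered = datetime.strptime("13:17:46", FMT)   # server answered "passing"

expired_at = last_pass + ttl       # the moment the TTL ran out
overdue = answered - expired_at    # how long after expiry the answer came

print(expired_at.strftime(FMT))       # 13:17:03
print(int(overdue.total_seconds()))   # 43
```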
Reproduction steps
Create a check with 2 seconds TTL, set up a daemon that starts at boot that marks the check as passing after 60 seconds. Wait for the check to go green. Reboot. Then look at consul's logs.
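A rough sketch of such a setup, assuming the agent's HTTP API is reachable at `127.0.0.1:8500` and using the agent endpoints `/v1/agent/check/register` and `/v1/agent/check/pass/<check_id>`. The check ID `ttl-repro` and the function names are made up for illustration; the original setup may well have defined the check in the agent configuration instead of registering it over HTTP.

```python
import json
import time
import urllib.request

AGENT = "http://127.0.0.1:8500"   # assumed local agent address
CHECK_ID = "ttl-repro"            # hypothetical check ID


def register_request() -> urllib.request.Request:
    """Build the one-time request that creates a check with a 2 second TTL."""
    body = json.dumps({"ID": CHECK_ID, "Name": CHECK_ID, "TTL": "2s"}).encode()
    return urllib.request.Request(
        f"{AGENT}/v1/agent/check/register",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def pass_request() -> urllib.request.Request:
    """Build the request that marks the check as passing."""
    return urllib.request.Request(
        f"{AGENT}/v1/agent/check/pass/{CHECK_ID}", method="PUT"
    )


def daemon(delay_seconds: float = 60.0) -> None:
    """Started at boot: wait, then mark the check as passing once."""
    time.sleep(delay_seconds)
    urllib.request.urlopen(pass_request())
```

Sending `register_request()` once with `urllib.request.urlopen` creates the check; `daemon()` is what would run at every boot.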
Possible solution
It seems that consul should check whether the TTL is expired when answering a query about a TTL check.
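As an illustration of that idea only (this is not how consul is implemented, and `status_at_query` and its parameters are hypothetical), the expiry test at query time could be as small as:

```python
from datetime import datetime, timedelta


def status_at_query(status: str, last_update: datetime,
                    ttl: timedelta, now: datetime) -> str:
    """Return the status to report for a TTL check when it is queried."""
    # A stored status older than the TTL is no longer trustworthy,
    # whatever value was saved.
    if now - last_update > ttl:
        return "critical"
    return status


fmt = "%H:%M:%S"
reported = status_at_query(
    "passing",
    last_update=datetime.strptime("13:17:01", fmt),
    ttl=timedelta(seconds=2),
    now=datetime.strptime("13:17:46", fmt),
)
print(reported)  # critical
```

Applied to the timestamps from the summary, the query at 13:17:46 would report `critical` instead of `passing`.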