Run "critical" health checks immediately when node is restarted/reloaded

hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.

https://www.consul.io

Other

28.41k stars 4.43k forks source link

Run "critical" health checks immediately when node is restarted/reloaded #954

Open onnimonni opened 9 years ago

onnimonni commented 9 years ago

When I start one node of the cluster all of the checks start at the state critical. Some checks have really long intervals ( 30minutes ) and they will be critical until the first check. Is this by design?

Do you know if there are solutions to run tests more often when they're failing aka in 'critical' state?

armon commented 9 years ago

@onnimonni We never considered the case of such long intervals. Most of the checks are definited on fairly short intervals. We purposely stagger start the checks to avoid a thundering heard, but in this case the stagger may be spread over a 30 minute period which is rather long. Tagging as bug, so we can add a cap to the stagger period.

onnimonni commented 9 years ago

I'm running rather intensive security scanning check for wordpress using consul and it's ok if node is under a pressure during the start for this, but after that it's ok to run it only seldomly

Thanks for considering it :)

On Tue, May 19, 2015 at 1:43 AM, Armon Dadgar notifications@github.com wrote:

@onnimonni We never considered the case of such long intervals. Most of the checks are definited on fairly short intervals. We purposely stagger start the checks to avoid a thundering heard, but in this case the stagger may be spread over a 30 minute period which is rather long. Tagging as bug, so we can add a cap to the stagger period.

Reply to this email directly or view it on GitHub: https://github.com/hashicorp/consul/issues/954#issuecomment-103238880

stevendborrelli commented 9 years ago

We've been bitten by this too. We run some low-priority host configuration checks intermittently. Critical checks cause the node to go offline in DNS, which causes problems across the board.

langston-barrett commented 9 years ago

I wonder if this problem could be fixed by having a special case - the first round of health checks could be run immediately after the node is registered in Consul, and the timer can begin from there.

langston-barrett commented 9 years ago

@onnimonni Solution posted in issue #1085: "[#962] actually should solve this case by allowing the ability to provide the check status when registering the check. The confusion is probably that we forgot to document it on the web docs, so we can probably just create a ticket for that instead. @siddharthist does that satisfy your use case?"

onnimonni commented 9 years ago

Solution in #1085 is okay, but I would still like to run these tests asap in the first-run even if it causes thundering heard.

Could we have it as variable?

{
  "check": {
    "id": "mem",
    "script": "/bin/check_mem",
    "interval": "10s",
    "allow-thundering-heard": "yes-please"
  }
}

Just kidding, but for me it would be really useful to run some tests once in start and then do them only once in a while later. This could be used to prioritize checks as well.

armon commented 9 years ago

Hmm. What if we just bound the maximum stagger period to say 60 seconds? So even with a 30m check it will be run in a relatively timely manner. I prefer sane default behavior over having a million knobs to turn.

jvoorhis commented 9 years ago

:+1: for this ticket. I would like to have faster feedback while writing/debugging hourly checks. Alternatively, it would be great to have a API or CLI command to force re-execution of a check.

qingatqlik commented 9 years ago

Is it possible to allow an optional parameter to specify the maximum interval which the first check should arrive? So you can set the interval to 30 minutes but specifying the first check should arrive within 1 second after registering the service for instance? Much like when you schedule a task at fixed rate in Java where you can specify an initial delay and a re-occurring time period. If such initial delay is set to 0, then the first health check should be done immediately after registration.

c-robinson commented 7 years ago

I would also really appreciate a solution to this problem. We have a similar use-case were we'd like to be able to force a host to run it's checks ASAP. The goal for us is to be able to prove the status of a subset of our fleet before making a change to a member of it (for example: we monitor the health of a pair of network gateways. Before taking one out of service, we'd like to ensure that there are no problems on the other).

asteven commented 6 years ago

I would also really appreciate a solution to this problem. We currently have checks that run once a day :-(. The solution outlined by @armon like 3 years ago seems reasonable, not?

maximum stagger period to say 60 seconds

Or make it configurable. Default to what it is now but allow me to change to say 60 seconds or whatever in the consul agents config.

Alternatively if the agent checks http api would be writable I could at least work around the issue by running my checks manually and patching the check status.

Something like:

status="$(run-check-manually my-check-id)"
curl \
    --request PATCH \
    --data "$(printf '{"Status": "%s"}' "$status")" \
    https://consul.rocks/v1/agent/check/update/my-check-id

Unfortunately the above currently does not work as the /check/update/ endpoint only works for checks of type TTL :-(

zorro786 commented 3 years ago

@onnimonni We purposely stagger start the checks to avoid a thundering heard

@armon For normal shorter interval checks like very 30 seconds, what is the initial delay bound after staggering?