Open onnimonni opened 9 years ago
@onnimonni We never considered the case of such long intervals. Most of the checks are definited on fairly short intervals. We purposely stagger start the checks to avoid a thundering heard, but in this case the stagger may be spread over a 30 minute period which is rather long. Tagging as bug, so we can add a cap to the stagger period.
I'm running rather intensive security scanning check for wordpress using consul and it's ok if node is under a pressure during the start for this, but after that it's ok to run it only seldomly
Thanks for considering it :)
On Tue, May 19, 2015 at 1:43 AM, Armon Dadgar notifications@github.com wrote:
@onnimonni We never considered the case of such long intervals. Most of the checks are definited on fairly short intervals. We purposely stagger start the checks to avoid a thundering heard, but in this case the stagger may be spread over a 30 minute period which is rather long. Tagging as bug, so we can add a cap to the stagger period.
Reply to this email directly or view it on GitHub: https://github.com/hashicorp/consul/issues/954#issuecomment-103238880
We've been bitten by this too. We run some low-priority host configuration checks intermittently. Critical checks cause the node to go offline in DNS, which causes problems across the board.
I wonder if this problem could be fixed by having a special case - the first round of health checks could be run immediately after the node is registered in Consul, and the timer can begin from there.
@onnimonni Solution posted in issue #1085: "[#962] actually should solve this case by allowing the ability to provide the check status when registering the check. The confusion is probably that we forgot to document it on the web docs, so we can probably just create a ticket for that instead. @siddharthist does that satisfy your use case?"
Solution in #1085 is okay, but I would still like to run these tests asap in the first-run even if it causes thundering heard.
Could we have it as variable?
{
"check": {
"id": "mem",
"script": "/bin/check_mem",
"interval": "10s",
"allow-thundering-heard": "yes-please"
}
}
Just kidding, but for me it would be really useful to run some tests once in start and then do them only once in a while later. This could be used to prioritize checks as well.
Hmm. What if we just bound the maximum stagger period to say 60 seconds? So even with a 30m check it will be run in a relatively timely manner. I prefer sane default behavior over having a million knobs to turn.
:+1: for this ticket. I would like to have faster feedback while writing/debugging hourly checks. Alternatively, it would be great to have a API or CLI command to force re-execution of a check.
Is it possible to allow an optional parameter to specify the maximum interval which the first check should arrive? So you can set the interval to 30 minutes but specifying the first check should arrive within 1 second after registering the service for instance? Much like when you schedule a task at fixed rate in Java where you can specify an initial delay and a re-occurring time period. If such initial delay is set to 0, then the first health check should be done immediately after registration.
I would also really appreciate a solution to this problem. We have a similar use-case were we'd like to be able to force a host to run it's checks ASAP. The goal for us is to be able to prove the status of a subset of our fleet before making a change to a member of it (for example: we monitor the health of a pair of network gateways. Before taking one out of service, we'd like to ensure that there are no problems on the other).
I would also really appreciate a solution to this problem. We currently have checks that run once a day :-(. The solution outlined by @armon like 3 years ago seems reasonable, not?
maximum stagger period to say 60 seconds
Or make it configurable. Default to what it is now but allow me to change to say 60 seconds or whatever in the consul agents config.
Alternatively if the agent checks http api would be writable I could at least work around the issue by running my checks manually and patching the check status.
Something like:
status="$(run-check-manually my-check-id)"
curl \
--request PATCH \
--data "$(printf '{"Status": "%s"}' "$status")" \
https://consul.rocks/v1/agent/check/update/my-check-id
Unfortunately the above currently does not work as the /check/update/ endpoint only works for checks of type TTL :-(
@onnimonni We purposely stagger start the checks to avoid a thundering heard
@armon For normal shorter interval checks like very 30 seconds, what is the initial delay bound after staggering?
When I start one node of the cluster all of the checks start at the state critical. Some checks have really long intervals ( 30minutes ) and they will be critical until the first check. Is this by design?
Do you know if there are solutions to run tests more often when they're failing aka in 'critical' state?