Closed nook24 closed 6 years ago
Here are a few other examples of what happens if I just reload my naemon core a few times: check_interval=7200, interval_length=1
last_check=1477979245 => 01.11.2016 06:47:25
next_check=1477988845 => 01.11.2016 09:27:25
diff: 9600
last_check=1477979245 => 01.11.2016 06:47:25
next_check=1477986516 => 01.11.2016 08:48:36
diff: 7271
last_check=1477979245 => 01.11.2016 06:47:25
next_check=1477985363 => 01.11.2016 08:29:23
diff: 6118
last_check=1477979245 => 01.11.2016 06:47:25
next_check=1477986233 => 01.11.2016 08:43:53
diff: 6988
last_check=1477979245 => 01.11.2016 06:47:25
next_check=1477990720 => 01.11.2016 09:58:40
diff: 11475
I don't really get why next_check gets recalculated on every reload...
Hi, I'm on naemon 1.0.3. In my status.dat, services and hosts are rescheduled correctly (last/next_check). We always do a restart instead of a reload... but also multiple times a day.
Maybe you should use the default interval_length of 60 and give this a try
INTERVAL LENGTH This is the seconds per unit interval as used in the host/contact/service configuration files. Setting this to 60 means that each interval is one minute long (60 seconds). Other settings have not been tested much, so your mileage is likely to vary...
interval_length=60
The check_interval is a guideline for how often the check should be executed, but not a promise that holds across restarts.
In fact, let's say you have 100000 checks running every 300 seconds, and you reboot your monitoring server so that it is down for 150 seconds. That means half of the checks are due within the first 150 seconds after startup, while the other half should already have been executed during the downtime, so the scheduler either has to hurry through them as fast as possible or skip a check. Let's say it gives up on the old values and spreads them evenly.
Thus, we end up with 3/4 of the checks executing in the first 150 seconds and 1/4 in the next 150 seconds; then it goes back up to 3/4 of the checks, since the first batch is going to be rechecked.
So there is an assumption that restarts happen so seldom that there are several check intervals between each restart. Thus, all checks are randomly spread out on start, providing an even load on the system.
It therefore doesn't work that well to have a check_interval of a couple of hours if it's not OK to occasionally miss a check, even though that should happen quite seldom, depending on the restart frequency.
I believe it's better to just run the check more often to get better response time, if response time is what you're aiming at.
Maybe you should use the default interval_length of 60 and give this a try
The interval_length is just multiplied by the check_interval when adding the next event to the main loop. Thus, having interval_length=60 and check_interval=5 is the same as interval_length=1 and check_interval=300 (at least it should be; if it isn't, file a bug).
wasn't aware of this. thx for the info
I guess the ranged_urand(0, get_host_check_interval_s(temp_host)) part is used for load balancing the system, but the random range is way too big, especially in my example where my host has a check interval of 2h. But the same code is inside of checks_service.c, and services sometimes have 1d as check_interval, for example for SSL certs.
It's not too big. There is no value that works for all cases. The only way to get an even load with N checks over M seconds is to have an even check density of N/M checks per second. Shortening the range of that random number changes the density at the beginning of the check cycle to N/K checks per second, where K is less than M; since N/K is higher than N/M, you get a higher load for a while.
A good random number generator has an even density, so we should place every check at a random time, evenly distributed within one check_interval length, to keep an even distribution over time.
Regarding SSL checks, it shouldn't be hard for the system to check every 1h, even for many certs. I believe the resource cost of a non-even distribution would be higher than the cost of rechecking SSL certs a little more often than once every 1d, for example about every hour (which can sometimes stretch to 2h).
The issue I run into is that my master naemon has additional_freshness_latency=300 set
in naemon.cfg, so my distributed systems get a time bonus of 5 minutes to compensate for load peaks, bad connections, system clock differences, check latency and so on...
Of course a check_interval of 1d is not that smart, but if I reduce it to 2h I still have the same issue: if I reload/restart my master naemon and my distributed naemon, both instances will create new and different random values for next_check.
So my master naemon may panic and fire up a freshness check before my distributed naemon instance thinks it's time to run the check again. I guess the only way to fix this is to set additional_freshness_latency to a high value, which is bad because then I can't see when a check is old because of a lost connection. Or I reduce the check_interval, but that also won't fix the issue as long as my check_interval is > 5 minutes.
What I don't really get is: why does naemon recalculate the next_check on restart at all? I guess this only makes sense if next_check is 0, or not? //Edit: or if next_check is < NOW on naemon startup.
In that case, it sounds more like a problem with the distribution. What do you use for distributing naemon then?
We need to recalculate the next check so that we don't get huge load spikes after restarting the system.
@pengi I create a passive host on my master system like this:
define host{
    use                     efbee68c-cf48-4b78-83f5-c856c56177f0
    host_name               8ad25d51-6a3f-4e4c-bd2e-1b23d48872d0
    display_name            staging-sat-2000
    address                 127.0.0.1
    alias                   default host
    ;Check settings:
    check_command           2106cf0bf26a82af262c4078e6d9f94eded84d2a ;check_fresh
    check_period            41012866-6114-4853-9caf-6ffd19954e50 ;24x7
    check_interval          600
    retry_interval          60
    max_check_attempts      3
    active_checks_enabled   0
    check_freshness         1
    freshness_threshold     900
    ;Notification settings:
    notification_interval   7200
    notification_period     41012866-6114-4853-9caf-6ffd19954e50
    notification_options    u,r
    ;Flap detection settings:
    flap_detection_enabled  0
    ;Everything else:
}
On the distributed instance my config looks like this:
define host{
    use                     efbee68c-cf48-4b78-83f5-c856c56177f0
    host_name               8ad25d51-6a3f-4e4c-bd2e-1b23d48872d0
    display_name            staging-sat-2000
    address                 127.0.0.1
    alias                   default host
    ;Check settings:
    check_command           5a538ebc-03de-4ce6-8e32-665b841abde3
    check_period            41012866-6114-4853-9caf-6ffd19954e50
    check_interval          600
    retry_interval          60
    max_check_attempts      3
    active_checks_enabled   1
    ;Notification settings:
    notification_interval   7200
    notification_period     41012866-6114-4853-9caf-6ffd19954e50
    notification_options    u,r
    ;Flap detection settings:
    flap_detection_enabled  0
    ;Everything else:
}
You can ignore the uuid stuff; it's just because of auto-generated configs.
The distributed naemon instance uses an ochp/ocsp command to write the results to a local gearman queue. On the master system there is basically a while(true) loop running, which fetches results out of the remote gearman-job-server and writes them to the local naemon.qh. This setup has been running for a few years now... (in later versions via naemon.cmd or the checkresult spool dir)
In this screenshot you can see that the distributed system was reporting the check results without any issues (5:17am - 6:17am, 31.10.16).
And in this one (6:22am - 8:47am, 31.10.16) you can see that, due to some naemon restarts, I get "Service is no longer current" messages, which are caused by check_fresh.
These are the restarts from the master naemon.log:
[31.10.2016 - 09:16:59] Naemon 1.0.5-source starting... (PID=27729)
[31.10.2016 - 09:43:30] Caught 'Hangup', restarting...
[31.10.2016 - 09:43:34] Naemon 1.0.5-source starting... (PID=27729)
Then it takes some time until the system is running normally again. On this example system it is not that bad, because every service has a check_interval of 300 secs and the hosts are checked every 600 secs.
So that's not a problem of the distributed solution. The issue is that the last check result on the master naemon is too old, due to the huge difference between the last_check and next_check values :(
PS: Every time a reload of the master naemon is done, the distributed naemon gets automatically reloaded a few seconds later as well.
But this is not an issue of the DM itself; it's just that, because of the freshness checking, you notice that the checks get scheduled at unexpected times.
I think we should just keep the next_check across restarts and only reschedule a check by a few seconds if its next_check is already in the past, which normally should only happen if you stop naemon for a few minutes. This would also make restarts faster. Also, we should cap the reschedule window at, say, 5 minutes.
Any update for this? :)
merlin also complains about expired checks in "mon check distribution". It seems like the other peers are waiting for results from the restarting peer. To keep this list cleaned up fast, I think sni's idea would be nice.
Hi, yesterday I was asking in the IRC channel if there is a known issue with host rescheduling in Naemon 1.0.3 or 1.0.5.
I noticed that the distance between last_check and next_check in status.dat and retention.dat is much larger than the check_interval allows.
For example:
I also tried with
interval_length=60
but this causes the same issue. I even noticed that next_check
was calculated anew on every Naemon restart/reload, but it was not just last_check + check_interval or NOW() + check_interval; the next_check timestamp was set to random values I didn't understand. So I started some investigation and found in
checks_init_hosts
in checks_host.c
the following code: I guess the
ranged_urand(0, get_host_check_interval_s(temp_host))
part is used for load balancing the system, but the random range is way too big, especially in my example where my host has a check interval of 2h. But the same code is inside of checks_service.c, and services sometimes have 1d as check_interval, for example for SSL certs. I only noticed this due to distributed monitoring. At night all systems were running well and reporting their check results to the naemon master. But on an average working day, the master randomly fires the
check_fresh
command. So I started to take a look at the system and found that next_check > last_check + check_interval. With every naemon restart it gets worse and worse 🙈 But I was able to force an example which is much worse than the one given above: check_interval=120
As you can see, just by restarting Naemon I can prevent checks with a long check_interval from ever being executed, if I am lucky with urand().