naemon / naemon-core

Networks, Applications and Event Monitor
http://www.naemon.io/
GNU General Public License v2.0

Host and service checks may never be executed #156

Closed — nook24 closed this issue 6 years ago

nook24 commented 7 years ago

Hi, yesterday I was asking in the IRC channel whether there is a known issue with host rescheduling in Naemon 1.0.3 or 1.0.5.

I noticed that the distance between last_check and next_check in status.dat and retention.dat is much larger than check_interval allows.

For Example:

last_check=1477914906 => 12:55:06(CET)
next_check=1477924669 => 15:37:49(CET)
check_interval=7200
interval_length=1

Diff of the timestamps: 9763

I also tried with interval_length=60, but this causes the same issue. I also noticed that next_check was recalculated on every Naemon restart/reload, but it was not simply last_check + check_interval or NOW() + check_interval; the next_check timestamp was set to seemingly random values I didn't understand.

So I started some investigation and found the following code in checks_init_hosts in checks_host.c:

for (temp_host = host_list; temp_host != NULL; temp_host = temp_host->next) {

    /* update status of all hosts (scheduled or not) */
    update_host_status(temp_host, FALSE);

    /* schedule a new host check event */
    schedule_next_host_check(temp_host, ranged_urand(0, get_host_check_interval_s(temp_host)), CHECK_OPTION_NONE);
}

I guess the ranged_urand(0, get_host_check_interval_s(temp_host)) part is used to balance the load on the system, but the random range is way too big, especially in my example where my host has a check interval of 2h. The same code is in checks_service.c, and services sometimes have a check_interval of 1d, for example for SSL certs.

I only noticed this because of distributed monitoring. At night all systems were running well and reported their check results to the Naemon master. But on an average working day, the master randomly fires the check_fresh command. So I took a look at the system and found that next_check > last_check + check_interval. With every Naemon restart it gets worse and worse 🙈

But I was able to force an example that is much worse than the one given above: check_interval=120

Naemon Restart:
last_check=1477562797 => Thu Oct 27 12:06:37 CEST 2016
next_check=1477931604 => Mon Oct 31 17:33:24 CET 2016

Naemon Restart
last_check=1477562797 => Thu Oct 27 12:06:37 CEST 2016
next_check=1477936368 => Mon Oct 31 18:52:48 CET 2016

As you can see, just by restarting Naemon I can prevent checks with a long check_interval from ever being executed, if I am lucky with urand().

nook24 commented 7 years ago

Here are a few more examples of what happens if I just reload my Naemon core a few times: check_interval=7200 interval_length=1

last_check=1477979245 => 01.11.2016 06:47:25
next_check=1477988845 => 01.11.2016 09:27:25
diff: 9600

last_check=1477979245 => 01.11.2016 06:47:25
next_check=1477986516 => 01.11.2016 08:48:36
diff: 7271

last_check=1477979245 => 01.11.2016 06:47:25
next_check=1477985363 => 01.11.2016 08:29:23
diff: 6118

last_check=1477979245 => 01.11.2016 06:47:25
next_check=1477986233 => 01.11.2016 08:43:53
diff: 6988

last_check=1477979245 => 01.11.2016 06:47:25
next_check=1477990720 => 01.11.2016 09:58:40
diff: 11475

I don't really get why next_check is recalculated on every reload...

sponkydive commented 7 years ago

Hi, I'm on Naemon 1.0.3. In my status.dat, services and hosts are rescheduled correctly (last/next_check). We always do a restart instead of a reload... but also multiple times a day.

Maybe you should use the default interval_length of 60 and give that a try:

INTERVAL LENGTH This is the seconds per unit interval as used in the host/contact/service configuration files. Setting this to 60 means that each interval is one minute long (60 seconds). Other settings have not been tested much, so your mileage is likely to vary...

interval_length=60

pengi commented 7 years ago

The check_interval is a guideline for how often the check should be executed, but not a promise across restarts.

In fact, let's say you have 100,000 checks every 300 seconds, you reboot your monitoring server, and it is down for 150 seconds. That means half of the checks would be due within the first 150 seconds, and the other half should already have been checked; so the scheduler would have to hurry through those as fast as possible, skip a check, or similar. Let's say it gives up on the old values and spreads them evenly.

Thus, we end up with 3/4 of the checks executing in the first 150 seconds and 1/4 in the next 150 seconds, after which it goes back up to 3/4 of the checks, since the first batch is due to be rechecked.

So there is an assumption that restarts happen so seldom that there are several check intervals between each restart. Thus, all checks are randomly spread out on start, providing an even load on the system.

Thus, a check_interval of a couple of hours doesn't work that well if it's not OK to occasionally miss a check, even though that should happen quite seldom depending on restart frequency.

I believe it's better to just run the check more often, to get better response time, if response time is what you're aiming for.

pengi commented 7 years ago

Maybe you should use the default interval_length of 60 and give this a try

The interval_length is just multiplied with the check_interval when adding the next event to the main loop. Thus, interval_length=60 with check_interval=5 is the same as interval_length=1 with check_interval=300 (at least it should be; if it isn't, file a bug).

sponkydive commented 7 years ago

I wasn't aware of this. Thanks for the info.

pengi commented 7 years ago

I guess the ranged_urand(0, get_host_check_interval_s(temp_host)) part is used to balance the load on the system, but the random range is way too big, especially in my example where my host has a check interval of 2h. The same code is in checks_service.c, and services sometimes have a check_interval of 1d, for example for SSL certs.

It's not too big. There is no value that works for everyone. The only way to get an even load with N checks over M seconds is to have an even check density of N/M checks per second. Shortening the range of that random number changes the density at the beginning of the check cycle to N/K checks per second, where K is less than M; since N/K is higher than N/M, you get a higher load for a while.

A good random number generator has an even density, so we should place every check at a random time, uniformly distributed within one check_interval, to keep an even distribution over time.

Regarding SSL checks: it shouldn't be hard for the system to check every 1h, even for many certs. I believe the resource cost of a non-even distribution would be higher than rechecking SSL certs a little more often than once every 1d, for example about every hour (which can sometimes stretch to 2h).

nook24 commented 7 years ago

The issue I run into is that my master Naemon has additional_freshness_latency=300 set in naemon.cfg. So my distributed systems get a time bonus of 5 minutes to compensate for load peaks, bad connections, system clock differences, check latency and so on...

Of course a check_interval of 1d is not that smart, but if I reduce it to 2h I still have the same issue. If I reload/restart my master Naemon and my distributed Naemon, both instances will generate new and different random values for next_check.

So my master Naemon may panic and fire a freshness check before my distributed Naemon instance thinks it's time to reschedule the check. I guess the only way to fix this is to set additional_freshness_latency to a high value, which is bad because then I won't notice that a check is stale due to a lost connection. Or I could reduce the check_interval, but that won't fix the issue either, as long as my check_interval is > 5 minutes.

What I don't really get is why Naemon recalculates next_check on restart. I guess this only makes sense if next_check is 0, or not? //Edit: Or if next_check is < NOW on Naemon startup.

pengi commented 7 years ago

In that case it sounds more like a problem with the distribution. What do you use for distributing Naemon, then?

We need to recalculate the next check so we don't get huge load spikes right after restarting the system.

nook24 commented 7 years ago

@pengi I create a passive service on my master system like this:

define host{
    use                                 efbee68c-cf48-4b78-83f5-c856c56177f0
    host_name                           8ad25d51-6a3f-4e4c-bd2e-1b23d48872d0
    display_name                        staging-sat-2000
    address                             127.0.0.1
    alias                               default host

    ;Check settings:
    check_command                       2106cf0bf26a82af262c4078e6d9f94eded84d2a ;check_fresh
    check_period                        41012866-6114-4853-9caf-6ffd19954e50 ;24x7
    check_interval                      600
    retry_interval                      60
    max_check_attempts                  3
    active_checks_enabled               0
    check_freshness                     1
    freshness_threshold                 900

    ;Notification settings:
    notification_interval               7200
    notification_period                 41012866-6114-4853-9caf-6ffd19954e50
    notification_options                u,r

    ;Flap detection settings:
    flap_detection_enabled              0

    ;Everything else:
}

On the distributed instance my config looks like this:

define host{
    use                                 efbee68c-cf48-4b78-83f5-c856c56177f0
    host_name                           8ad25d51-6a3f-4e4c-bd2e-1b23d48872d0
    display_name                        staging-sat-2000
    address                             127.0.0.1
    alias                               default host

    ;Check settings:
    check_command                       5a538ebc-03de-4ce6-8e32-665b841abde3
    check_period                        41012866-6114-4853-9caf-6ffd19954e50
    check_interval                      600
    retry_interval                      60
    max_check_attempts                  3
    active_checks_enabled               1

    ;Notification settings:
    notification_interval               7200
    notification_period                 41012866-6114-4853-9caf-6ffd19954e50
    notification_options                u,r

    ;Flap detection settings:
    flap_detection_enabled              0

    ;Everything else:
}

You can ignore the UUID stuff; it's just because of auto-generated configs.

The distributed Naemon instance uses an ochp/ocsp command to write the results to a local Gearman queue. On the master system there is basically a while(true) loop running, fetching results out of the remote gearman-job-server and writing them to the local naemon.qh. This has been running for a few years now... (in later versions via naemon.cmd or the check result spool dir)

In this screenshot you can see that the distributed system was reporting the check results without any issues (5:17am - 6:17am 31.10.16). [screenshot]

(6:22am - 8:47am 31.10.16) [screenshot]

And here you can see that, due to some Naemon restarts, I get "Service is no longer current" messages, which are caused by check_fresh. [screenshot]

These are the restarts from the master naemon.log:

[31.10.2016 - 09:16:59] Naemon 1.0.5-source starting... (PID=27729)
[31.10.2016 - 09:43:30] Caught 'Hangup', restarting...
[31.10.2016 - 09:43:34] Naemon 1.0.5-source starting... (PID=27729)

Then it takes some time until the system is running normally again. On this example system it is not that bad, because every service has a check_interval of 300 seconds and the hosts are checked every 600 seconds.

So that's not a problem of the distributed solution. The issue is that the last check result in the master Naemon is too old, due to the huge difference between the last_check and next_check values :(

PS: Every time a reload of the master Naemon is done, a few seconds later the distributed Naemon gets automatically reloaded as well.

nook24 commented 7 years ago

But this is not an issue of the DM itself; due to the freshness checking you just notice that the check gets scheduled at unexpected times.

sni commented 7 years ago

I think we should just keep the next_check across restarts and only reschedule a check by a few seconds if its next_check is already in the past, which should normally only happen if you stop Naemon for a few minutes. This would also make restarts even faster. Also, we should cap the reschedule window at, let's say, 5 minutes.

nook24 commented 7 years ago

Any update for this? :)

sponkydive commented 7 years ago

merlin also complains in "mon check distribution" about expired checks. It seems the other peers are waiting for results from the restarting peer. To get this list cleaned up quickly, I think sni's idea would be nice.

jacobbaungard commented 6 years ago

Closed by https://github.com/naemon/naemon-core/pull/259