naemon / naemon-core

Networks, Applications and Event Monitor
http://www.naemon.io/
GNU General Public License v2.0

Not all checks being executed #429

Closed Bryce-Souers closed 1 year ago

Bryce-Souers commented 1 year ago

I have the containerized version of Naemon pulled from Dockerhub running on a VM.

The Naemon instance holds 11k hosts, with their active checks turned off after the first check (it performs an initial host check, which is configured to always return OK, then turns active checks off).

Each host has 2 services:

  1. Runs every hour
  2. Runs every 10 minutes

For some reason, out of the 22k services that are supposed to be checked, 65 are never checked. They are stuck in pending.

I waited an entire week and they never got checked.

I checked the Performance Info page in Thruk and it says that 98.6% of services have been actively checked since program start.

The average latency is under 1 second, so I don’t believe this to be a performance issue.

The checks that are stuck pending have a “Next Scheduled Time”. When that time comes, the check doesn’t execute, but it gets a new “Next Scheduled Time” in the future.

Maybe interestingly, ONLY check #1 (the one that runs every hour) gets stuck in pending.

If I manually force a scheduled check, it works fine (the check runs once) but it still will not be executed on its scheduled time.

The debug.log has no entries about these “ghost” (that’s what I call them) services.

It’s like Naemon is choosing to just ignore them. Not sure what to do here.

Any ideas?

sni commented 1 year ago

If Naemon just moves the next_check without actually doing anything, it means Naemon could not find a valid next_check time. Reasons could be an invalid timeperiod or a failed check dependency.

Bryce-Souers commented 1 year ago

What should I check here then? If it was a dependency issue then wouldn't forcing a scheduled check through Thruk also not work?

sni commented 1 year ago

If you force a check, dependencies won't be checked. You could enable the debug.log and increase the debug level. The reason should be in the log then.

Bryce-Souers commented 1 year ago

Ok I will try that, thank you.

Maybe I'm misunderstanding what "dependency" means in this context. Are there docs I can read about this?

sni commented 1 year ago

I am talking about those: http://www.naemon.io/documentation/usersguide/dependencies.html

Bryce-Souers commented 1 year ago

I have no dependencies defined, and the debug level was already set to the maximum verbosity possible.

What other things can I check for what's going wrong here?

Bryce-Souers commented 1 year ago

I was able to get something to show in the debug.log:

[1684177440.617662] [016.0] [pid=2623931] Service 'check_2days_uptime' on host 'REDACTED' handle_service_check_event()...

But this is the only place where this host's name appears; there is nothing after it.

Looking at handle_service_check_event() in checks_service.c, it could be:

  1. evprop->execution_type != EVENT_EXEC_NORMAL (IDK where to check this?)
  2. execute_service_checks == FALSE
  3. temp_service->checks_enabled == FALSE
  4. check_time_against_period(time(NULL), temp_service->check_period_ptr) == ERROR
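
The four candidates above can be lined up as a small decision function. This is only a sketch of the list as written in this comment, not a port of checks_service.c: `CheckContext` and `skip_reason` are hypothetical names, and the order of the tests is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckContext:
    """Hypothetical stand-in for the state the four conditions look at."""
    normal_execution: bool        # evprop->execution_type == EVENT_EXEC_NORMAL
    execute_service_checks: bool  # global execute_service_checks setting
    checks_enabled: bool          # temp_service->checks_enabled
    inside_check_period: bool     # check_time_against_period(...) != ERROR


def skip_reason(ctx: CheckContext) -> Optional[str]:
    """Return why a scheduled check would be skipped, or None if it may run."""
    if not ctx.normal_execution:
        return "execution_type is not EVENT_EXEC_NORMAL"
    if not ctx.execute_service_checks:
        return "execute_service_checks is FALSE"
    if not ctx.checks_enabled:
        return "checks_enabled is FALSE for this service"
    if not ctx.inside_check_period:
        return "current time is outside the check_period"
    return None


if __name__ == "__main__":
    # A service that is enabled everywhere but whose timeperiod excludes "now".
    print(skip_reason(CheckContext(True, True, True, False)))
```

Going through the conditions one at a time like this makes it easier to rule them out against what Thruk shows for the service.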

Here is a screenshot from Thruk of the service:

[Screenshot: Thruk service details, 2023-05-15 2:36:46 PM]

I don't see anything in Thruk showing that it could hit any of 2, 3, or 4. Any ideas?

Bryce-Souers commented 1 year ago

Figured it out!

We had a timeperiod which was defined as follows:

define timeperiod{
         timeperiod_name         nan-reload-schedule
         alias                   NAN Reload Schedule (Don't schedule between 4 minutes - 6 minutes after the hour)
         monday                  00:00-00:04,00:06-01:04,01:06-02:04,02:06-03:04,03:06-04:04,04:06-05:04,5:06-6:04,06:06-07:04,07:06-08:04,08:06-09:04,09:06-10:04,10:06-11:04,11:06-12:04,12:06-13:04,13:06-14:04,14:06-15:04,15:06-16:04,16:06-17:04,17:06-18:04,18:06-19:04,19:06-20:04,20:06-21:04,21:06-22:04,22:06-23:04,23:06-24:00
         tuesday                 00:00-00:04,00:06-01:04,01:06-02:04,02:06-03:04,03:06-04:04,04:06-05:04,5:06-6:04,06:06-07:04,07:06-08:04,08:06-09:04,09:06-10:04,10:06-11:04,11:06-12:04,12:06-13:04,13:06-14:04,14:06-15:04,15:06-16:04,16:06-17:04,17:06-18:04,18:06-19:04,19:06-20:04,20:06-21:04,21:06-22:04,22:06-23:04,23:06-24:00
         wednesday               00:00-00:04,00:06-01:04,01:06-02:04,02:06-03:04,03:06-04:04,04:06-05:04,5:06-6:04,06:06-07:04,07:06-08:04,08:06-09:04,09:06-10:04,10:06-11:04,11:06-12:04,12:06-13:04,13:06-14:04,14:06-15:04,15:06-16:04,16:06-17:04,17:06-18:04,18:06-19:04,19:06-20:04,20:06-21:04,21:06-22:04,22:06-23:04,23:06-24:00
         thursday                00:00-00:04,00:06-01:04,01:06-02:04,02:06-03:04,03:06-04:04,04:06-05:04,5:06-6:04,06:06-07:04,07:06-08:04,08:06-09:04,09:06-10:04,10:06-11:04,11:06-12:04,12:06-13:04,13:06-14:04,14:06-15:04,15:06-16:04,16:06-17:04,17:06-18:04,18:06-19:04,19:06-20:04,20:06-21:04,21:06-22:04,22:06-23:04,23:06-24:00
         friday                  00:00-00:04,00:06-01:04,01:06-02:04,02:06-03:04,03:06-04:04,04:06-05:04,5:06-6:04,06:06-07:04,07:06-08:04,08:06-09:04,09:06-10:04,10:06-11:04,11:06-12:04,12:06-13:04,13:06-14:04,14:06-15:04,15:06-16:04,16:06-17:04,17:06-18:04,18:06-19:04,19:06-20:04,20:06-21:04,21:06-22:04,22:06-23:04,23:06-24:00
         saturday                00:00-00:04,00:06-01:04,01:06-02:04,02:06-03:04,03:06-04:04,04:06-05:04,5:06-6:04,06:06-07:04,07:06-08:04,08:06-09:04,09:06-10:04,10:06-11:04,11:06-12:04,12:06-13:04,13:06-14:04,14:06-15:04,15:06-16:04,16:06-17:04,17:06-18:04,18:06-19:04,19:06-20:04,20:06-21:04,21:06-22:04,22:06-23:04,23:06-24:00
         sunday                  00:00-00:04,00:06-01:04,01:06-02:04,02:06-03:04,03:06-04:04,04:06-05:04,5:06-6:04,06:06-07:04,07:06-08:04,08:06-09:04,09:06-10:04,10:06-11:04,11:06-12:04,12:06-13:04,13:06-14:04,14:06-15:04,15:06-16:04,16:06-17:04,17:06-18:04,18:06-19:04,19:06-20:04,20:06-21:04,21:06-22:04,22:06-23:04,23:06-24:00
         }
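
To see which times of day a definition like this excludes, individual times can be tested against the ranges. The sketch below is not Naemon's parser: `build_day_spec`, `parse_ranges` and `in_period` are hypothetical helpers, the day string is regenerated (zero-padded) rather than copied from the config, and treating each range as start-inclusive and end-exclusive is an assumption.

```python
from typing import List, Tuple


def build_day_spec() -> str:
    """Rebuild one day's range list: every hour except minutes :04 to :06."""
    parts = ["00:00-00:04"]
    parts += [f"{h:02d}:06-{h + 1:02d}:04" for h in range(23)]
    parts.append("23:06-24:00")
    return ",".join(parts)


def parse_ranges(spec: str) -> List[Tuple[int, int]]:
    """Parse 'HH:MM-HH:MM,...' into (start, end) pairs in minutes since midnight."""
    ranges = []
    for part in spec.split(","):
        start, end = part.strip().split("-")
        sh, sm = map(int, start.split(":"))
        eh, em = map(int, end.split(":"))
        ranges.append((sh * 60 + sm, eh * 60 + em))
    return ranges


def in_period(ranges: List[Tuple[int, int]], hhmm: str) -> bool:
    """True if the time of day falls inside any range (end-exclusive, by assumption)."""
    h, m = map(int, hhmm.split(":"))
    minute = h * 60 + m
    return any(start <= minute < end for start, end in ranges)


if __name__ == "__main__":
    ranges = parse_ranges(build_day_spec())
    for t in ("13:03", "13:05", "13:10"):
        print(t, "allowed" if in_period(ranges, t) else "excluded")
```

Under these assumptions, a time that lands at five minutes past any hour is outside the period, so a check scheduled for that moment is not allowed to run then.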

This was stopping services from being executed when they were expected. Thanks for your input @sni

Closing issue.

sni commented 1 year ago

Glad to see it's working now.