Closed Bryce-Souers closed 1 year ago
If Naemon just moves the next_check without actually doing anything, it means Naemon could not find a valid next_check time. Possible reasons are an invalid timeperiod or a failed check dependency.
What should I check here then? If it was a dependency issue then wouldn't forcing a scheduled check through Thruk also not work?
If you force a check, dependencies won't be checked. You could enable the debug.log and increase the debug level. The reason should then show up in the log.
Ok I will try that, thank you.
Maybe I'm misunderstanding what "dependency" means in this context. Are there docs I can read about it?
I am talking about those: http://www.naemon.io/documentation/usersguide/dependencies.html
I have no dependencies defined, and the debug level was already set to the maximum verbosity possible.
What other things can I check for what's going wrong here?
I was able to get something to show in the debug.log:
[1684177440.617662] [016.0] [pid=2623931] Service 'check_2days_uptime' on host 'REDACTED' handle_service_check_event()...
But this is the only place where this host's name ever appears in the log.
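For anyone searching a large debug.log for a single service, here is a minimal Python sketch that splits a line like the one above into its fields. It assumes the first bracketed field is a Unix epoch timestamp with microseconds and the second is `level.verbosity`; that is inferred from the shape of the line, not taken from Naemon's documentation. The function name `parse_debug_line` is hypothetical.

```python
import re
from datetime import datetime, timezone

# Assumed layout: [epoch.usec] [level.verbosity] [pid=N] message
LINE_RE = re.compile(
    r"^\[(?P<secs>\d+)\.(?P<usecs>\d+)\] "
    r"\[(?P<level>\d+)\.(?P<verbosity>\d+)\] "
    r"\[pid=(?P<pid>\d+)\] "
    r"(?P<message>.*)$"
)

def parse_debug_line(line):
    """Return the fields of one debug.log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return {
        # Whole seconds only; the microsecond part is ignored here.
        "time": datetime.fromtimestamp(int(m["secs"]), tz=timezone.utc),
        "level": int(m["level"]),
        "verbosity": int(m["verbosity"]),
        "pid": int(m["pid"]),
        "message": m["message"],
    }

sample = (
    "[1684177440.617662] [016.0] [pid=2623931] Service 'check_2days_uptime' "
    "on host 'REDACTED' handle_service_check_event()..."
)
entry = parse_debug_line(sample)
print(entry["time"].isoformat(), entry["message"])
```

Under that assumption, the sample line converts to 19:04:00 UTC, so the event fired four minutes past the hour.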
Looking at checks_service.c for handle_service_check_event(), it could be:
Here is a screenshot from Thruk of the service:
I don't see anything in Thruk showing that it could hit any of 2, 3, or 4. Any ideas?
Figured it out!
We had a timeperiod which was defined as follows:
define timeperiod{
    timeperiod_name nan-reload-schedule
    alias           NAN Reload Schedule (Don't schedule between 4 minutes - 6 minutes after the hour)
    monday          00:00-00:04,00:06-01:04,01:06-02:04,02:06-03:04,03:06-04:04,04:06-05:04,5:06-6:04,06:06-07:04,07:06-08:04,08:06-09:04,09:06-10:04,10:06-11:04,11:06-12:04,12:06-13:04,13:06-14:04,14:06-15:04,15:06-16:04,16:06-17:04,17:06-18:04,18:06-19:04,19:06-20:04,20:06-21:04,21:06-22:04,22:06-23:04,23:06-24:00
    tuesday         00:00-00:04,00:06-01:04,01:06-02:04,02:06-03:04,03:06-04:04,04:06-05:04,5:06-6:04,06:06-07:04,07:06-08:04,08:06-09:04,09:06-10:04,10:06-11:04,11:06-12:04,12:06-13:04,13:06-14:04,14:06-15:04,15:06-16:04,16:06-17:04,17:06-18:04,18:06-19:04,19:06-20:04,20:06-21:04,21:06-22:04,22:06-23:04,23:06-24:00
    wednesday       00:00-00:04,00:06-01:04,01:06-02:04,02:06-03:04,03:06-04:04,04:06-05:04,5:06-6:04,06:06-07:04,07:06-08:04,08:06-09:04,09:06-10:04,10:06-11:04,11:06-12:04,12:06-13:04,13:06-14:04,14:06-15:04,15:06-16:04,16:06-17:04,17:06-18:04,18:06-19:04,19:06-20:04,20:06-21:04,21:06-22:04,22:06-23:04,23:06-24:00
    thursday        00:00-00:04,00:06-01:04,01:06-02:04,02:06-03:04,03:06-04:04,04:06-05:04,5:06-6:04,06:06-07:04,07:06-08:04,08:06-09:04,09:06-10:04,10:06-11:04,11:06-12:04,12:06-13:04,13:06-14:04,14:06-15:04,15:06-16:04,16:06-17:04,17:06-18:04,18:06-19:04,19:06-20:04,20:06-21:04,21:06-22:04,22:06-23:04,23:06-24:00
    friday          00:00-00:04,00:06-01:04,01:06-02:04,02:06-03:04,03:06-04:04,04:06-05:04,5:06-6:04,06:06-07:04,07:06-08:04,08:06-09:04,09:06-10:04,10:06-11:04,11:06-12:04,12:06-13:04,13:06-14:04,14:06-15:04,15:06-16:04,16:06-17:04,17:06-18:04,18:06-19:04,19:06-20:04,20:06-21:04,21:06-22:04,22:06-23:04,23:06-24:00
    saturday        00:00-00:04,00:06-01:04,01:06-02:04,02:06-03:04,03:06-04:04,04:06-05:04,5:06-6:04,06:06-07:04,07:06-08:04,08:06-09:04,09:06-10:04,10:06-11:04,11:06-12:04,12:06-13:04,13:06-14:04,14:06-15:04,15:06-16:04,16:06-17:04,17:06-18:04,18:06-19:04,19:06-20:04,20:06-21:04,21:06-22:04,22:06-23:04,23:06-24:00
    sunday          00:00-00:04,00:06-01:04,01:06-02:04,02:06-03:04,03:06-04:04,04:06-05:04,5:06-6:04,06:06-07:04,07:06-08:04,08:06-09:04,09:06-10:04,10:06-11:04,11:06-12:04,12:06-13:04,13:06-14:04,14:06-15:04,15:06-16:04,16:06-17:04,17:06-18:04,18:06-19:04,19:06-20:04,20:06-21:04,21:06-22:04,22:06-23:04,23:06-24:00
}
This was stopping services from being executed when they were expected. Thanks for your input @sni
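To see which minutes a definition like this leaves uncovered, here is a rough Python sketch. It is not Naemon's own timeperiod logic: it treats each range as start-inclusive and end-exclusive, which is an assumption, and Naemon's boundary handling may differ. The helper names (`build_spec`, `parse_ranges`, `in_timeperiod`, `excluded_minutes`) are made up for illustration, and `build_spec` regenerates the daily range string instead of pasting it.

```python
def build_spec():
    """Rebuild the daily range string used in the timeperiod above."""
    parts = ["00:00-00:04"]
    parts += [f"{h:02d}:06-{h + 1:02d}:04" for h in range(23)]
    parts.append("23:06-24:00")
    return ",".join(parts)

def parse_ranges(spec):
    """Parse 'HH:MM-HH:MM,...' into (start_minute, end_minute) pairs."""
    ranges = []
    for part in spec.split(","):
        start, end = part.strip().split("-")
        sh, sm = map(int, start.split(":"))  # int() also accepts '5:06'
        eh, em = map(int, end.split(":"))
        ranges.append((sh * 60 + sm, eh * 60 + em))
    return ranges

def in_timeperiod(ranges, hour, minute):
    """True if hour:minute is inside any range (end treated as exclusive)."""
    t = hour * 60 + minute
    return any(start <= t < end for start, end in ranges)

def excluded_minutes(ranges):
    """List every minute of the day that no range covers."""
    return [m for m in range(24 * 60) if not in_timeperiod(ranges, m // 60, m % 60)]

ranges = parse_ranges(build_spec())
gaps = excluded_minutes(ranges)
print(len(ranges), "ranges,", len(gaps), "uncovered minutes per day")
print("10:05 allowed?", in_timeperiod(ranges, 10, 5))
```

With these assumptions, minutes :04 and :05 of every hour fall outside the timeperiod, so a check whose scheduled time lands in that window is not allowed to run at that moment.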
Closing issue.
Glad to see it's working now.
I have the containerized version of Naemon pulled from Dockerhub running on a VM.
The Naemon instance holds 11k hosts, with their active checks turned off after their first check (it performs an initial host check, which is configured to always return OK, and then turns active checks off).
Each host has 2 services:
For some reason, out of the 22k services that are supposed to be checked, 65 are never checked. They are stuck in pending.
I waited an entire week and they never got checked.
I checked the Performance Info page in Thruk, and it says that 98.6% of services have been actively checked since program start.
The average latency is under 1 second, so I don’t believe this to be a performance issue.
The checks that are stuck pending have a “Next Scheduled Time”. When that time comes, the check doesn’t execute, but it gets a new “Next Scheduled Time” in the future.
Maybe interestingly, ONLY check #1 (the one that runs every hour) gets stuck in pending.
If I manually force a scheduled check, it works fine (the check runs once) but it still will not be executed on its scheduled time.
The debug.log has no entries about these “ghost” (that’s what I call them) services.
It’s like Naemon is choosing to just ignore them. Not sure what to do here.
Any ideas?