Closed smetj closed 5 years ago
I am guessing here, but I think I have seen this behavior before on Nagios. Every time a service check returns a critical, a host check is done before alerting the service problem, to make sure the whole host is not down. Maybe these checks (coming from a failing service) are not being counted as host checks? Someone will probably answer your question more properly.
Just a side note, but just because it shows 1/3 hard state
doesn't necessarily mean there haven't been rechecks. Check the logs to be sure.
This is normal behavior:
HOST 1/3 SOFT → HOST 2/3 SOFT → HOST 1/3 HARD
We've changed our interface to show 3/3 when HARD, because our customers would always ask us about this 1/3 HARD, thinking the 3 max check attempts weren't done prior to the HARD state.
Regards.
I just checked today, and by submitting a passive result the host status immediately turns into CRITICAL, even with 5m/3 attempts every 1m. For a service I need to submit more than one CRITICAL result...
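For reference, a minimal sketch of how such a passive host result can be submitted via the external command pipe. The host name is hypothetical and the pipe path is an assumption; check the command_file setting in your naemon.cfg:

```shell
# Build a PROCESS_HOST_CHECK_RESULT external command line.
# Host status codes: 0 = UP, 1 = DOWN, 2 = UNREACHABLE.
cmd=$(printf '[%d] PROCESS_HOST_CHECK_RESULT;soft-host-test;1;simulated outage' "$(date +%s)")
echo "$cmd"   # inspect the command line before sending it
# Then write it to the command pipe (path is an assumption):
# echo "$cmd" > /var/lib/naemon/naemon.cmd
```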
# PASSIVE HOST CHECKS ARE SOFT OPTION
# This determines whether or not Naemon will treat passive host
# checks as being HARD or SOFT. By default, a passive host check
# result will put a host into a HARD state type. This can be changed
# by enabling this option.
# Values: 0 = passive checks are HARD, 1 = passive checks are SOFT
passive_host_checks_are_soft=0
Have you set this to 1 in your naemon.cfg?
log_passive_checks=1
accept_passive_service_checks=1
accept_passive_host_checks=1
translate_passive_host_checks=0
passive_host_checks_are_soft=0
All hosts that are HARD DOWN show current_attempt=1: current_state=1 current_attempt=1 max_attempts=5 state_type=1
For services, on the other hand, current_attempt equals max_attempts when the service is HARD CRITICAL: current_state=2 current_attempt=5 max_attempts=5 state_type=1
Maybe this behavior for the host current_attempt could be changed to equal max_attempts when it is HARD DOWN.
Regards.
I guess the following is happening, but I didn't check the code...
The host is going down, 3/3.
Now Naemon switches back to the check_interval with the next check, because of the hard down state, and you are again at 1/3.
This is the result of the check history:
And my log file
Also, status.dat is now back to current_attempt=1.
I created this test host on a demo box if you want to verify: https://demo.statusengine.org/#!/nodedetails/this%2520host%2520is%2520down
Sorry for the long post
Just to add to this: as @dirtyren already mentioned, services behave a bit differently. They stay on 3/3 hard critical:
status.dat:
servicestatus {
host_name=this host is up
service_description=Dummy
current_state=2
last_hard_state=2
current_attempt=3
max_attempts=3
state_type=1
}
We could easily overrule this in Thruk to display 3/3 instead, if this confuses users. Also, max_attempts was never a good way to delay notifications; use first_notification_delay instead.
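To illustrate the suggestion, a hedged sketch of a host definition using first_notification_delay (host name, address, and template are hypothetical; the delay is expressed in time units, usually minutes):

```
define host {
    use                       generic-host   ; assumed template
    host_name                 some-host
    address                   192.0.2.10
    max_check_attempts        3
    check_interval            1
    retry_interval            1
    first_notification_delay  5    ; wait ~5 time units before the first notification
}
```

This delays the outgoing notification without inflating max_check_attempts, so the host still reaches its HARD state on the normal schedule.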
OK, the passive_host_checks_are_soft setting has indeed botched the test I did.
However, AFAIK, at least with our setup, I can still see an inconsistency using only normally scheduled active checks. The host configuration has:
max_check_attempts 3
check_interval 1
retry_interval 1
The logs show the following:
[1502701556] HOST EVENT HANDLER: soft-host-test;DOWN;SOFT;1;eventhandler:host.down
[1502701617] HOST ALERT: soft-host-test;DOWN;HARD;3;CRITICAL - 10.66.140.83: rta nan, lost 100%
[1502701617] HOST NOTIFICATION: everyone;soft-host-test;DOWN;alert:host;CRITICAL - 10.66.140.83: rta nan, lost 100%
[1502701617] HOST EVENT HANDLER: soft-host-test;DOWN;HARD;3;eventhandler:host.down
That's 1 minute between the 1st soft state (reported by the HOST EVENT HANDLER) and the host notification (HOST ALERT). From what I understand of how things are configured, there would have to be at least 2 minutes between the 1st soft state and the hard state (outgoing alert).
If we are to overrule Thruk to display 3/3, why not patch Naemon to make current_attempt = max_attempts when a host is DOWN HARD?
@smetj I guess this is due to the forced on-demand host checks. Whenever a service fails, Naemon schedules an immediate on-demand host check to check the host state. That's why I prefer first_notification_delay.
@dirtyren it would be cleaner in naemon itself, but no idea if that has any side effects.
Just pushed a change to Thruk: https://github.com/sni/Thruk/commit/e40d3315b8fc92b78718d5edfec003b3ea3c9974 So at least it should show up less confusing over there.
Thanks @sni for the change, that will help.
About the "forced on demand host checks": I don't think this is what is happening, because my test host has no services associated with it. It also does not have any parents (or children) defined.
Another check I have done on my side is a continuous LiveStatus query returning the following information:
Thu Aug 17 11:17:33 CEST 2017
name;state;state_type;hard_state;current_attempt
soft-host-test;0;1;0;1
Thu Aug 17 11:17:34 CEST 2017
name;state;state_type;hard_state;current_attempt
soft-host-test;1;0;0;1
Thu Aug 17 11:18:34 CEST 2017
name;state;state_type;hard_state;current_attempt
soft-host-test;1;0;0;2
Thu Aug 17 11:18:35 CEST 2017
name;state;state_type;hard_state;current_attempt
soft-host-test;1;1;1;3
You can see that current_attempt 3 follows current_attempt 2 immediately... I'd expect there to be a minute in between, just like going from current_attempt 1 to current_attempt 2.
Again, no services are associated with this host, and only active checks are used with the following settings:
max_check_attempts 3
check_interval 1
retry_interval 1
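For anyone who wants to reproduce the query above, a minimal sketch of the LiveStatus request (the socket path is an assumption from a typical Naemon setup; unixcat ships with the livestatus tooling, and `nc -U` works as well):

```shell
# LiveStatus query for the columns shown in the output above.
query='GET hosts
Columns: name state state_type hard_state current_attempt
Filter: name = soft-host-test
ColumnHeaders: on'
echo "$query"   # the raw query text
# Send it to the broker socket, e.g. (path is an assumption):
# echo "$query" | unixcat /var/cache/naemon/live
```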
Hi, I've just found out about strange counting of HOST SOFT states in our Naemon 1.0.6. I think the behaviour is similar to the one reported here. I have a host with
check_interval 5
retry_interval 5
max_check_attempts 12
The host has no parent and one service defined. According to the log, the host soft state counter is increased by 2:
What could be wrong in my config? Even if I force a check, the counter is still increased by 2, so it effectively halves the time to reach the hard state. This behaviour only affects host checks; service checks are OK.
Turns out this was an issue in the mod-gearman NEB module: https://github.com/sni/mod_gearman/commit/0cd0af310f1b2b5dd1fd83068c70400fa2d18b48 It should be fixed with the next release.
After the fix, the problem with the counter increasing by 2 disappeared, thanks.
Hi all,
It seems that in our setup (Naemon 1.0.6), host checks go directly to a hard state instead of going through the configured soft states first. The relevant host configuration is:
Despite that, after the first failing ping check the host goes directly into HARD DOWN status. Viewing the host with Thruk shows the problem:
I could find another user reporting the same albeit for Nagios/centreon: https://forum.centreon.com/forum/centreon-use/nagios/11759-after-1-soft-state-nagios-change-the-host-state-directy-to-hard-state-help
Can anyone else replicate this problem?