No soft status for host checks

smetj commented 7 years ago

Hi all,

It seems that in our setup (Naemon 1.0.6) host checks go directly to a hard state instead of going through the configured soft states first. The relevant host configuration is:

    max_check_attempts              3
    check_interval                  1
    retry_interval                  1

Despite that, after the first failing ping check it directly goes into HARD DOWN status. Viewing the host with Thruk shows the problem:

(screenshot: selection_070)

I found another user reporting the same behavior, albeit for Nagios/Centreon: https://forum.centreon.com/forum/centreon-use/nagios/11759-after-1-soft-state-nagios-change-the-host-state-directy-to-hard-state-help

Can anyone else replicate this problem?

ricardomaraschini commented 7 years ago

I am guessing here, but I think I have seen this behavior before on Nagios. Every time a service check returns CRITICAL, a host check is run before alerting on the service problem, to make sure the whole host is not down. Maybe these checks (triggered by a failing service) are not being counted as host checks? Someone else can probably answer your question more accurately.

sni commented 7 years ago

Just a side note, but just because it shows 1/3 hard state doesn't necessarily mean there haven't been rechecks. Check the logs to be sure.

dirtyren commented 7 years ago

This is normal behavior:

    HOST 1/3 SOFT
    HOST 2/3 SOFT
    HOST 1/3 HARD

We've changed our interface to show 3/3 when HARD, because our customers would always ask us about this 1/3 HARD, thinking the 3 max check attempts weren't done prior to the HARD state.

[]s.

marcomusso commented 7 years ago

I just checked today, and after submitting a passive result the host status immediately turns CRITICAL, even with 5m/3 attempts every 1m. For a service I need to submit more than one CRITICAL result...

nook24 commented 7 years ago

    # PASSIVE HOST CHECKS ARE SOFT OPTION
    # This determines whether or not Naemon will treat passive host
    # checks as being HARD or SOFT.  By default, a passive host check
    # result will put a host into a HARD state type.  This can be changed
    # by enabling this option.
    # Values: 0 = passive checks are HARD, 1 = passive checks are SOFT

    passive_host_checks_are_soft=0

Have you set this to 1 in your naemon.cfg?
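For anyone who wants to check which value is actually in the file, here is a minimal Python sketch. It assumes the plain `key=value` layout with `#` comments shown above; the helper names `read_main_config` and `passive_host_checks_are_soft` and the sample text are made up for illustration and are not part of Naemon.

```python
def read_main_config(text):
    """Parse naemon.cfg-style text into a dict of key -> value strings."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '='
            settings[key.strip()] = value.strip()
    return settings


def passive_host_checks_are_soft(settings):
    """Return True if passive host results are treated as SOFT."""
    # Per the config comment quoted above, the default (0) means HARD.
    return settings.get("passive_host_checks_are_soft", "0") == "1"


# Hypothetical sample text; in practice you would read your own naemon.cfg.
sample = """
# PASSIVE HOST CHECKS ARE SOFT OPTION
passive_host_checks_are_soft=0
accept_passive_host_checks=1
"""
print(passive_host_checks_are_soft(read_main_config(sample)))  # False
```

With the value at 0, a single passive DOWN result is expected to put the host straight into a HARD state, regardless of max_check_attempts.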

marcomusso commented 7 years ago

    log_passive_checks=1
    accept_passive_service_checks=1
    accept_passive_host_checks=1
    translate_passive_host_checks=0
    passive_host_checks_are_soft=0

dirtyren commented 7 years ago

All hosts that are HARD DOWN show current_attempt 1:

    current_state=1 current_attempt=1 max_attempts=5 state_type=1

For services, on the other hand, current_attempt equals max_attempts when the service is HARD CRITICAL:

    current_state=2 current_attempt=5 max_attempts=5 state_type=1

Maybe this behavior could be changed so that the host current_attempt equals max_attempts when the host is HARD DOWN.
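On the display side, the idea could look roughly like the Python sketch below. This is only an illustration of the rule described above, not code from Naemon or any frontend; the function name `display_attempt` and the constants are hypothetical, and it assumes state_type=1 means HARD and current_state=0 means UP, as in the status values quoted above.

```python
STATE_TYPE_HARD = 1  # assumption: state_type=1 is HARD, as in the status lines above
HOST_UP = 0          # assumption: current_state=0 is UP


def display_attempt(current_state, state_type, current_attempt, max_attempts):
    """Return the 'attempt/max' string to show for a host."""
    # A host in a HARD problem state has used up all its attempts,
    # so show max/max instead of the counter that was reset to 1.
    if state_type == STATE_TYPE_HARD and current_state != HOST_UP:
        return f"{max_attempts}/{max_attempts}"
    return f"{current_attempt}/{max_attempts}"


# The HARD DOWN host from the status values above.
print(display_attempt(current_state=1, state_type=1, current_attempt=1, max_attempts=5))  # 5/5
```

A SOFT state or an UP host is left untouched, so only the confusing 1/N HARD case changes.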

[]s.

nook24 commented 7 years ago

I guess the following is happening, but I haven't checked the code...

The host goes down at 3/3: (screenshot: down_3_3)

Because of the hard down state, Naemon now switches back to the check_interval with the next check, and you are at 1/3 again: (screenshot: down_1_3)

This is the resulting check history: (screenshot: hostchecks)

And my log file: (screenshot: log)

status.dat is now also back to current_attempt=1: (screenshot: status_dat)

I created this test host on a demo box if you want to verify: https://demo.statusengine.org/#!/nodedetails/this%2520host%2520is%2520down

Sorry for the long post

nook24 commented 7 years ago

Just to add to this: as @dirtyren already mentioned, services behave a bit differently. They stay at 3/3 hard critical: (screenshot: service_critical)

status.dat:

    servicestatus {
        host_name=this host is up
        service_description=Dummy
        current_state=2
        last_hard_state=2
        current_attempt=3
        max_attempts=3
        state_type=1
    }
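To compare hosts and services without clicking through a UI, the blocks can be read straight from status.dat. The Python sketch below assumes only the `name { key=value ... }` layout visible in the excerpt above; `parse_status_blocks` is a made-up helper name and the sample is the block from this comment.

```python
def parse_status_blocks(text):
    """Yield (block_type, fields) for each 'name { key=value ... }' block."""
    block_type = None
    fields = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if block_type is None:
            # A block starts with a line such as 'servicestatus {'.
            if line.endswith("{"):
                block_type = line[:-1].strip()
                fields = {}
        elif line == "}":
            yield block_type, fields
            block_type = None
        elif "=" in line:
            key, _, value = line.partition("=")
            fields[key] = value


sample = """
servicestatus {
        host_name=this host is up
        service_description=Dummy
        current_state=2
        last_hard_state=2
        current_attempt=3
        max_attempts=3
        state_type=1
        }
"""
for block_type, fields in parse_status_blocks(sample):
    print(block_type, fields["current_attempt"] + "/" + fields["max_attempts"])
# prints: servicestatus 3/3
```

Running the same loop over a hoststatus block for a host in HARD DOWN should show the 1/N value discussed in this thread.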

sni commented 7 years ago

We could easily override this in Thruk to display 3/3 instead if it confuses users. Also, max_attempts was never a good way to delay notifications; use first_notification_delay for that.

smetj commented 7 years ago

OK, the passive_host_checks_are_soft setting did indeed botch the test I ran.

However, as far as I can tell, at least with our setup, I can still see an inconsistency using only normally scheduled active checks. The host configuration has:

    max_check_attempts              3
    check_interval                  1
    retry_interval                  1

The logs show the following:

    [1502701556] HOST EVENT HANDLER: soft-host-test;DOWN;SOFT;1;eventhandler:host.down
    [1502701617] HOST ALERT: soft-host-test;DOWN;HARD;3;CRITICAL - 10.66.140.83: rta nan, lost 100%
    [1502701617] HOST NOTIFICATION: everyone;soft-host-test;DOWN;alert:host;CRITICAL - 10.66.140.83: rta nan, lost 100%
    [1502701617] HOST EVENT HANDLER: soft-host-test;DOWN;HARD;3;eventhandler:host.down

That's 1 minute between the first soft state (reported by the HOST EVENT HANDLER) and the hard state (HOST ALERT and the host notification). From how I understand the configuration, there should be at least 2 minutes between the first soft state and the hard state (the outgoing alert).
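The arithmetic behind that expectation can be written down as a small Python sketch. It assumes that check and retry intervals are counted in units of 60 seconds (the `interval_length` parameter here is my assumption about the setup), and the helper names `parse_log_line` and `expected_soft_to_hard_seconds` are made up for this example.

```python
import re

# Matches lines such as '[1502701556] HOST ALERT: name;DOWN;HARD;3;output'.
LOG_LINE = re.compile(r"^\[(\d+)\] ([A-Z ]+): (.*)$")


def parse_log_line(line):
    """Return (timestamp, event_type, fields) or None if the line does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    timestamp, event_type, payload = match.groups()
    return int(timestamp), event_type, payload.split(";")


def expected_soft_to_hard_seconds(max_check_attempts, retry_interval, interval_length=60):
    """Seconds from attempt 1 to the final attempt if every retry waits retry_interval."""
    # interval_length=60 is an assumption: one interval unit equals one minute.
    return (max_check_attempts - 1) * retry_interval * interval_length


first_soft = parse_log_line(
    "[1502701556] HOST EVENT HANDLER: soft-host-test;DOWN;SOFT;1;eventhandler:host.down")
hard_alert = parse_log_line(
    "[1502701617] HOST ALERT: soft-host-test;DOWN;HARD;3;CRITICAL - 10.66.140.83: rta nan, lost 100%")

observed = hard_alert[0] - first_soft[0]
expected = expected_soft_to_hard_seconds(max_check_attempts=3, retry_interval=1)
print(observed, expected)  # 61 120
```

So the log shows about half of the expected delay between attempt 1 and attempt 3.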

dirtyren commented 7 years ago

If we are going to override Thruk to display 3/3, why not patch Naemon to make current_attempt = max_attempts when a host is DOWN HARD?

sni commented 7 years ago

@smetj I guess this is due to the forced on-demand host checks. Whenever a service fails, Naemon schedules an immediate on-demand host check to verify the host state. That's why I prefer first_notification_delay.

@dirtyren It would be cleaner in Naemon itself, but I have no idea whether that has any side effects.

sni commented 7 years ago

Just pushed a change to Thruk: https://github.com/sni/Thruk/commit/e40d3315b8fc92b78718d5edfec003b3ea3c9974 So at least it should look less confusing there.

smetj commented 7 years ago

Thanks @sni for the change, that will help.

As for the "forced on-demand host checks", I don't think that is what is happening, because my test host has no services associated with it. It also has no parents (or children) defined.

smetj commented 7 years ago

Another check I did on my side is a continuous LiveStatus query, which returns the following information:

Thu Aug 17 11:17:33 CEST 2017
name;state;state_type;hard_state;current_attempt
soft-host-test;0;1;0;1

Thu Aug 17 11:17:34 CEST 2017
name;state;state_type;hard_state;current_attempt
soft-host-test;1;0;0;1

Thu Aug 17 11:18:34 CEST 2017
name;state;state_type;hard_state;current_attempt
soft-host-test;1;0;0;2

Thu Aug 17 11:18:35 CEST 2017
name;state;state_type;hard_state;current_attempt
soft-host-test;1;1;1;3

You can see that current_attempt 3 follows current_attempt 2 almost immediately (one second later). I'd expect there to be a minute in between, just like going from current_attempt 1 to current_attempt 2.
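The gaps can be computed from the samples above with a short Python sketch. The `attempt_gaps` and `clock_time` helpers are made up for this example; the sketch assumes all samples are from the same day and only reads the HH:MM:SS part of each timestamp. Keep in mind the timestamps are when the query ran, not when the check itself executed.

```python
from datetime import datetime

COLUMNS = "name;state;state_type;hard_state;current_attempt".split(";")

# The LiveStatus samples from this comment, as (timestamp, row) pairs.
samples = [
    ("Thu Aug 17 11:17:33 CEST 2017", "soft-host-test;0;1;0;1"),
    ("Thu Aug 17 11:17:34 CEST 2017", "soft-host-test;1;0;0;1"),
    ("Thu Aug 17 11:18:34 CEST 2017", "soft-host-test;1;0;0;2"),
    ("Thu Aug 17 11:18:35 CEST 2017", "soft-host-test;1;1;1;3"),
]


def clock_time(stamp):
    """Parse only the HH:MM:SS token, which avoids locale and timezone handling."""
    return datetime.strptime(stamp.split()[3], "%H:%M:%S")


def attempt_gaps(samples):
    """Return (from_attempt, to_attempt, seconds) for each attempt change while not UP."""
    gaps = []
    previous = None  # (time, attempt) when the current attempt was first seen
    for stamp, row in samples:
        record = dict(zip(COLUMNS, row.split(";")))
        if record["state"] == "0":
            continue  # host is UP, nothing to count
        when = clock_time(stamp)
        attempt = int(record["current_attempt"])
        if previous is None:
            previous = (when, attempt)
        elif attempt != previous[1]:
            seconds = int((when - previous[0]).total_seconds())
            gaps.append((previous[1], attempt, seconds))
            previous = (when, attempt)
    return gaps


print(attempt_gaps(samples))  # [(1, 2, 60), (2, 3, 1)]
```

The first gap matches retry_interval, while the second one is a single second.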

Again, no services are associated with this host, and only active checks are used, with the following settings:

    max_check_attempts              3
    check_interval                  1
    retry_interval                  1

VladimirBilik commented 6 years ago

Hi, I've just noticed strange counting of host SOFT states in our Naemon 1.0.6. I think the behaviour is similar to the one reported here. I have a host with:

    check_interval      5
    retry_interval      5
    max_check_attempts  12

The host has no parent and one service defined. According to the log, the host soft state counter is increased by 2.

What could be wrong in my config? Even if I force a check, the counter is still increased by 2, which effectively halves the time it takes to reach the hard state. This behaviour affects only host checks; service checks are OK.

sni commented 5 years ago

Turns out this was an issue in the mod-gearman NEB module: https://github.com/sni/mod_gearman/commit/0cd0af310f1b2b5dd1fd83068c70400fa2d18b48 It should be fixed with the next release.

VladimirBilik commented 5 years ago

After the fix, the problem with the counter being increased twice disappeared. Thanks.

naemon / naemon-core

No soft status for host checks #198