naemon / naemon-core

Networks, Applications and Event Monitor
http://www.naemon.io/
GNU General Public License v2.0

Naemon fake state change events if the host is down #287

Closed: ibering closed this issue 5 years ago

ibering commented 5 years ago

Today my colleagues discovered a weird behavior with the "fake" hard state change for services.

What's the issue with this?

This manipulates (and breaks) the data of nebstruct_statechange_data.

| Time | Check # | State | StateType | State Change | Notes |
|------|---------|-------|-----------|--------------|-------|
| 0 | 1/5 | DOWN | SOFT | YES | Host changes from Up HARD to Down SOFT - OK |
| 1 | 1/3 | Critical | SOFT | YES | Service changes from Ok HARD to Critical SOFT - OK |
| 2 | 2/3 | Unknown | HARD | YES | Service changes from Critical SOFT to Unknown HARD; last_hard_state gets set to 3 - ERROR: this should be a SOFT state, and last_hard_state should still be 0 |

[Screenshot: 2019-03-29 12:04:41]

Expected behavior

| Time | Check # | State | StateType | State Change | Notes |
|------|---------|-------|-----------|--------------|-------|
| 0 | 1 | DOWN | SOFT | YES | Host changes from Up HARD to Down SOFT |
| 1 | 1/3 | Critical | SOFT | YES | Service changes from Ok HARD to Critical SOFT |
| 2 | 2/3 | Unknown | SOFT | YES | Service changes from Critical SOFT to Unknown SOFT; set a SOFT state, don't touch last_hard_state |

This only happens if the host is in a non-Up state.

If the host is Up, last_hard_state works as expected:

[Screenshot: 2019-03-29 12:04:47]

I guess this is the code which is causing the issue: https://github.com/naemon/naemon-core/blob/1b66a24c4b7753e5d0bafded434cb1c0ca026457/src/naemon/checks_service.c#L834-L857

This looks wrong to me: `temp_service->last_hard_state = temp_service->current_state;`
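
To make the reported behavior concrete, here is a minimal, self-contained model of the logic in question. This is not naemon's actual code; only the quoted assignment is taken from the source, and the surrounding control flow is a simplified assumption based on the tables above (attempt counting and most real logic are omitted).

```c
#include <stdio.h>

enum { STATE_OK = 0, STATE_CRITICAL = 2, STATE_UNKNOWN = 3 };
enum { SOFT_STATE = 0, HARD_STATE = 1 };
enum { HOST_UP = 0, HOST_DOWN = 1 };

struct service {
    int current_state;
    int last_hard_state;
    int state_type;
};

/* Simplified model of the check-processing step in question. */
static void process_service_result(struct service *svc, int host_state,
                                   int new_state)
{
    svc->current_state = new_state;
    svc->state_type = SOFT_STATE; /* a non-OK result starts out soft */

    if (host_state != HOST_UP) {
        /* The "fake" hard state change: the service is forced hard... */
        svc->state_type = HARD_STATE;
        /* ...and last_hard_state is overwritten with the current state
         * (the quoted line from checks_service.c). */
        svc->last_hard_state = svc->current_state;
    }
}

int main(void)
{
    struct service svc = { STATE_OK, STATE_OK, HARD_STATE };

    process_service_result(&svc, HOST_DOWN, STATE_CRITICAL);
    process_service_result(&svc, HOST_DOWN, STATE_UNKNOWN);

    /* Prints last_hard_state=3 (Unknown), although no Unknown state ever
     * completed its max_check_attempts; the expected value is 0 (Ok). */
    printf("state_type=%d last_hard_state=%d\n",
           svc.state_type, svc.last_hard_state);
    return 0;
}
```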

jacobbaungard commented 5 years ago

This is basically by design. The service status is changed to HARD if the associated host is (iirc) in hard down.

The reason for this is to avoid notification storms. A service which goes to hard down does not send out notifications if the host is also down. Imagine a host with 100 services, and, by some weird coincidence, let's say that all of the services end up in a hard down state before the host does. Then instead of one notification event saying the host is down, you would get 100 service down notifications. I don't think anyone wants that.

Of course it is unlikely that all services go to hard down before the host, but it is not at all unlikely that quite a lot of services would be in hard down before the host gets into hard down, resulting in a lot of notifications.
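
As a hedged sketch of that rationale (reusing the constants from the model above; this is not naemon's actual notification code): while the host is down, the single host alert stands in for every per-service alert.

```c
/* Sketch only: the real naemon notification logic is far more involved. */
static int should_notify_for_service(int host_state, int service_state_type)
{
    if (host_state != HOST_UP)
        return 0; /* suppressed: the host notification covers it */
    return service_state_type == HARD_STATE; /* alert on hard problems only */
}
```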

ibering commented 5 years ago

> This is basically by design. The service status is changed to HARD if the associated host is (iirc) in hard down.

Please correct me if I'm wrong, but I can't see a check for whether the host is in a hard state.

Here is the snippet from naemon.log:

```
[1553854390] HOST ALERT: localhost TEST;DOWN;SOFT;1;Test alert
[1553854427] SERVICE INFO: localhost TEST;Ping; Service switch to hard down state due to host down.
[1553854427] SERVICE ALERT: localhost TEST;Ping;UNKNOWN;HARD;2;Test alert
[1553854427] HOST ALERT: localhost TEST;UP;SOFT;2;OK - 127.0.0.1: rta 0.037ms, lost 0%
```

The log reports that the host is Down SOFT, and my service switches to a hard down state due to the host being down.

jacobbaungard commented 5 years ago

Yes, I wasn't sure whether the switch was done when the host was soft or hard. Although, thinking about it further, it makes sense even when the host is in a soft state. This protects against the scenario where both the host and the service are at 2/3 soft, but the service check happens first (which would then cause a notification event if we only switched the service to hard down when the host reached a hard state). Unless the notification logic doesn't send notifications if the host is in soft down.

ibering commented 5 years ago

> This protects against the scenario where both the host and the service are at 2/3 soft, but the service check happens first (which would then cause a notification event if we only switched the service to hard down when the host reached a hard state).

I see what you are trying to achieve... However, why does last_hard_state get changed to the current state? Why is this value touched at all?

Logging a notice and suppressing the notification should be enough. The current behavior makes it mostly impossible to calculate availability reports, because last_hard_state was Ok and suddenly last_hard_state is Unknown, a state that never really occurred.
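
To illustrate the reporting concern (an assumed, simplified availability calculation, not any specific reporting tool): if downtime is accumulated from hard-state intervals, a fabricated Ok-to-Unknown hard transition charges outage time to a state that never actually occurred.

```c
#include <stddef.h>
#include <time.h>

struct hard_state_event {
    time_t when;
    int hard_state; /* e.g. collected from state-change broker events */
};

/* Sum the seconds spent in `state` between consecutive hard-state events.
 * A fake hard Unknown event inserts an interval that never happened. */
static long seconds_in_state(const struct hard_state_event *ev, size_t n,
                             int state)
{
    long total = 0;
    for (size_t i = 0; i + 1 < n; i++)
        if (ev[i].hard_state == state)
            total += (long)(ev[i + 1].when - ev[i].when);
    return total;
}
```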

ibering commented 5 years ago

Any news on this? My database contains over 13 million potentially wrong records. :(

jacobbaungard commented 5 years ago

I have not really had time to look into this in any great detail, but in my opinion last_hard_state should correspond to the last hard state, even if the service state is forced into a hard state due to the host being down. As far as I can see from the table you included in the issue report, that is the case here?

The service returned a non-OK state while its host was in either a down or unreachable state, and as such (and according to the documentation) the service state was changed to hard Unknown, and last_hard_state changed accordingly?

Not really sure why the state changed from Critical to Unknown. Hopefully that is according to what the plugin actually returned during that service check.

nook24 commented 5 years ago

The problem is with this line: https://github.com/naemon/naemon-core/blob/1b66a24c4b7753e5d0bafded434cb1c0ca026457/src/naemon/checks_service.c#L852

Please don't overwrite last_hard_state with the current state; this makes no sense, because last_hard_state is not current_state. It is OK that the service switches to a hard state, but you should definitely keep last_hard_state as it is.

[Screenshot: 2019-04-10 10:09:31]

I guess you just need to remove this line and the problem should be gone...
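
Paraphrased, the proposed change would look like this (a sketch against the linked block, using the variable names from the source, not an actual patch):

```c
/* Keep the forced hard state type so notifications stay suppressed... */
temp_service->state_type = HARD_STATE;
/* ...but leave last_hard_state untouched; the removed line was:
 * temp_service->last_hard_state = temp_service->current_state; */
```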

jacobbaungard commented 5 years ago

I really don't see the logic behind having a field called last_hard_state which is not equal to the current state, assuming the current state type is in fact hard.

The code you link to changes the current state_type to hard, and therefore the last_hard_state is equal to the current state? That seems super sensible to me.

jacobbaungard commented 5 years ago

Perhaps what you want is something like previous_hard_state, which always holds the last hard state, not including the current state?

nook24 commented 5 years ago

What is the difference between previous_hard_state and last_hard_state ?

I mean, `struct service` has:

```c
int    current_state;
int    last_state;
int    last_hard_state;
int    state_type;
```

What is the job of last_hard_state? last_hard_state cannot be equal to current_state, because it is in the past?

I'm sorry, but I must be missing something here.

(And objects_service.h mixes up \t and spaces.)

jacobbaungard commented 5 years ago

Probably "last" should really be "latest", if we are looking at the English definition of the word.

last_hard_state is different from current_state when the state_type is soft. Having a parameter that behaves like this is useful in its own right, and I am pretty hesitant to change the logic of it.

I am all for implementing something like previous_hard_state, which would behave differently from last_hard_state in that it doesn't take the current_state into account, even if the current_state is hard.
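
A sketch of those proposed semantics (previous_hard_state is hypothetical here, suggested in this thread rather than an existing naemon field):

```c
struct service_states {
    int current_state;
    int last_hard_state;     /* latest hard state; may equal current_state */
    int previous_hard_state; /* the hard state before the latest hard change */
};

/* On a hard state change, shift the old hard state into
 * previous_hard_state before updating last_hard_state. */
static void record_hard_state_change(struct service_states *s, int new_state)
{
    s->previous_hard_state = s->last_hard_state;
    s->last_hard_state = new_state;
    s->current_state = new_state;
}
```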

sni commented 5 years ago

Nothing changed at that point; last_hard_state has always been set to current_state as soon as there was a hard state change. The only new thing is that there is a fake hard state change if the host goes down.

nook24 commented 5 years ago

As long as the host is Up, last_hard_state works as expected: it holds the previous hard state. I tested this using NEBCALLBACK_STATE_CHANGE_DATA. Usually, in NEBCALLBACK_STATE_CHANGE_DATA, last_hard_state == previous_hard_state.

If the host is down and the service gets forced into a hard state by the fake hard state code, the behavior slightly differs. In the NEBCALLBACK_STATE_CHANGE_DATA event that gets fired by this fake hard state change, last_hard_state gets set to current_state, so last_hard_state == current_state.

sni commented 5 years ago

The name is misleading; last_hard_state always contains the current/latest hard state.
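
In other words (a sketch of the semantics described, using the variable names from the linked source, not the actual code):

```c
/* On every hard state change, last_hard_state is refreshed to the current
 * state; it only differs from current_state during a soft period. */
if (temp_service->state_type == HARD_STATE)
    temp_service->last_hard_state = temp_service->current_state;
```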

nook24 commented 5 years ago

Maybe I found the issue. I will run some tests and send a PR if so.

jacobbaungard commented 5 years ago

Fixed by https://github.com/naemon/naemon-core/pull/291