Closed: ibering closed this issue 5 years ago
This is by design, basically. The service status is changed to HARD if the associated host is (IIRC) in hard down.
The reason for this is to avoid notification storms. A service that goes to hard down does not send out notifications if the host is also down. Imagine a host with 100 services, and then, by some weird coincidence, let's say all the services end up in a hard down state before the host does. Then, instead of a single notification event saying the host is down, you would get 100 service-down notifications. I don't think anyone wants that.
Of course it is unlikely that all services go hard-down before the host, but it is not at all unlikely that quite a lot of services would be in hard down before the host gets into hard down, resulting in a lot of notifications.
This is by design, basically. The service status is changed to HARD if the associated host is (IIRC) in hard down.
Please correct me if I'm wrong, but I can't see a check if the host is in a hard state.
Here is the snippet from naemon.log:
[1553854390] HOST ALERT: localhost TEST;DOWN;SOFT;1;Test alert
[1553854427] SERVICE INFO: localhost TEST;Ping; Service switch to hard down state due to host down.
[1553854427] SERVICE ALERT: localhost TEST;Ping;UNKNOWN;HARD;2;Test alert
[1553854427] HOST ALERT: localhost TEST;UP;SOFT;2;OK - 127.0.0.1: rta 0.037ms, lost 0%
The log reports the host as DOWN/SOFT, and yet my service switched to a hard down state due to the host being down.
Yes, I wasn't sure whether the switch was done when the host was soft or hard. Although, thinking about it further, it makes sense even when the host is soft. This protects against the scenario where both the host and the service are at 2/3 soft, but the service check happens first (which would then cause a notification event if we only switched the service to hard-down once the host reached a hard state). Unless the notification logic doesn't send notifications while the host is in soft-down.
This protects against the scenario where both the host and the service are at 2/3 soft, but the service check happens first (which would then cause a notification event if we only switched the service to hard-down once the host reached a hard state).
I see what you're trying to achieve... However, why does last_hard_state get changed to the current state? Why is this value touched at all?
Logging a notice and suppressing the notification should be enough. The current behavior makes it mostly impossible to calculate availability reports, because last_hard_state was OK and suddenly last_hard_state is UNKNOWN, which never really occurred.
Any news on this? My database contains over 13 million potentially wrong records. :(
I have not really had time to look into this in any great detail, but in my opinion last_hard_state should correspond to the last hard state, even if the service state is forced into a hard state due to the host being down. As far as I can see from the table you included in the issue report, that is the case here? The service returned a non-OK state while its host was in either a down or an unreachable state, and as such (and according to the documentation) the service state was changed to hard UNKNOWN, and last_hard_state changed accordingly?
Not really sure why the state changed from critical to unknown. Hopefully that is according to what the plugin actually returned during that service check.
The problem is with this line: https://github.com/naemon/naemon-core/blob/1b66a24c4b7753e5d0bafded434cb1c0ca026457/src/naemon/checks_service.c#L852
Please don't overwrite last_hard_state with the current state; this makes no sense, because last_hard_state is not current_state. It is OK that the service switches to a hard state, but you should definitely keep last_hard_state as it is.
I guess you just need to remove this line and the problem should be gone...
I really don't see the logic behind having a field called last_hard_state which is not equal to the current state, assuming the current state type is in fact hard.
The code you link to changes the current state_type to hard, and therefore last_hard_state is equal to the current state? That seems super sensible to me.
Perhaps what you want is something like previous_hard_state, which always holds the last hard state but does not include the current state?
What is the difference between previous_hard_state and last_hard_state?
I mean, struct service has:
int current_state;
int last_state;
int last_hard_state;
int state_type;
What is the job of last_hard_state? last_hard_state cannot be equal to current_state, because it is in the past?
I'm sorry, but I need to be missing something here.
(And objects_service.h mixes up \t and spaces.)
Probably "last" should really be "latest", if we are looking at the English definition of the word.
last_hard_state is different from current_state when the state_type is soft. Having a parameter that behaves like this is useful in its own right, and I am pretty hesitant to change the logic of it.
I am all for implementing something like previous_hard_state, which would behave differently from last_hard_state in that it doesn't take the current_state into account, even if the current_state is hard.
Nothing changed at that point; last_hard_state has always been set to the current_state as soon as there was a hard state change. The only new thing is that there is now a fake hard state change if the host goes down.
As long as the host is UP, last_hard_state works as expected: it holds the previous hard state.
I tested this using NEBCALLBACK_STATE_CHANGE_DATA.
Usually, in NEBCALLBACK_STATE_CHANGE_DATA, last_hard_state == previous_hard_state. If the host is down and the service gets forced by the fake hard state code, the behavior differs slightly: in the NEBCALLBACK_STATE_CHANGE_DATA event fired by this fake hard state change, last_hard_state gets set to current_state, so last_hard_state == current_state.
The name is misleading: last_hard_state always contains the current/latest hard state.
Maybe I found the issue. I will run some tests and send a PR if so.
Today my colleagues discovered a weird behavior with a "fake" hard state change for services.
What's the issue with this?
This manipulates (and breaks) the data of nebstruct_statechange_data: last_hard_state gets set to 3, but this should be a SOFT state and last_hard_state should still be 0.
Expected behavior
This only happens if the host is in a non-UP state. If the host is UP, last_hard_state works as expected:
I guess this is the code which is causing the issue: https://github.com/naemon/naemon-core/blob/1b66a24c4b7753e5d0bafded434cb1c0ca026457/src/naemon/checks_service.c#L834-L857
This looks wrong to me:
temp_service->last_hard_state = temp_service->current_state;