Problem in the SOFT HARD check logic

dirtyren commented 3 years ago

Hello,

I found the problem below. The host went down and naemon set the service to CRITICAL HARD, but when the host came back UP, naemon set the service to OK SOFT. This broke some availability reports that depend on HARD states for their calculations. The question is: should the service not be set to OK HARD when the host comes back up?

Tks.

[Fri Jul 23 03:39:31 2021] INITIAL SERVICE STATE: HOSTDEMO;SVCDEMO;OK;HARD;1;OK
[Fri Jul 23 21:41:11 2021] HOST ALERT: HOSTDEMO;DOWN;SOFT;1;CRITICAL - 192.168.54.32: rta nan, lost 100%
[Fri Jul 23 21:41:21 2021] HOST ALERT: HOSTDEMO;DOWN;SOFT;2;CRITICAL - 192.168.54.32: rta nan, lost 100%
[Fri Jul 23 21:41:37 2021] HOST ALERT: HOSTDEMO;DOWN;HARD;3;CRITICAL - 192.168.54.32: rta nan, lost 100%
[Fri Jul 23 21:42:57 2021] SERVICE INFO: HOSTDEMO;SVCDEMO; Service switch to hard down state due to host down.
[Fri Jul 23 21:42:57 2021] SERVICE ALERT: HOSTDEMO;SVCDEMO;CRITICAL;HARD;1;CRITICAL - cannot connect
[Fri Jul 23 21:46:57 2021] HOST ALERT: HOSTDEMO;UP;HARD;1;OK - 192.168.54.32: , rta 0.259ms, lost 0%
[Fri Jul 23 21:47:17 2021] SERVICE ALERT: HOSTDEMO;SVCDEMO;CRITICAL;SOFT;1;CRITICAL - cannot connect
[Fri Jul 23 21:49:17 2021] SERVICE ALERT: HOSTDEMO;SVCDEMO;CRITICAL;SOFT;2;CRITICAL - cannot connect
[Fri Jul 23 21:51:18 2021] SERVICE ALERT: HOSTDEMO;SVCDEMO;OK;SOFT;3;OK
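
To make the sequence easier to follow, here is a minimal Python sketch that pulls the state, state type and attempt number out of the SERVICE ALERT lines above. It assumes the semicolon-separated layout shown in this excerpt (host;service;state;state type;attempt;output) and keeps the timestamp as plain text. The names `parse_service_alert` and `ServiceAlert` are made up for this illustration and are not part of naemon.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout, taken from the excerpt above:
# [timestamp] SERVICE ALERT: host;service;state;state_type;attempt;plugin output
SERVICE_ALERT = re.compile(
    r"^\[(?P<ts>[^\]]+)\] SERVICE ALERT: "
    r"(?P<host>[^;]*);(?P<service>[^;]*);(?P<state>[^;]*);"
    r"(?P<state_type>[^;]*);(?P<attempt>\d+);(?P<output>.*)$"
)


class ServiceAlert(NamedTuple):
    timestamp: str  # kept as text, not converted to a datetime
    host: str
    service: str
    state: str
    state_type: str
    attempt: int
    output: str


def parse_service_alert(line: str) -> Optional[ServiceAlert]:
    """Return the parsed alert, or None for any other kind of log line."""
    m = SERVICE_ALERT.match(line.strip())
    if m is None:
        return None
    return ServiceAlert(
        m["ts"], m["host"], m["service"], m["state"],
        m["state_type"], int(m["attempt"]), m["output"],
    )


LOG = """
[Fri Jul 23 21:41:37 2021] HOST ALERT: HOSTDEMO;DOWN;HARD;3;CRITICAL - 192.168.54.32: rta nan, lost 100%
[Fri Jul 23 21:42:57 2021] SERVICE INFO: HOSTDEMO;SVCDEMO; Service switch to hard down state due to host down.
[Fri Jul 23 21:42:57 2021] SERVICE ALERT: HOSTDEMO;SVCDEMO;CRITICAL;HARD;1;CRITICAL - cannot connect
[Fri Jul 23 21:46:57 2021] HOST ALERT: HOSTDEMO;UP;HARD;1;OK - 192.168.54.32: , rta 0.259ms, lost 0%
[Fri Jul 23 21:47:17 2021] SERVICE ALERT: HOSTDEMO;SVCDEMO;CRITICAL;SOFT;1;CRITICAL - cannot connect
[Fri Jul 23 21:49:17 2021] SERVICE ALERT: HOSTDEMO;SVCDEMO;CRITICAL;SOFT;2;CRITICAL - cannot connect
[Fri Jul 23 21:51:18 2021] SERVICE ALERT: HOSTDEMO;SVCDEMO;OK;SOFT;3;OK
"""

# Keep only the SERVICE ALERT entries and reduce them to (state, type, attempt).
alerts = [a for a in map(parse_service_alert, LOG.splitlines()) if a is not None]
transitions = [(a.state, a.state_type, a.attempt) for a in alerts]
print(transitions)
```

For this excerpt the service goes CRITICAL/HARD/1, CRITICAL/SOFT/1, CRITICAL/SOFT/2, OK/SOFT/3, so the last HARD entry a report can see is still CRITICAL.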

dirtyren commented 3 years ago

I also saw another behavior: naemon did not generate a state change to OK for the service, but the INITIAL SERVICE STATE entry changed to OK, like this:

[Thu Jun 17 18:43:01 2021] SERVICE INFO: PABX;Port_8443; Service switch to hard down state due to host down.
[Thu Jun 17 18:43:01 2021] SERVICE ALERT: PABX;Port_8443;CRITICAL;HARD;1;CRITICAL - Socket timeout after 10 seconds
[Thu Jun 17 18:50:21 2021] HOST ALERT: PABX;UP;HARD;1;OK - x.x.x.x: , rta 0.446ms, lost 0%
[Thu Jun 17 18:59:35 2021] INITIAL HOST STATE: PABX;UP;HARD;1;OK - x.x.x.x: , rta 0.234ms, lost 0%
[Thu Jun 17 18:59:35 2021] INITIAL SERVICE STATE: PABX;Port_8443;OK;HARD;1;TCP OK - 0.000 second response time on x.x.x.x on port 8443

Note the plugin output: while the service was CRITICAL it was "CRITICAL - Socket timeout after 10 seconds". When naemon was restarted, the plugin output had changed to the OK result, but no SERVICE ALERT for the OK HARD state was generated. The host came back UP 9 minutes before naemon was restarted, and in that time no SERVICE ALERT with an OK state was generated for the service.
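
One way to spot this second case in a log is to compare each INITIAL SERVICE STATE entry with the last SERVICE ALERT logged for the same service. The sketch below is only an illustration under the same assumed line layout as the excerpt above; `find_unlogged_changes` is a hypothetical helper, not something naemon provides, and a mismatch can also have other causes.

```python
import re

# Only the two entry kinds that carry a service state are of interest here.
ENTRY = re.compile(
    r"^\[[^\]]+\] (?P<kind>SERVICE ALERT|INITIAL SERVICE STATE): "
    r"(?P<host>[^;]*);(?P<service>[^;]*);(?P<state>[^;]*);(?P<state_type>[^;]*);"
)


def find_unlogged_changes(lines):
    """List services whose INITIAL SERVICE STATE differs from their last SERVICE ALERT."""
    last_alert = {}  # (host, service) -> (state, state_type) of the latest alert
    findings = []
    for line in lines:
        m = ENTRY.match(line.strip())
        if m is None:
            continue
        key = (m["host"], m["service"])
        current = (m["state"], m["state_type"])
        if m["kind"] == "SERVICE ALERT":
            last_alert[key] = current
        elif key in last_alert and last_alert[key] != current:
            # The state changed at some point without a matching alert entry.
            findings.append((key, last_alert[key], current))
            last_alert[key] = current
    return findings


LOG = """
[Thu Jun 17 18:43:01 2021] SERVICE INFO: PABX;Port_8443; Service switch to hard down state due to host down.
[Thu Jun 17 18:43:01 2021] SERVICE ALERT: PABX;Port_8443;CRITICAL;HARD;1;CRITICAL - Socket timeout after 10 seconds
[Thu Jun 17 18:50:21 2021] HOST ALERT: PABX;UP;HARD;1;OK - x.x.x.x: , rta 0.446ms, lost 0%
[Thu Jun 17 18:59:35 2021] INITIAL HOST STATE: PABX;UP;HARD;1;OK - x.x.x.x: , rta 0.234ms, lost 0%
[Thu Jun 17 18:59:35 2021] INITIAL SERVICE STATE: PABX;Port_8443;OK;HARD;1;TCP OK - 0.000 second response time on x.x.x.x on port 8443
"""

findings = find_unlogged_changes(LOG.splitlines())
for (host, service), before, after in findings:
    print(f"{host};{service}: last alert {before}, initial state {after}")
```

On this excerpt it reports PABX;Port_8443 going from CRITICAL HARD to OK HARD with no alert in between.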

[]s.

ccztux commented 3 months ago

Unfortunately, I can confirm this behaviour in Naemon 1.4.1:

[Tue Jun 25 03:40:01 2024] CURRENT SERVICE STATE: localhost;NCPA Connection;OK;HARD;1;OK: NCPA Agent (Version: 2.1.6, OS: Windows) is accessible via API (HTTPS, Port: 5693)
[Tue Jun 25 10:09:50 2024] SERVICE DOWNTIME ALERT: localhost;NCPA Connection;STARTED; Service has entered a period of scheduled downtime
[Tue Jun 25 10:09:50 2024] SERVICE NOTIFICATION SUPPRESSED: localhost;NCPA Connection;Notifications about SCHEDULED DOWNTIME events blocked for this object.
[Tue Jun 25 10:15:13 2024] SERVICE INFO: localhost;NCPA Connection; Service switch to hard down state due to host down.
[Tue Jun 25 10:15:13 2024] SERVICE ALERT: localhost;NCPA Connection;CRITICAL;HARD;1;CRITICAL - Connection to API (HTTPS, Port: 5693) failed.  Connection error: Connection timed out after 58000 milliseconds
[Tue Jun 25 10:24:20 2024] SERVICE ALERT: localhost;NCPA Connection;CRITICAL;SOFT;1;CRITICAL - Connection to API (HTTPS, Port: 5693) failed.  Connection error: Failed to connect to 127.0.0.1 port 5693: Connection refused
[Tue Jun 25 10:27:24 2024] SERVICE ALERT: localhost;NCPA Connection;OK;SOFT;2;OK: NCPA Agent (Version: 2.1.6, OS: Windows) is accessible via API (HTTPS, Port: 5693)
[Tue Jun 25 10:42:46 2024] SERVICE ALERT: localhost;NCPA Connection;CRITICAL;SOFT;1;CRITICAL - Connection to API (HTTPS, Port: 5693) failed.  Connection error: Failed to connect to 127.0.0.1 port 5693: Connection refused
[Tue Jun 25 10:45:52 2024] SERVICE ALERT: localhost;NCPA Connection;OK;SOFT;2;OK: NCPA Agent (Version: 2.1.6, OS: Windows) is accessible via API (HTTPS, Port: 5693)
[Tue Jun 25 16:00:00 2024] SERVICE DOWNTIME ALERT: localhost;NCPA Connection;STOPPED; Service has exited from a period of scheduled downtime

The OK HARD state that, according to the documentation, should follow the OK SOFT state is missing.

The OK SOFT state only changes to OK HARD through the CURRENT SERVICE STATE entry that is written when the log file is rotated the next day.
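
As a rough way to find affected services in an existing log, the following Python sketch flags every service whose latest SERVICE ALERT is OK SOFT while its last HARD alert is still a non-OK state. It assumes the line layout shown in the log above; `stuck_soft_recoveries` is a hypothetical name for this illustration, and the plugin output in the sample lines is shortened.

```python
import re

# host;service;state;state_type are the first four semicolon-separated fields.
ALERT = re.compile(r"^\[[^\]]+\] SERVICE ALERT: ([^;]*);([^;]*);([^;]*);([^;]*);")


def stuck_soft_recoveries(lines):
    """Return (host, service) pairs that recovered to OK SOFT but never logged OK HARD."""
    last_hard = {}  # (host, service) -> state of the latest HARD alert
    latest = {}     # (host, service) -> (state, state_type) of the latest alert
    for line in lines:
        m = ALERT.match(line.strip())
        if m is None:
            continue
        host, service, state, state_type = m.groups()
        key = (host, service)
        latest[key] = (state, state_type)
        if state_type == "HARD":
            last_hard[key] = state
    return sorted(
        key
        for key, (state, state_type) in latest.items()
        if state == "OK" and state_type == "SOFT" and last_hard.get(key, "OK") != "OK"
    )


# SERVICE ALERT lines from the log above, plugin output shortened.
LOG = """
[Tue Jun 25 10:15:13 2024] SERVICE ALERT: localhost;NCPA Connection;CRITICAL;HARD;1;CRITICAL - Connection to API (HTTPS, Port: 5693) failed.
[Tue Jun 25 10:24:20 2024] SERVICE ALERT: localhost;NCPA Connection;CRITICAL;SOFT;1;CRITICAL - Connection to API (HTTPS, Port: 5693) failed.
[Tue Jun 25 10:27:24 2024] SERVICE ALERT: localhost;NCPA Connection;OK;SOFT;2;OK: NCPA Agent
[Tue Jun 25 10:42:46 2024] SERVICE ALERT: localhost;NCPA Connection;CRITICAL;SOFT;1;CRITICAL - Connection to API (HTTPS, Port: 5693) failed.
[Tue Jun 25 10:45:52 2024] SERVICE ALERT: localhost;NCPA Connection;OK;SOFT;2;OK: NCPA Agent
"""

stuck = stuck_soft_recoveries(LOG.splitlines())
print(stuck)
```

A report that only counts HARD states would treat every service in this list as CRITICAL until the next CURRENT SERVICE STATE entry.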

This is how it looks in a Thruk availability report:

naemon / naemon-core

Problem in the SOFT HARD check logic #368