Open VladimirBilik opened 5 years ago
could you double check if performance data processing is still enabled when that happens? It can be disabled on the "Process Info" page. Or check the logs for external commands in that direction.
Processing is enabled - other services are OK at the same time. Log does not show any changes on this attribute.
I have another observation - after change in service state the performance data started to be processed again. Debug info: 1) at 1544302037 [Sat Dec 8 21:47:17 CET 2018] the service was in OK state. Performance data was not processed at this time:
[1544302037.528470] [016.0] [pid=29143] Check of service 'ping 10.130.241.129' on host 'gwco.dc' (id=610) was overridden by a module
[1544302037.591615] [016.2] [pid=29143] Processing check result for service 'ping 10.130.241.129' on host 'gwco.dc'
[1544302037.591648] [016.0] [pid=29143] ** Handling check result for service 'ping 10.130.241.129' on host 'gwco.dc' from 'Mod-Gearman Worker'...
[1544302037.591658] [016.1] [pid=29143] HOST: gwco.dc, SERVICE: ping 10.130.241.129, CHECK TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, EXITED OK: Yes, RETURN CODE: 0, OUTPUT: OK - 10.130.241.129: rta 1.434ms, lost 0%|rta=1.434ms;20.000;50.000;0; pl=0%;5;15;; rtmax=2.037ms;;;; rtmin=1.315ms;;;;
[1544302037.591681] [016.2] [pid=29143] Parsing check output...
[1544302037.591689] [016.2] [pid=29143] Short Output: OK - 10.130.241.129: rta 1.434ms, lost 0%
[1544302037.591697] [016.2] [pid=29143] Long Output: NULL
[1544302037.591705] [016.2] [pid=29143] Perf Data: rta=1.434ms;20.000;50.000;0; pl=0%;5;15;; rtmax=2.037ms;;;; rtmin=1.315ms;;;;
[1544302037.591713] [016.2] [pid=29143] ST: HARD CA: 1 MA: 5 CS: 0 LS: 0 LHS: 0
[1544302037.591722] [016.1] [pid=29143] Service is OK.
[1544302037.591729] [016.1] [pid=29143] Service did not change state.
[1544302037.591740] [064.1] [pid=29143] Making callbacks (type 6)...
[1544302037.591753] [064.2] [pid=29143] Callback #1 (type 6) return code = 0
[1544302037.591764] [064.2] [pid=29143] Callback #2 (type 6) return code = 0
[1544302037.591774] [064.1] [pid=29143] Making callbacks (type 13)...
[1544302037.591783] [016.1] [pid=29143] Checking service 'ping 10.130.241.129' on host 'gwco.dc' for flapping...
[1544302037.591792] [016.2] [pid=29143] LFT=10.00, HFT=40.00, CPC=0.00, PSC=0.00%
[1544302037.591804] [016.1] [pid=29143] Service is not flapping (0.00% state change).
[1544302037.591813] [016.1] [pid=29143] Checking host 'gwco.dc' for flapping...
[1544302037.591821] [016.2] [pid=29143] LFT=10.00, HFT=40.00, CPC=0.00, PSC=0.00%
[1544302037.591831] [016.1] [pid=29143] Host is not flapping (0.00% state change).
2) at 1544302637 [Sat Dec 8 21:57:17 CET 2018] status of service has been changed to WARNING and performance data were processed
[1544302637.526849] [016.0] [pid=29143] Check of service 'ping 10.130.241.129' on host 'gwco.dc' (id=610) was overridden by a module
[1544302639.784929] [016.2] [pid=29143] Processing check result for service 'ping 10.130.241.129' on host 'gwco.dc'
[1544302639.784985] [016.0] [pid=29143] ** Handling check result for service 'ping 10.130.241.129' on host 'gwco.dc' from 'Mod-Gearman Worker'...
[1544302639.784996] [016.1] [pid=29143] HOST: gwco.dc, SERVICE: ping 10.130.241.129, CHECK TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, EXITED OK: Yes, RETURN CODE: 1, OUTPUT: WARNING - 10.130.241.129: rta 2.900ms, lost 10%|rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;
[1544302639.785020] [016.2] [pid=29143] Parsing check output...
[1544302639.785030] [016.2] [pid=29143] Short Output: WARNING - 10.130.241.129: rta 2.900ms, lost 10%
[1544302639.785038] [016.2] [pid=29143] Long Output: NULL
[1544302639.785046] [016.2] [pid=29143] Perf Data: rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;
[1544302639.785055] [016.2] [pid=29143] ST: HARD CA: 1 MA: 5 CS: 1 LS: 0 LHS: 0
[1544302639.785063] [016.2] [pid=29143] Service has changed state since last check!
[1544302639.785071] [016.1] [pid=29143] Service is in a non-OK state!
[1544302639.785079] [016.1] [pid=29143] Host is currently UP, so we'll recheck its state to make sure...
[1544302639.785092] [064.1] [pid=29143] Making callbacks (type 12)...
[1544302639.785109] [064.2] [pid=29143] Callback #1 (type 12) return code = 0
[1544302639.785119] [016.1] [pid=29143] Current/Max Attempt(s): 1/5
[1544302639.785127] [016.1] [pid=29143] Host is UP, so we'll retry the service check...
[1544302639.785190] [064.1] [pid=29143] Making callbacks (type 2)...
[1544302639.785224] [064.2] [pid=29143] Callback #1 (type 2) return code = 0
[1544302639.785238] [064.1] [pid=29143] Making callbacks (type 23)...
[1544302639.785248] [064.2] [pid=29143] Callback #1 (type 23) return code = 0
[1544302639.785262] [064.1] [pid=29143] Making callbacks (type 13)...
[1544302639.785275] [064.1] [pid=29143] Making callbacks (type 6)...
[1544302639.785286] [064.2] [pid=29143] Callback #1 (type 6) return code = 0
[1544302639.785297] [064.2] [pid=29143] Callback #2 (type 6) return code = 0
[1544302639.785306] [064.1] [pid=29143] Making callbacks (type 13)...
[1544302639.785315] [016.1] [pid=29143] Checking service 'ping 10.130.241.129' on host 'gwco.dc' for flapping...
[1544302639.785323] [016.1] [pid=29143] Checking host 'gwco.dc' for flapping...
[1544302639.785331] [016.2] [pid=29143] LFT=10.00, HFT=40.00, CPC=0.00, PSC=0.00%
[1544302639.785344] [016.1] [pid=29143] Host is not flapping (0.00% state change).
[1544302639.785357] [016.2] [pid=29143] Raw service performance data file output: DATATYPE::SERVICEPERFDATA TIMET::$TIMET$ HOSTNAME::$HOSTNAME$ SERVICEDESC::$SERVICEDESC$ SERVICEPERFDATA::$SERVICEPERFDATA$ SERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$ HOSTSTATE::$HOSTSTATE$ HOSTSTATETYPE::$HOSTSTATETYPE$ SERVICESTATE::$SERVICESTATE$ SERVICESTATETYPE::$SERVICESTATETYPE$ SERVICEOUTPUT::$SERVICEOUTPUT$
[1544302639.785380] [2048.1] [pid=29143] **** BEGIN MACRO PROCESSING ***********
[1544302639.785388] [2048.1] [pid=29143] Processing: 'DATATYPE::SERVICEPERFDATA TIMET::$TIMET$ HOSTNAME::$HOSTNAME$ SERVICEDESC::$SERVICEDESC$ SERVICEPERFDATA::$SERVICEPERFDATA$ SERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$ HOSTSTATE::$HOSTSTATE$ HOSTSTATETYPE::$HOSTSTATETYPE$ SERVICESTATE::$SERVICESTATE$ SERVICESTATETYPE::$SERVICESTATETYPE$ SERVICEOUTPUT::$SERVICEOUTPUT$
'
[1544302639.785397] [2048.2] [pid=29143] Processing part: 'DATATYPE::SERVICEPERFDATA TIMET::'
[1544302639.785406] [2048.2] [pid=29143] Not currently in macro. Running output (33): 'DATATYPE::SERVICEPERFDATA TIMET::'
[1544302639.785414] [2048.2] [pid=29143] Processing part: 'TIMET'
[1544302639.785425] [2048.2] [pid=29143] macros[11] (TIMET) match.
[1544302639.785434] [2048.2] [pid=29143] Processed 'TIMET', Free: 1
[1544302639.785442] [2048.2] [pid=29143] Processed 'TIMET', Free: 1, Cleaning options: 0
[1544302639.785450] [2048.2] [pid=29143] Uncleaned macro. Running output (43): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639'
[1544302639.785458] [2048.2] [pid=29143] Just finished macro. Running output (43): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639'
[1544302639.785466] [2048.2] [pid=29143] Processing part: ' HOSTNAME::'
[1544302639.785473] [2048.2] [pid=29143] Not currently in macro. Running output (54): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::'
[1544302639.785481] [2048.2] [pid=29143] Processing part: 'HOSTNAME'
[1544302639.785489] [2048.2] [pid=29143] macros[0] (HOSTNAME) match.
[1544302639.785498] [2048.2] [pid=29143] Processed 'HOSTNAME', Free: 0
[1544302639.785505] [2048.2] [pid=29143] Processed 'HOSTNAME', Free: 0, Cleaning options: 0
[1544302639.785514] [2048.2] [pid=29143] Uncleaned macro. Running output (71): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc'
[1544302639.785521] [2048.2] [pid=29143] Just finished macro. Running output (71): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc'
[1544302639.785529] [2048.2] [pid=29143] Processing part: ' SERVICEDESC::'
[1544302639.785537] [2048.2] [pid=29143] Not currently in macro. Running output (85): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::'
[1544302639.785545] [2048.2] [pid=29143] Processing part: 'SERVICEDESC'
[1544302639.785553] [2048.2] [pid=29143] macros[3] (SERVICEDESC) match.
[1544302639.785561] [2048.2] [pid=29143] Processed 'SERVICEDESC', Free: 0
[1544302639.785569] [2048.2] [pid=29143] Processed 'SERVICEDESC', Free: 0, Cleaning options: 0
[1544302639.785577] [2048.2] [pid=29143] Uncleaned macro. Running output (104): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129'
[1544302639.785585] [2048.2] [pid=29143] Just finished macro. Running output (104): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129'
[1544302639.785593] [2048.2] [pid=29143] Processing part: ' SERVICEPERFDATA::'
[1544302639.785601] [2048.2] [pid=29143] Not currently in macro. Running output (122): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129 SERVICEPERFDATA::'
[1544302639.785610] [2048.2] [pid=29143] Processing part: 'SERVICEPERFDATA'
[1544302639.785618] [2048.2] [pid=29143] macros[19] (SERVICEPERFDATA) match.
[1544302639.785625] [2048.2] [pid=29143] Processed 'SERVICEPERFDATA', Free: 0
[1544302639.785633] [2048.2] [pid=29143] Processed 'SERVICEPERFDATA', Free: 0, Cleaning options: 0
[1544302639.785642] [2048.2] [pid=29143] Uncleaned macro. Running output (200): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129 SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;'
[1544302639.785674] [2048.2] [pid=29143] Just finished macro. Running output (200): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129 SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;'
[1544302639.785683] [2048.2] [pid=29143] Processing part: ' SERVICECHECKCOMMAND::'
[1544302639.785692] [2048.2] [pid=29143] Not currently in macro. Running output (222): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129 SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;; SERVICECHECKCOMMAND::'
[1544302639.785700] [2048.2] [pid=29143] Processing part: 'SERVICECHECKCOMMAND'
[1544302639.785708] [2048.2] [pid=29143] macros[60] (SERVICECHECKCOMMAND) match.
[1544302639.785716] [2048.2] [pid=29143] Processed 'SERVICECHECKCOMMAND', Free: 0
[1544302639.785724] [2048.2] [pid=29143] Processed 'SERVICECHECKCOMMAND', Free: 0, Cleaning options: 0
[1544302639.785732] [2048.2] [pid=29143] Uncleaned macro. Running output (265): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129 SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;; SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2'
[1544302639.785741] [2048.2] [pid=29143] Just finished macro. Running output (265): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129 SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;; SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2'
[1544302639.785748] [2048.2] [pid=29143] Processing part: ' HOSTSTATE::'
[1544302639.785756] [2048.2] [pid=29143] Not currently in macro. Running output (277): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129 SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;; SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2 HOSTSTATE::'
[1544302639.785764] [2048.2] [pid=29143] Processing part: 'HOSTSTATE'
[1544302639.785772] [2048.2] [pid=29143] macros[26] (HOSTSTATE) match.
[1544302639.785781] [2048.2] [pid=29143] Processed 'HOSTSTATE', Free: 0
[1544302639.785788] [2048.2] [pid=29143] Processed 'HOSTSTATE', Free: 0, Cleaning options: 0
[1544302639.785797] [2048.2] [pid=29143] Uncleaned macro. Running output (279): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129 SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;; SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2 HOSTSTATE::UP'
[1544302639.785805] [2048.2] [pid=29143] Just finished macro. Running output (279): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129 SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;; SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2 HOSTSTATE::UP'
[1544302639.785813] [2048.2] [pid=29143] Processing part: ' HOSTSTATETYPE::'
[1544302639.785821] [2048.2] [pid=29143] Not currently in macro. Running output (295): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129 SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;; SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2 HOSTSTATE::UP HOSTSTATETYPE::'
[1544302639.785829] [2048.2] [pid=29143] Processing part: 'HOSTSTATETYPE'
[1544302639.785837] [2048.2] [pid=29143] macros[41] (HOSTSTATETYPE) match.
[1544302639.785844] [2048.2] [pid=29143] Processed 'HOSTSTATETYPE', Free: 0
[1544302639.785857] [2048.2] [pid=29143] Processed 'HOSTSTATETYPE', Free: 0, Cleaning options: 0
[1544302639.785867] [2048.2] [pid=29143] Uncleaned macro. Running output (299): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129 SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;; SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2 HOSTSTATE::UP HOSTSTATETYPE::HARD'
[1544302639.785875] [2048.2] [pid=29143] Just finished macro. Running output (299): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129 SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;; SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2 HOSTSTATE::UP HOSTSTATETYPE::HARD'
[1544302639.785883] [2048.2] [pid=29143] Processing part: ' SERVICESTATE::'
[1544302639.785891] [2048.2] [pid=29143] Not currently in macro. Running output (314): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129 SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;; SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2 HOSTSTATE::UP HOSTSTATETYPE::HARD SERVICESTATE::'
[1544302639.785899] [2048.2] [pid=29143] Processing part: 'SERVICESTATE'
[1544302639.785907] [2048.2] [pid=29143] macros[4] (SERVICESTATE) match.
[1544302639.785915] [2048.2] [pid=29143] Processed 'SERVICESTATE', Free: 0
[1544302639.785922] [2048.2] [pid=29143] Processed 'SERVICESTATE', Free: 0, Cleaning options: 0
[1544302639.785930] [2048.2] [pid=29143] Uncleaned macro. Running output (321): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129 SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;; SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2 HOSTSTATE::UP HOSTSTATETYPE::HARD SERVICESTATE::WARNING'
[1544302639.785938] [2048.2] [pid=29143] Just finished macro. Running output (321): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129 SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;; SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2 HOSTSTATE::UP HOSTSTATETYPE::HARD SERVICESTATE::WARNING'
[1544302639.785946] [2048.2] [pid=29143] Processing part: ' SERVICESTATETYPE::'
[1544302639.785954] [2048.2] [pid=29143] Not currently in macro. Running output (340): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129 SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;; SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2 HOSTSTATE::UP HOSTSTATETYPE::HARD SERVICESTATE::WARNING SERVICESTATETYPE::'
[1544302639.785963] [2048.2] [pid=29143] Processing part: 'SERVICESTATETYPE'
[1544302639.785971] [2048.2] [pid=29143] macros[42] (SERVICESTATETYPE) match.
[1544302639.785979] [2048.2] [pid=29143] Processed 'SERVICESTATETYPE', Free: 0
[1544302639.785986] [2048.2] [pid=29143] Processed 'SERVICESTATETYPE', Free: 0, Cleaning options: 0
[1544302639.785994] [2048.2] [pid=29143] Uncleaned macro. Running output (344): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129 SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;; SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2 HOSTSTATE::UP HOSTSTATETYPE::HARD SERVICESTATE::WARNING SERVICESTATETYPE::SOFT'
[1544302639.786002] [2048.2] [pid=29143] Just finished macro. Running output (344): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129 SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;; SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2 HOSTSTATE::UP HOSTSTATETYPE::HARD SERVICESTATE::WARNING SERVICESTATETYPE::SOFT'
[1544302639.786021] [2048.2] [pid=29143] Processing part: ' SERVICEOUTPUT::'
[1544302639.786039] [2048.2] [pid=29143] Not currently in macro. Running output (360): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129 SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;; SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2 HOSTSTATE::UP HOSTSTATETYPE::HARD SERVICESTATE::WARNING SERVICESTATETYPE::SOFT SERVICEOUTPUT::'
[1544302639.786048] [2048.2] [pid=29143] Processing part: 'SERVICEOUTPUT'
[1544302639.786056] [2048.2] [pid=29143] macros[17] (SERVICEOUTPUT) match.
[1544302639.786064] [2048.2] [pid=29143] Processed 'SERVICEOUTPUT', Free: 0
[1544302639.786072] [2048.2] [pid=29143] Processed 'SERVICEOUTPUT', Free: 0, Cleaning options: 0
[1544302639.786080] [2048.2] [pid=29143] Uncleaned macro. Running output (407): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129 SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;; SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2 HOSTSTATE::UP HOSTSTATETYPE::HARD SERVICESTATE::WARNING SERVICESTATETYPE::SOFT SERVICEOUTPUT::WARNING - 10.130.241.129: rta 2.900ms, lost 10%'
[1544302639.786089] [2048.2] [pid=29143] Just finished macro. Running output (407): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129 SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;; SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2 HOSTSTATE::UP HOSTSTATETYPE::HARD SERVICESTATE::WARNING SERVICESTATETYPE::SOFT SERVICEOUTPUT::WARNING - 10.130.241.129: rta 2.900ms, lost 10%'
[1544302639.786097] [2048.2] [pid=29143] Processing part: '
'
[1544302639.786105] [2048.2] [pid=29143] Not currently in macro. Running output (408): 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129 SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;; SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2 HOSTSTATE::UP HOSTSTATETYPE::HARD SERVICESTATE::WARNING SERVICESTATETYPE::SOFT SERVICEOUTPUT::WARNING - 10.130.241.129: rta 2.900ms, lost 10%
'
[1544302639.786113] [2048.1] [pid=29143] Done. Final output: 'DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129 SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;; SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2 HOSTSTATE::UP HOSTSTATETYPE::HARD SERVICESTATE::WARNING SERVICESTATETYPE::SOFT SERVICEOUTPUT::WARNING - 10.130.241.129: rta 2.900ms, lost 10%
'
[1544302639.786121] [2048.1] [pid=29143] **** END MACRO PROCESSING *************
[1544302639.786129] [016.2] [pid=29143] Processed service performance data file output: DATATYPE::SERVICEPERFDATA TIMET::1544302639 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.129 SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;; SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2 HOSTSTATE::UP HOSTSTATETYPE::HARD SERVICESTATE::WARNING SERVICESTATETYPE::SOFT SERVICEOUTPUT::WARNING - 10.130.241.129: rta 2.900ms, lost 10%
From that time the performance data processing continues normally, even with the status of a service changed back to OK.
from looking into the code, there is no reason why this could happen except the process performance data flag has been disabled.
@VladimirBilik which application consumes your performance data file? And what is the target database? (Rrdtool, Graphite, etc... )?
I use pnp4nagios in Bulk Mode - config is:
process_performance_data=1
host_perfdata_file=/var/data/naemon/var-ramdisk/perfdata/host-perfdata
service_perfdata_file=/var/data/naemon/var-ramdisk/perfdata/service-perfdata
host_perfdata_file_mode=a
service_perfdata_file_mode=a
host_perfdata_file_template=DATATYPE::HOSTPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tHOSTPERFDATA::$HOSTPERFDATA$\tHOSTCHECKCOMMAND::$HOSTCHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tHOSTOUTPUT::$HOSTOUTPUT$
service_perfdata_file_template=DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\tSERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\tSERVICESTATETYPE::$SERVICESTATETYPE$\tSERVICEOUTPUT::$SERVICEOUTPUT$
host_perfdata_file_processing_interval=60
service_perfdata_file_processing_interval=60
host_perfdata_file_processing_command=process-host-perfdata-file
service_perfdata_file_processing_command=process-service-perfdata-file
This configuration worked perfectly until update to 1.0.8. I noticed a problem when I saw 'holes' in the graphs, like this: I use ramdisk of size 1G for speedup storing performance data (/var/data/naemon/var-ramdisk/perfdata/service-perfdata) with enough freespace on it. It was first thing I checked. Should I try running PNP4Nagios in different mode? I.e. Bulk Mode with NPCD?
Have you already tried to uses mod_gearman for processing of performance data? https://labs.consol.de/de/nagios/mod-gearman/index.html#_server_options https://docs.pnp4nagios.org/pnp-0.6/modes#gearman_mode
Fortunately I don't use NPCD, process_perfdata.pl and Naemons in-build process_performance_data
for years anymore.
I always used Bulk Mode with NPCD
. but you realy need to take care that NPCD is running, otherwise you will run out of disk inodes.
Did you already run Naemon in foreground to see if there will be any errors reported like so?
sudo -u naemon /bin/bash
/opt/naemon/bin/naemon /opt/naemon/etc/naemon/naemon.cfg
Maybe you see something like Unable to fork: Cannot allocate memory
I've just started naemon in foreground with screen session. I'll need some time to catch a service which stops writing performance data now. The last one will be probably OK after restart.
Regarding PNP with mod_gearman: it is a bit complicated to compile it in my environment, so I will try the NPCD.
According the Naemon source code there is nothing between calling
check_for_service_flapping()
check_for_host_flapping()
and
update_service_performance_data()
I'm looking into _update_service_performancedata. Giving assumption that process_performance_data=TRUE and svc->process_performance_data=TRUE in my static Naemon configuration, there is just one more check for _service_perfdata_process_emptyresults. I will set it temporary to TRUE. If it helps, I suppose there will be something in one of these conditions:
if (!svc || !svc->perf_data || !*svc->perf_data) {
return OK;
}
if ((!service_perfdata_file_template) && !service_perfdata_command) {
return OK;
}
What do you think?
I quickly updated my Naemon demo system from 1.0.3 to 1.0.8. (I'am not using mod_gearman)
I'm used to get my performance data from this boy: https://github.com/naemon/naemon-core/blob/1d420731922a0b916f08f511dcad6a44863d3b2f/src/naemon/checks_service.c#L1007-L1022
which is just a few lines above update_service_performance_data()
So, should be nearly impossible that performance data gets vanished between those lines.
My system is now running at Naemon 1.0.8 since Tue 2018-12-11 20:49:24 CET
. I will check for missing performance data any now and than. The system is also publicly available if you want to take a look.
My assumption about settings of _service_perfdata_process_emptyresults was wrong, I see that it is set to 1 by default. So the mentioned part of code with conditions is never run. Back to my observations:
Regarding testing NPCD mode: a version distributed with pnp4nagios (pnp4nagios-0.6.25-1.el7.x86_64) have npcdmod.o just for Nagios, which doesn't work with my Naemon:
Error: Could not load module '/usr/lib64/nagios/brokers/npcdmod.o' -> /usr/lib64/nagios/brokers/npcdmod.o: undefined symbol: schedule_new_event
so I have to compile a version of mine (a bit complication with maintenance)
As far as I know does npcdmod.o
only support Nagios 3. There is no version for Nagios 4 or Naemon available.
From the docs:
Attention Starting with Nagios 4 the internal structures have changed so the start of the module will fail. So far there are no plans to support Nagios 4. Please select any other of the modes.
while thats true, its not too complicated to get pnp running with naemon, we build the npcdmod module in omd for naemon like this: https://github.com/ConSol/omd/tree/labs/packages/pnp4nagios4
@VladimirBilik were you able to find the root cause for the missing performance data?
Unfortunately no. I've tested npcdmod.o (@sni - I have pulled it from your OMD distribution) which worked fine. But I noticed that this module writes performance data in format:
DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\t\
SERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\t\
HOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\t\
SERVICESTATETYPE::$SERVICESTATETYPE$
while I need to have performance data in format defined in _service_perfdata_filetemplate:
DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\t\
SERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\t\
HOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\t\
SERVICESTATETYPE::$SERVICESTATETYPE$\tSERVICEOUTPUT::$SERVICEOUTPUT$
i.e. there is missing the $SERVICEOUTPUT$ macro in performance data produced by npcdmod. For testing purposes I changed my file template so there was no $SERVICEOUTPUT$ macro in it. After 24hours of observation there was just one occurence of missing data. In comparison to previous results this was much better behaviour. Unfortunately - I need to have $SERVICEOUTPUT$ in our graphs, so I switched back to previous definition of data file template.
Yesterday I have switched to new core 1.0.9 and performace data template with $SERVICEOUTPUT$ in it. Till now I noticed just one service with missing data, so I can tell it is much better.
But I still have no idea why it is happens. Most strange for me is that when some service stops getting data, after forcing check (just once) it is switched to normal behaviour with full data in _service_perfdatafile I thing it could have something to do with macro processing - when I change the data file template so it has no $SERVICEOUTPUT$, it behaves almost normally - just one exception in 24hours. Also - when I reduced the number of services defined in naemon to 100, I noticed no data loss. My system is not overloaded even with 2000 monitored services, it has enough memory and there is no killed processes recorded in system logs. @nook24 - I see that in your demo everything works fine, without data loss. Please, could you test your system with more services defined? For example 2000 of pings with collecting performance data at least for 24hours?
@VladimirBilik I'm on vacation and don't have any access to some powerful hardware... The demo system is just a cheap VPS where i can't deploy such a load load.
Have you already tried to strip your system down and removed all broker modules?
Running naemon 1.0.8 with mod_gearman 3.0.7, I have problem with lost performance data occasionally. According to debug the data are delivered from plugin to Naemon, but after some processing they are not written into performance data file. This happens irregularrly. If the data is getting lost on some service it keeps loosing on it until I reload the naemon or force the check of service manually.
I have configured host + services like this:
With debugging turned on the correct check of service looks like this:
the check of service with lost data looks like this:
I.e. it seems like the plugin output is not processed after service and host flapping checks.
It happens on various plugins - it is not related to particular one. System does not report any plugin/process crash, no memory leaks (32G RAM/4G RAM is still free). Any idea what else to check?
Thanks.