naemon / naemon-core

Networks, Applications and Event Monitor
http://www.naemon.io/
GNU General Public License v2.0
154 stars 63 forks source link

Lost performance data in file output #273

Open VladimirBilik opened 5 years ago

VladimirBilik commented 5 years ago

Running naemon 1.0.8 with mod_gearman 3.0.7, I have problem with lost performance data occasionally. According to debug the data are delivered from plugin to Naemon, but after some processing they are not written into performance data file. This happens irregularrly. If the data is getting lost on some service it keeps loosing on it until I reload the naemon or force the check of service manually.

I have configured host + services like this:

define host {
    host_name              gwco.dc
    address                1.2.3.4
    check_command          check-iphost-alive!20,5%!50,15%!10!2
}

define service {
    use                 IPSERVICE_ICMP
    host_name           gwco.dc
    service_description ping 10.130.241.201
    check_command       check-icmp!10.130.241.201!20,5%!50,15%!10!2
}

define service {
    use                 IPSERVICE_ICMP
    host_name           gwco.dc
    service_description ping 10.130.241.129
    check_command       check-icmp!10.130.241.129!20,5%!50,15%!10!2
}

With debugging turned on the correct check of service looks like this:

[1544168150.936756] [016.0] [pid=29143] Check of service 'ping 10.130.241.201' on host 'gwco.dc' (id=611) was overridden by a module
[1544168150.936784] [064.1] [pid=29143] Making callbacks (type 13)...

[1544168151.105062] [016.2] [pid=29143] Processing check result for service 'ping 10.130.241.201' on host 'gwco.dc'
[1544168151.105116] [016.0] [pid=29143] ** Handling check result for service 'ping 10.130.241.201' on host 'gwco.dc' from 'Mod-Gearman Worker'...
[1544168151.105126] [016.1] [pid=29143] HOST: gwco.dc, SERVICE: ping 10.130.241.201, CHECK TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, EXITED OK: Yes, RETURN CODE: 0, OUTPUT: OK - 10.130.241.201: rta 0.862ms, lost 0%|rta=0.862ms;20.000;50.000;0; pl=0%;5;15;; rtmax=1.036ms;;;; rtmin=0.740ms;;;; 

[1544168151.105152] [016.2] [pid=29143] Parsing check output...
[1544168151.105160] [016.2] [pid=29143] Short Output: OK - 10.130.241.201: rta 0.862ms, lost 0%
[1544168151.105168] [016.2] [pid=29143] Long Output:  NULL
[1544168151.105176] [016.2] [pid=29143] Perf Data:    rta=0.862ms;20.000;50.000;0; pl=0%;5;15;; rtmax=1.036ms;;;; rtmin=0.740ms;;;;
[1544168151.105184] [016.2] [pid=29143] ST: HARD  CA: 1  MA: 5  CS: 0  LS: 0  LHS: 0
[1544168151.105192] [016.1] [pid=29143] Service is OK.
[1544168151.105200] [016.1] [pid=29143] Service did not change state.
[1544168151.105245] [064.1] [pid=29143] Making callbacks (type 6)...
[1544168151.105262] [064.2] [pid=29143] Callback #1 (type 6) return code = 0
[1544168151.105274] [064.2] [pid=29143] Callback #2 (type 6) return code = 0
[1544168151.105285] [064.1] [pid=29143] Making callbacks (type 13)...
[1544168151.105293] [016.1] [pid=29143] Checking service 'ping 10.130.241.201' on host 'gwco.dc' for flapping...
[1544168151.105302] [016.2] [pid=29143] LFT=10.00, HFT=40.00, CPC=0.00, PSC=0.00%
[1544168151.105321] [016.1] [pid=29143] Service is not flapping (0.00% state change).
[1544168151.105330] [016.1] [pid=29143] Checking host 'gwco.dc' for flapping...
[1544168151.105338] [016.2] [pid=29143] LFT=10.00, HFT=40.00, CPC=0.00, PSC=0.00%
[1544168151.105348] [016.1] [pid=29143] Host is not flapping (0.00% state change).
[1544168151.105383] [016.2] [pid=29143] Raw service performance data file output: DATATYPE::SERVICEPERFDATA TIMET::$TIMET$ HOSTNAME::$HOSTNAME$ SERVICEDESC::$SERVICEDESC$ SERVICEPERFDATA::$SERVICEPERFDATA$ SERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$ HOSTSTATE::$HOSTSTATE$ HOSTSTATETYPE::$HOSTSTATETYPE$ SERVICESTATE::$SERVICESTATE$ SERVICESTATETYPE::$SERVICESTATETYPE$ SERVICEOUTPUT::$SERVICEOUTPUT$

[1544168151.105402] [2048.1] [pid=29143] **** BEGIN MACRO PROCESSING ***********
[1544168151.105436] [2048.1] [pid=29143] Processing: 'DATATYPE::SERVICEPERFDATA TIMET::$TIMET$ HOSTNAME::$HOSTNAME$ SERVICEDESC::$SERVICEDESC$ SERVICEPERFDATA::$SERVICEPERFDATA$ SERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$ HOSTSTATE::$HOSTSTATE$ HOSTSTATETYPE::$HOSTSTATETYPE$ SERVICESTATE::$SERVICESTATE$ SERVICESTATETYPE::$SERVICESTATETYPE$ SERVICEOUTPUT::$SERVICEOUTPUT$
'
[1544168151.105447] [2048.2] [pid=29143]   Processing part: 'DATATYPE::SERVICEPERFDATA TIMET::'
[1544168151.105457] [2048.2] [pid=29143]   Not currently in macro.  Running output (33): 'DATATYPE::SERVICEPERFDATA TIMET::'
[1544168151.105465] [2048.2] [pid=29143]   Processing part: 'TIMET'
[1544168151.105475] [2048.2] [pid=29143]   macros[11] (TIMET) match.
[1544168151.105485] [2048.2] [pid=29143]   Processed 'TIMET', Free: 1
[1544168151.105497] [2048.2] [pid=29143]   Processed 'TIMET', Free: 1,  Cleaning options: 0
[1544168151.105506] [2048.2] [pid=29143]   Uncleaned macro.  Running output (43): 'DATATYPE::SERVICEPERFDATA TIMET::1544168151'
[1544168151.105514] [2048.2] [pid=29143]   Just finished macro.  Running output (43): 'DATATYPE::SERVICEPERFDATA TIMET::1544168151'
...messages from macro processing...
[1544168151.106201] [2048.1] [pid=29143]   Done.  Final output: 'DATATYPE::SERVICEPERFDATA TIMET::1544168151 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.201 SERVICEPERFDATA::rta=0.862ms;20.000;50.000;0; pl=0%;5;15;; rtmax=1.036ms;;;; rtmin=0.740ms;;;; SERVICECHECKCOMMAND::check-icmp!10.130.241.201!20,5%!50,15%!10!2 HOSTSTATE::UP HOSTSTATETYPE::HARD SERVICESTATE::OK SERVICESTATETYPE::HARD SERVICEOUTPUT::OK - 10.130.241.201: rta 0.862ms, lost 0%
'
[1544168151.106223] [2048.1] [pid=29143] **** END MACRO PROCESSING *************
[1544168151.106231] [016.2] [pid=29143] Processed service performance data file output: DATATYPE::SERVICEPERFDATA TIMET::1544168151 HOSTNAME::gwco.dc SERVICEDESC::ping 10.130.241.201 SERVICEPERFDATA::rta=0.862ms;20.000;50.000;0; pl=0%;5;15;; rtmax=1.036ms;;;; rtmin=0.740ms;;;; SERVICECHECKCOMMAND::check-icmp!10.130.241.201!20,5%!50,15%!10!2 HOSTSTATE::UP HOSTSTATETYPE::HARD SERVICESTATE::OK SERVICESTATETYPE::HARD SERVICEOUTPUT::OK - 10.130.241.201: rta 0.862ms, lost 0%

the check of service with lost data looks like this:

[1544168237.078812] [016.0] [pid=29143] Check of service 'ping 10.130.241.129' on host 'gwco.dc' (id=610) was overridden by a module
[1544168237.078862] [064.1] [pid=29143] Making callbacks (type 13)...
[1544168237.078926] [064.1] [pid=29143] Making callbacks (type 13)...
[1544168237.264200] [016.2] [pid=29143] Processing check result for service 'ping 10.130.241.129' on host 'gwco.dc'
[1544168237.264252] [016.0] [pid=29143] ** Handling check result for service 'ping 10.130.241.129' on host 'gwco.dc' from 'Mod-Gearman Worker'...
[1544168237.264262] [016.1] [pid=29143] HOST: gwco.dc, SERVICE: ping 10.130.241.129, CHECK TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, EXITED OK: Yes, RETURN CODE: 0, OUTPUT: OK - 10.130.241.129: rta 1.809ms, lost 0%|rta=1.809ms;20.000;50.000;0; pl=0%;5;15;; rtmax=2.990ms;;;; rtmin=1.325ms;;;; 

[1544168237.264293] [016.2] [pid=29143] Parsing check output...
[1544168237.264302] [016.2] [pid=29143] Short Output: OK - 10.130.241.129: rta 1.809ms, lost 0%
[1544168237.264310] [016.2] [pid=29143] Long Output:  NULL
[1544168237.264318] [016.2] [pid=29143] Perf Data:    rta=1.809ms;20.000;50.000;0; pl=0%;5;15;; rtmax=2.990ms;;;; rtmin=1.325ms;;;;
[1544168237.264326] [016.2] [pid=29143] ST: HARD  CA: 1  MA: 5  CS: 0  LS: 0  LHS: 0
[1544168237.264334] [016.1] [pid=29143] Service is OK.
[1544168237.264341] [016.1] [pid=29143] Service did not change state.
[1544168237.264352] [064.1] [pid=29143] Making callbacks (type 6)...
[1544168237.264365] [064.2] [pid=29143] Callback #1 (type 6) return code = 0
[1544168237.264377] [064.2] [pid=29143] Callback #2 (type 6) return code = 0
[1544168237.264388] [064.1] [pid=29143] Making callbacks (type 13)...
[1544168237.264396] [016.1] [pid=29143] Checking service 'ping 10.130.241.129' on host 'gwco.dc' for flapping...
[1544168237.264405] [016.2] [pid=29143] LFT=10.00, HFT=40.00, CPC=0.00, PSC=0.00%
[1544168237.264419] [016.1] [pid=29143] Service is not flapping (0.00% state change).
[1544168237.264428] [016.1] [pid=29143] Checking host 'gwco.dc' for flapping...
[1544168237.264436] [016.2] [pid=29143] LFT=10.00, HFT=40.00, CPC=0.00, PSC=0.00%
[1544168237.264446] [016.1] [pid=29143] Host is not flapping (0.00% state change).

I.e. it seems like the plugin output is not processed after service and host flapping checks.

It happens on various plugins - it is not related to particular one. System does not report any plugin/process crash, no memory leaks (32G RAM/4G RAM is still free). Any idea what else to check?

Thanks.

sni commented 5 years ago

could you double check if performance data processing is still enabled when that happens? It can be disabled on the "Process Info" page. Or check the logs for external commands in that direction.

VladimirBilik commented 5 years ago

Processing is enabled - other services are OK at the same time. Log does not show any changes on this attribute.

I have another observation - after change in service state the performance data started to be processed again. Debug info: 1) at 1544302037 [Sat Dec 8 21:47:17 CET 2018] the service was in OK state. Performance data was not processed at this time:

[1544302037.528470] [016.0] [pid=29143] Check of service 'ping 10.130.241.129' on host 'gwco.dc' (id=610) was overridden by a module
[1544302037.591615] [016.2] [pid=29143] Processing check result for service 'ping 10.130.241.129' on host 'gwco.dc'
[1544302037.591648] [016.0] [pid=29143] ** Handling check result for service 'ping 10.130.241.129' on host 'gwco.dc' from 'Mod-Gearman Worker'...
[1544302037.591658] [016.1] [pid=29143] HOST: gwco.dc, SERVICE: ping 10.130.241.129, CHECK TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, EXITED OK: Yes, RETURN CODE: 0, OUTPUT: OK - 10.130.241.129: rta 1.434ms, lost 0%|rta=1.434ms;20.000;50.000;0; pl=0%;5;15;; rtmax=2.037ms;;;; rtmin=1.315ms;;;; 

[1544302037.591681] [016.2] [pid=29143] Parsing check output...
[1544302037.591689] [016.2] [pid=29143] Short Output: OK - 10.130.241.129: rta 1.434ms, lost 0%
[1544302037.591697] [016.2] [pid=29143] Long Output:  NULL
[1544302037.591705] [016.2] [pid=29143] Perf Data:    rta=1.434ms;20.000;50.000;0; pl=0%;5;15;; rtmax=2.037ms;;;; rtmin=1.315ms;;;;
[1544302037.591713] [016.2] [pid=29143] ST: HARD  CA: 1  MA: 5  CS: 0  LS: 0  LHS: 0
[1544302037.591722] [016.1] [pid=29143] Service is OK.
[1544302037.591729] [016.1] [pid=29143] Service did not change state.
[1544302037.591740] [064.1] [pid=29143] Making callbacks (type 6)...
[1544302037.591753] [064.2] [pid=29143] Callback #1 (type 6) return code = 0
[1544302037.591764] [064.2] [pid=29143] Callback #2 (type 6) return code = 0
[1544302037.591774] [064.1] [pid=29143] Making callbacks (type 13)...
[1544302037.591783] [016.1] [pid=29143] Checking service 'ping 10.130.241.129' on host 'gwco.dc' for flapping...
[1544302037.591792] [016.2] [pid=29143] LFT=10.00, HFT=40.00, CPC=0.00, PSC=0.00%
[1544302037.591804] [016.1] [pid=29143] Service is not flapping (0.00% state change).
[1544302037.591813] [016.1] [pid=29143] Checking host 'gwco.dc' for flapping...
[1544302037.591821] [016.2] [pid=29143] LFT=10.00, HFT=40.00, CPC=0.00, PSC=0.00%
[1544302037.591831] [016.1] [pid=29143] Host is not flapping (0.00% state change).

2) at 1544302637 [Sat Dec 8 21:57:17 CET 2018] status of service has been changed to WARNING and performance data were processed

[1544302637.526849] [016.0] [pid=29143] Check of service 'ping 10.130.241.129' on host 'gwco.dc' (id=610) was overridden by a module
[1544302639.784929] [016.2] [pid=29143] Processing check result for service 'ping 10.130.241.129' on host 'gwco.dc'
[1544302639.784985] [016.0] [pid=29143] ** Handling check result for service 'ping 10.130.241.129' on host 'gwco.dc' from 'Mod-Gearman Worker'...
[1544302639.784996] [016.1] [pid=29143] HOST: gwco.dc, SERVICE: ping 10.130.241.129, CHECK TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, EXITED OK: Yes, RETURN CODE: 1, OUTPUT: WARNING - 10.130.241.129: rta 2.900ms, lost 10%|rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;; 

[1544302639.785020] [016.2] [pid=29143] Parsing check output...
[1544302639.785030] [016.2] [pid=29143] Short Output: WARNING - 10.130.241.129: rta 2.900ms, lost 10%
[1544302639.785038] [016.2] [pid=29143] Long Output:  NULL
[1544302639.785046] [016.2] [pid=29143] Perf Data:    rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;
[1544302639.785055] [016.2] [pid=29143] ST: HARD  CA: 1  MA: 5  CS: 1  LS: 0  LHS: 0
[1544302639.785063] [016.2] [pid=29143] Service has changed state since last check!
[1544302639.785071] [016.1] [pid=29143] Service is in a non-OK state!
[1544302639.785079] [016.1] [pid=29143] Host is currently UP, so we'll recheck its state to make sure...
[1544302639.785092] [064.1] [pid=29143] Making callbacks (type 12)...
[1544302639.785109] [064.2] [pid=29143] Callback #1 (type 12) return code = 0
[1544302639.785119] [016.1] [pid=29143] Current/Max Attempt(s): 1/5
[1544302639.785127] [016.1] [pid=29143] Host is UP, so we'll retry the service check...
[1544302639.785190] [064.1] [pid=29143] Making callbacks (type 2)...
[1544302639.785224] [064.2] [pid=29143] Callback #1 (type 2) return code = 0
[1544302639.785238] [064.1] [pid=29143] Making callbacks (type 23)...
[1544302639.785248] [064.2] [pid=29143] Callback #1 (type 23) return code = 0
[1544302639.785262] [064.1] [pid=29143] Making callbacks (type 13)...
[1544302639.785275] [064.1] [pid=29143] Making callbacks (type 6)...
[1544302639.785286] [064.2] [pid=29143] Callback #1 (type 6) return code = 0
[1544302639.785297] [064.2] [pid=29143] Callback #2 (type 6) return code = 0
[1544302639.785306] [064.1] [pid=29143] Making callbacks (type 13)...
[1544302639.785315] [016.1] [pid=29143] Checking service 'ping 10.130.241.129' on host 'gwco.dc' for flapping...
[1544302639.785323] [016.1] [pid=29143] Checking host 'gwco.dc' for flapping...
[1544302639.785331] [016.2] [pid=29143] LFT=10.00, HFT=40.00, CPC=0.00, PSC=0.00%
[1544302639.785344] [016.1] [pid=29143] Host is not flapping (0.00% state change).
[1544302639.785357] [016.2] [pid=29143] Raw service performance data file output: DATATYPE::SERVICEPERFDATA     TIMET::$TIMET$  HOSTNAME::$HOSTNAME$    SERVICEDESC::$SERVICEDESC$      SERVICEPERFDATA::$SERVICEPERFDATA$      SERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$    HOSTSTATE::$HOSTSTATE$   HOSTSTATETYPE::$HOSTSTATETYPE$  SERVICESTATE::$SERVICESTATE$    SERVICESTATETYPE::$SERVICESTATETYPE$    SERVICEOUTPUT::$SERVICEOUTPUT$

[1544302639.785380] [2048.1] [pid=29143] **** BEGIN MACRO PROCESSING ***********
[1544302639.785388] [2048.1] [pid=29143] Processing: 'DATATYPE::SERVICEPERFDATA TIMET::$TIMET$  HOSTNAME::$HOSTNAME$    SERVICEDESC::$SERVICEDESC$      SERVICEPERFDATA::$SERVICEPERFDATA$      SERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$      HOSTSTATE::$HOSTSTATE$  HOSTSTATETYPE::$HOSTSTATETYPE$ SERVICESTATE::$SERVICESTATE$    SERVICESTATETYPE::$SERVICESTATETYPE$    SERVICEOUTPUT::$SERVICEOUTPUT$
'
[1544302639.785397] [2048.2] [pid=29143]   Processing part: 'DATATYPE::SERVICEPERFDATA  TIMET::'
[1544302639.785406] [2048.2] [pid=29143]   Not currently in macro.  Running output (33): 'DATATYPE::SERVICEPERFDATA     TIMET::'
[1544302639.785414] [2048.2] [pid=29143]   Processing part: 'TIMET'
[1544302639.785425] [2048.2] [pid=29143]   macros[11] (TIMET) match.
[1544302639.785434] [2048.2] [pid=29143]   Processed 'TIMET', Free: 1
[1544302639.785442] [2048.2] [pid=29143]   Processed 'TIMET', Free: 1,  Cleaning options: 0
[1544302639.785450] [2048.2] [pid=29143]   Uncleaned macro.  Running output (43): 'DATATYPE::SERVICEPERFDATA    TIMET::1544302639'
[1544302639.785458] [2048.2] [pid=29143]   Just finished macro.  Running output (43): 'DATATYPE::SERVICEPERFDATA        TIMET::1544302639'
[1544302639.785466] [2048.2] [pid=29143]   Processing part: '   HOSTNAME::'
[1544302639.785473] [2048.2] [pid=29143]   Not currently in macro.  Running output (54): 'DATATYPE::SERVICEPERFDATA     TIMET::1544302639       HOSTNAME::'
[1544302639.785481] [2048.2] [pid=29143]   Processing part: 'HOSTNAME'
[1544302639.785489] [2048.2] [pid=29143]   macros[0] (HOSTNAME) match.
[1544302639.785498] [2048.2] [pid=29143]   Processed 'HOSTNAME', Free: 0
[1544302639.785505] [2048.2] [pid=29143]   Processed 'HOSTNAME', Free: 0,  Cleaning options: 0
[1544302639.785514] [2048.2] [pid=29143]   Uncleaned macro.  Running output (71): 'DATATYPE::SERVICEPERFDATA    TIMET::1544302639       HOSTNAME::gwco.dc'
[1544302639.785521] [2048.2] [pid=29143]   Just finished macro.  Running output (71): 'DATATYPE::SERVICEPERFDATA        TIMET::1544302639       HOSTNAME::gwco.dc'
[1544302639.785529] [2048.2] [pid=29143]   Processing part: '   SERVICEDESC::'
[1544302639.785537] [2048.2] [pid=29143]   Not currently in macro.  Running output (85): 'DATATYPE::SERVICEPERFDATA     TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::'
[1544302639.785545] [2048.2] [pid=29143]   Processing part: 'SERVICEDESC'
[1544302639.785553] [2048.2] [pid=29143]   macros[3] (SERVICEDESC) match.
[1544302639.785561] [2048.2] [pid=29143]   Processed 'SERVICEDESC', Free: 0
[1544302639.785569] [2048.2] [pid=29143]   Processed 'SERVICEDESC', Free: 0,  Cleaning options: 0
[1544302639.785577] [2048.2] [pid=29143]   Uncleaned macro.  Running output (104): 'DATATYPE::SERVICEPERFDATA   TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129'
[1544302639.785585] [2048.2] [pid=29143]   Just finished macro.  Running output (104): 'DATATYPE::SERVICEPERFDATA       TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129'
[1544302639.785593] [2048.2] [pid=29143]   Processing part: '   SERVICEPERFDATA::'
[1544302639.785601] [2048.2] [pid=29143]   Not currently in macro.  Running output (122): 'DATATYPE::SERVICEPERFDATA    TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129        SERVICEPERFDATA::'
[1544302639.785610] [2048.2] [pid=29143]   Processing part: 'SERVICEPERFDATA'
[1544302639.785618] [2048.2] [pid=29143]   macros[19] (SERVICEPERFDATA) match.
[1544302639.785625] [2048.2] [pid=29143]   Processed 'SERVICEPERFDATA', Free: 0
[1544302639.785633] [2048.2] [pid=29143]   Processed 'SERVICEPERFDATA', Free: 0,  Cleaning options: 0
[1544302639.785642] [2048.2] [pid=29143]   Uncleaned macro.  Running output (200): 'DATATYPE::SERVICEPERFDATA   TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129        SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;'
[1544302639.785674] [2048.2] [pid=29143]   Just finished macro.  Running output (200): 'DATATYPE::SERVICEPERFDATA       TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129        SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;'
[1544302639.785683] [2048.2] [pid=29143]   Processing part: '   SERVICECHECKCOMMAND::'
[1544302639.785692] [2048.2] [pid=29143]   Not currently in macro.  Running output (222): 'DATATYPE::SERVICEPERFDATA    TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129        SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;        SERVICECHECKCOMMAND::'
[1544302639.785700] [2048.2] [pid=29143]   Processing part: 'SERVICECHECKCOMMAND'
[1544302639.785708] [2048.2] [pid=29143]   macros[60] (SERVICECHECKCOMMAND) match.
[1544302639.785716] [2048.2] [pid=29143]   Processed 'SERVICECHECKCOMMAND', Free: 0
[1544302639.785724] [2048.2] [pid=29143]   Processed 'SERVICECHECKCOMMAND', Free: 0,  Cleaning options: 0
[1544302639.785732] [2048.2] [pid=29143]   Uncleaned macro.  Running output (265): 'DATATYPE::SERVICEPERFDATA   TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129        SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;        SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2'
[1544302639.785741] [2048.2] [pid=29143]   Just finished macro.  Running output (265): 'DATATYPE::SERVICEPERFDATA       TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129        SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;        SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2'
[1544302639.785748] [2048.2] [pid=29143]   Processing part: '   HOSTSTATE::'
[1544302639.785756] [2048.2] [pid=29143]   Not currently in macro.  Running output (277): 'DATATYPE::SERVICEPERFDATA    TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129        SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;        SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2        HOSTSTATE::'
[1544302639.785764] [2048.2] [pid=29143]   Processing part: 'HOSTSTATE'
[1544302639.785772] [2048.2] [pid=29143]   macros[26] (HOSTSTATE) match.
[1544302639.785781] [2048.2] [pid=29143]   Processed 'HOSTSTATE', Free: 0
[1544302639.785788] [2048.2] [pid=29143]   Processed 'HOSTSTATE', Free: 0,  Cleaning options: 0
[1544302639.785797] [2048.2] [pid=29143]   Uncleaned macro.  Running output (279): 'DATATYPE::SERVICEPERFDATA   TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129        SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;        SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2        HOSTSTATE::UP'
[1544302639.785805] [2048.2] [pid=29143]   Just finished macro.  Running output (279): 'DATATYPE::SERVICEPERFDATA       TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129        SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;        SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2        HOSTSTATE::UP'
[1544302639.785813] [2048.2] [pid=29143]   Processing part: '   HOSTSTATETYPE::'
[1544302639.785821] [2048.2] [pid=29143]   Not currently in macro.  Running output (295): 'DATATYPE::SERVICEPERFDATA    TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129        SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;        SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2        HOSTSTATE::UP   HOSTSTATETYPE::'
[1544302639.785829] [2048.2] [pid=29143]   Processing part: 'HOSTSTATETYPE'
[1544302639.785837] [2048.2] [pid=29143]   macros[41] (HOSTSTATETYPE) match.
[1544302639.785844] [2048.2] [pid=29143]   Processed 'HOSTSTATETYPE', Free: 0
[1544302639.785857] [2048.2] [pid=29143]   Processed 'HOSTSTATETYPE', Free: 0,  Cleaning options: 0
[1544302639.785867] [2048.2] [pid=29143]   Uncleaned macro.  Running output (299): 'DATATYPE::SERVICEPERFDATA   TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129        SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;        SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2        HOSTSTATE::UP   HOSTSTATETYPE::HARD'
[1544302639.785875] [2048.2] [pid=29143]   Just finished macro.  Running output (299): 'DATATYPE::SERVICEPERFDATA       TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129        SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;        SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2        HOSTSTATE::UP   HOSTSTATETYPE::HARD'
[1544302639.785883] [2048.2] [pid=29143]   Processing part: '   SERVICESTATE::'
[1544302639.785891] [2048.2] [pid=29143]   Not currently in macro.  Running output (314): 'DATATYPE::SERVICEPERFDATA    TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129        SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;        SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2        HOSTSTATE::UP   HOSTSTATETYPE::HARD     SERVICESTATE::'
[1544302639.785899] [2048.2] [pid=29143]   Processing part: 'SERVICESTATE'
[1544302639.785907] [2048.2] [pid=29143]   macros[4] (SERVICESTATE) match.
[1544302639.785915] [2048.2] [pid=29143]   Processed 'SERVICESTATE', Free: 0
[1544302639.785922] [2048.2] [pid=29143]   Processed 'SERVICESTATE', Free: 0,  Cleaning options: 0
[1544302639.785930] [2048.2] [pid=29143]   Uncleaned macro.  Running output (321): 'DATATYPE::SERVICEPERFDATA   TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129        SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;        SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2        HOSTSTATE::UP   HOSTSTATETYPE::HARD     SERVICESTATE::WARNING'
[1544302639.785938] [2048.2] [pid=29143]   Just finished macro.  Running output (321): 'DATATYPE::SERVICEPERFDATA       TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129        SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;        SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2        HOSTSTATE::UP   HOSTSTATETYPE::HARD     SERVICESTATE::WARNING'
[1544302639.785946] [2048.2] [pid=29143]   Processing part: '   SERVICESTATETYPE::'
[1544302639.785954] [2048.2] [pid=29143]   Not currently in macro.  Running output (340): 'DATATYPE::SERVICEPERFDATA    TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129        SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;        SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2        HOSTSTATE::UP   HOSTSTATETYPE::HARD     SERVICESTATE::WARNING   SERVICESTATETYPE::'
[1544302639.785963] [2048.2] [pid=29143]   Processing part: 'SERVICESTATETYPE'
[1544302639.785971] [2048.2] [pid=29143]   macros[42] (SERVICESTATETYPE) match.
[1544302639.785979] [2048.2] [pid=29143]   Processed 'SERVICESTATETYPE', Free: 0
[1544302639.785986] [2048.2] [pid=29143]   Processed 'SERVICESTATETYPE', Free: 0,  Cleaning options: 0
[1544302639.785994] [2048.2] [pid=29143]   Uncleaned macro.  Running output (344): 'DATATYPE::SERVICEPERFDATA   TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129        SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;        SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2        HOSTSTATE::UP   HOSTSTATETYPE::HARD     SERVICESTATE::WARNING   SERVICESTATETYPE::SOFT'
[1544302639.786002] [2048.2] [pid=29143]   Just finished macro.  Running output (344): 'DATATYPE::SERVICEPERFDATA       TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129        SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;        SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2        HOSTSTATE::UP   HOSTSTATETYPE::HARD     SERVICESTATE::WARNING   SERVICESTATETYPE::SOFT'
[1544302639.786021] [2048.2] [pid=29143]   Processing part: '   SERVICEOUTPUT::'
[1544302639.786039] [2048.2] [pid=29143]   Not currently in macro.  Running output (360): 'DATATYPE::SERVICEPERFDATA    TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129        SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;        SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2        HOSTSTATE::UP   HOSTSTATETYPE::HARD     SERVICESTATE::WARNING   SERVICESTATETYPE::SOFT  SERVICEOUTPUT::'
[1544302639.786048] [2048.2] [pid=29143]   Processing part: 'SERVICEOUTPUT'
[1544302639.786056] [2048.2] [pid=29143]   macros[17] (SERVICEOUTPUT) match.
[1544302639.786064] [2048.2] [pid=29143]   Processed 'SERVICEOUTPUT', Free: 0
[1544302639.786072] [2048.2] [pid=29143]   Processed 'SERVICEOUTPUT', Free: 0,  Cleaning options: 0
[1544302639.786080] [2048.2] [pid=29143]   Uncleaned macro.  Running output (407): 'DATATYPE::SERVICEPERFDATA   TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129        SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;        SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2        HOSTSTATE::UP   HOSTSTATETYPE::HARD     SERVICESTATE::WARNING   SERVICESTATETYPE::SOFT  SERVICEOUTPUT::WARNING - 10.130.241.129: rta 2.900ms, lost 10%'
[1544302639.786089] [2048.2] [pid=29143]   Just finished macro.  Running output (407): 'DATATYPE::SERVICEPERFDATA       TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129        SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;        SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2        HOSTSTATE::UP   HOSTSTATETYPE::HARD     SERVICESTATE::WARNING   SERVICESTATETYPE::SOFT  SERVICEOUTPUT::WARNING - 10.130.241.129: rta 2.900ms, lost 10%'
[1544302639.786097] [2048.2] [pid=29143]   Processing part: '
'
[1544302639.786105] [2048.2] [pid=29143]   Not currently in macro.  Running output (408): 'DATATYPE::SERVICEPERFDATA    TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129        SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;        SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2        HOSTSTATE::UP   HOSTSTATETYPE::HARD     SERVICESTATE::WARNING   SERVICESTATETYPE::SOFT  SERVICEOUTPUT::WARNING - 10.130.241.129: rta 2.900ms, lost 10%
'
[1544302639.786113] [2048.1] [pid=29143]   Done.  Final output: 'DATATYPE::SERVICEPERFDATA      TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129        SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;        SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2        HOSTSTATE::UP   HOSTSTATETYPE::HARD     SERVICESTATE::WARNING   SERVICESTATETYPE::SOFT  SERVICEOUTPUT::WARNING - 10.130.241.129: rta 2.900ms, lost 10%
'
[1544302639.786121] [2048.1] [pid=29143] **** END MACRO PROCESSING *************
[1544302639.786129] [016.2] [pid=29143] Processed service performance data file output: DATATYPE::SERVICEPERFDATA       TIMET::1544302639       HOSTNAME::gwco.dc     SERVICEDESC::ping 10.130.241.129        SERVICEPERFDATA::rta=2.900ms;20.000;50.000;0; pl=10%;5;15;; rtmax=4.048ms;;;; rtmin=1.374ms;;;;        SERVICECHECKCOMMAND::check-icmp!10.130.241.129!20,5%!50,15%!10!2        HOSTSTATE::UP   HOSTSTATETYPE::HARD     SERVICESTATE::WARNING   SERVICESTATETYPE::SOFT  SERVICEOUTPUT::WARNING - 10.130.241.129: rta 2.900ms, lost 10%

From that time the performance data processing continues normally, even with the status of a service changed back to OK.

sni commented 5 years ago

from looking into the code, there is no reason why this could happen except the process performance data flag has been disabled.

nook24 commented 5 years ago

@VladimirBilik which application consumes your performance data file? And what is the target database? (Rrdtool, Graphite, etc... )?

VladimirBilik commented 5 years ago

I use pnp4nagios in Bulk Mode - config is:

process_performance_data=1

host_perfdata_file=/var/data/naemon/var-ramdisk/perfdata/host-perfdata
service_perfdata_file=/var/data/naemon/var-ramdisk/perfdata/service-perfdata

host_perfdata_file_mode=a
service_perfdata_file_mode=a

host_perfdata_file_template=DATATYPE::HOSTPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tHOSTPERFDATA::$HOSTPERFDATA$\tHOSTCHECKCOMMAND::$HOSTCHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tHOSTOUTPUT::$HOSTOUTPUT$
service_perfdata_file_template=DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\tSERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\tSERVICESTATETYPE::$SERVICESTATETYPE$\tSERVICEOUTPUT::$SERVICEOUTPUT$

host_perfdata_file_processing_interval=60
service_perfdata_file_processing_interval=60

host_perfdata_file_processing_command=process-host-perfdata-file
service_perfdata_file_processing_command=process-service-perfdata-file

This configuration worked perfectly until update to 1.0.8. I noticed a problem when I saw 'holes' in the graphs, like this: image I use ramdisk of size 1G for speedup storing performance data (/var/data/naemon/var-ramdisk/perfdata/service-perfdata) with enough freespace on it. It was first thing I checked. Should I try running PNP4Nagios in different mode? I.e. Bulk Mode with NPCD?

nook24 commented 5 years ago

Have you already tried to uses mod_gearman for processing of performance data? https://labs.consol.de/de/nagios/mod-gearman/index.html#_server_options https://docs.pnp4nagios.org/pnp-0.6/modes#gearman_mode

Fortunately I don't use NPCD, process_perfdata.pl and Naemons in-build process_performance_data for years anymore.

I always used Bulk Mode with NPCD. but you realy need to take care that NPCD is running, otherwise you will run out of disk inodes.

Did you already run Naemon in foreground to see if there will be any errors reported like so?

sudo -u naemon /bin/bash
/opt/naemon/bin/naemon /opt/naemon/etc/naemon/naemon.cfg

Maybe you see something like Unable to fork: Cannot allocate memory

VladimirBilik commented 5 years ago

I've just started naemon in foreground with screen session. I'll need some time to catch a service which stops writing performance data now. The last one will be probably OK after restart.

Regarding PNP with mod_gearman: it is a bit complicated to compile it in my environment, so I will try the NPCD.

According the Naemon source code there is nothing between calling

check_for_service_flapping()
check_for_host_flapping()

and

update_service_performance_data()

I'm looking into _update_service_performancedata. Giving assumption that process_performance_data=TRUE and svc->process_performance_data=TRUE in my static Naemon configuration, there is just one more check for _service_perfdata_process_emptyresults. I will set it temporary to TRUE. If it helps, I suppose there will be something in one of these conditions:

        if (!svc || !svc->perf_data || !*svc->perf_data) {
            return OK;
        }
        if ((!service_perfdata_file_template) && !service_perfdata_command) {
            return OK;
        }

What do you think?

nook24 commented 5 years ago

I quickly updated my Naemon demo system from 1.0.3 to 1.0.8. (I'am not using mod_gearman) I'm used to get my performance data from this boy: https://github.com/naemon/naemon-core/blob/1d420731922a0b916f08f511dcad6a44863d3b2f/src/naemon/checks_service.c#L1007-L1022 which is just a few lines above update_service_performance_data()

So, should be nearly impossible that performance data gets vanished between those lines.

My system is now running at Naemon 1.0.8 since Tue 2018-12-11 20:49:24 CET. I will check for missing performance data any now and than. The system is also publicly available if you want to take a look.

VladimirBilik commented 5 years ago

My assumption about settings of _service_perfdata_process_emptyresults was wrong, I see that it is set to 1 by default. So the mentioned part of code with conditions is never run. Back to my observations:

VladimirBilik commented 5 years ago

Regarding testing NPCD mode: a version distributed with pnp4nagios (pnp4nagios-0.6.25-1.el7.x86_64) have npcdmod.o just for Nagios, which doesn't work with my Naemon:

Error: Could not load module '/usr/lib64/nagios/brokers/npcdmod.o' -> /usr/lib64/nagios/brokers/npcdmod.o: undefined symbol: schedule_new_event

so I have to compile a version of mine (a bit complication with maintenance)

nook24 commented 5 years ago

As far as I know does npcdmod.o only support Nagios 3. There is no version for Nagios 4 or Naemon available. From the docs:

Attention Starting with Nagios 4 the internal structures have changed so the start of the module will fail. So far there are no plans to support Nagios 4. Please select any other of the modes.

sni commented 5 years ago

while thats true, its not too complicated to get pnp running with naemon, we build the npcdmod module in omd for naemon like this: https://github.com/ConSol/omd/tree/labs/packages/pnp4nagios4

nook24 commented 5 years ago

@VladimirBilik were you able to find the root cause for the missing performance data?

VladimirBilik commented 5 years ago

Unfortunately no. I've tested npcdmod.o (@sni - I have pulled it from your OMD distribution) which worked fine. But I noticed that this module writes performance data in format:

DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\t\
SERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\t\
HOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\t\
SERVICESTATETYPE::$SERVICESTATETYPE$

while I need to have performance data in format defined in _service_perfdata_filetemplate:

DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\t\
SERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\t\
HOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\t\
SERVICESTATETYPE::$SERVICESTATETYPE$\tSERVICEOUTPUT::$SERVICEOUTPUT$

i.e. there is missing the $SERVICEOUTPUT$ macro in performance data produced by npcdmod. For testing purposes I changed my file template so there was no $SERVICEOUTPUT$ macro in it. After 24hours of observation there was just one occurence of missing data. In comparison to previous results this was much better behaviour. Unfortunately - I need to have $SERVICEOUTPUT$ in our graphs, so I switched back to previous definition of data file template.

Yesterday I have switched to new core 1.0.9 and performace data template with $SERVICEOUTPUT$ in it. Till now I noticed just one service with missing data, so I can tell it is much better.

But I still have no idea why it is happens. Most strange for me is that when some service stops getting data, after forcing check (just once) it is switched to normal behaviour with full data in _service_perfdatafile I thing it could have something to do with macro processing - when I change the data file template so it has no $SERVICEOUTPUT$, it behaves almost normally - just one exception in 24hours. Also - when I reduced the number of services defined in naemon to 100, I noticed no data loss. My system is not overloaded even with 2000 monitored services, it has enough memory and there is no killed processes recorded in system logs. @nook24 - I see that in your demo everything works fine, without data loss. Please, could you test your system with more services defined? For example 2000 of pings with collecting performance data at least for 24hours?

nook24 commented 5 years ago

@VladimirBilik I'm on vacation and don't have any access to some powerful hardware... The demo system is just a cheap VPS where i can't deploy such a load load.

Have you already tried to strip your system down and removed all broker modules?