naemon / naemon-core

Networks, Applications and Event Monitor
http://www.naemon.io/
GNU General Public License v2.0
151 stars 63 forks source link

Naemon reload fails silently randomly #443

Open Bryce-Souers opened 7 months ago

Bryce-Souers commented 7 months ago

I have an issue where sometimes a reload will seem to work, and then it will crash with no messages telling why.

I ran omd reload over 30 times, and then I hit this issue.

naemon.log

[1705509045] SERVICE ALERT: <REDACTED>;check_ap_ipv6;CRITICAL;SOFT;3;PING CRITICAL [fda9:ba49:dcb2:32a5:213:5009:1:9fa2] - Packet loss = 100%
[1705509050] Caught 'Hangup', restarting...
[1705509052] Retention data successfully saved.
[1705509052] Event broker module 'NERD' deinitialized successfully.
[1705509052] npcdmod: If you don't like me, I will go out! Bye.
[1705509052] Event broker module '/omd/sites/ssnomd/lib/npcdmod4.o' deinitialized successfully.
[1705509052] livestatus: deinitializing
[1705509052] livestatus: Logfile cache: flushing complete cache.
[1705509052] Event broker module '/omd/sites/ssnomd/lib/naemon/livestatus.o' deinitialized successfully.
[1705509053] Naemon 1.4.0 starting... (PID=2823702)
[1705509053] Local time is Wed Jan 17 16:30:53 UTC 2024
[1705509053] LOG VERSION: 2.0
[1705509053] qh: Socket '/omd/sites/ssnomd/var/naemon/naemon.qh' successfully initialized
[1705509053] nerd: Channel hostchecks registered successfully
[1705509053] nerd: Channel servicechecks registered successfully
[1705509053] nerd: Fully initialized and ready to rock!

debug.log

[1705509052.736467] [016.0] [pid=2823702] Service 'check_2days_uptime' on host 'REDACTED' handle_service_check_event()...
[1705509052.742485] [064.0] [pid=2823702] Attempting to unload module 'NERD': flags=1, reason=2
[1705509052.742527] [064.0] [pid=2823702] Module 'NERD' unloaded successfully.
[1705509052.742578] [064.0] [pid=2823702] Attempting to unload module '/omd/sites/ssnomd/lib/npcdmod4.o': flags=1, reason=2
[1705509052.742693] [064.0] [pid=2823702] Module '/omd/sites/ssnomd/lib/npcdmod4.o' unloaded successfully.
[1705509052.742744] [064.0] [pid=2823702] Attempting to unload module '/omd/sites/ssnomd/lib/naemon/livestatus.o': flags=1, reason=2
[1705509052.744596] [064.0] [pid=2823702] Module '/omd/sites/ssnomd/lib/naemon/livestatus.o' unloaded successfully.
[1705509053.382125] [064.1] [pid=2823702] Making callbacks (type 2)...
[1705509053.382187] [064.1] [pid=2823702] Making callbacks (type 2)...
[1705509053.382220] [064.1] [pid=2823702] Making callbacks (type 2)...
[1705509053.382251] [064.1] [pid=2823702] Making callbacks (type 2)...
[1705509053.382270] [4096.0] [pid=2823702] wproc: Successfully registered manager as @wproc with query handler
[1705509053.589750] [4096.0] [pid=2823702] wproc: Registry request: name=Core Worker 2897369;pid=2897369
[1705509053.590014] [4096.0] [pid=2823702] wproc: Registry request: name=Core Worker 2897370;pid=2897370
[1705509053.590135] [4096.0] [pid=2823702] wproc: Registry request: name=Core Worker 2897371;pid=2897371
[1705509053.629639] [4096.0] [pid=2823702] wproc: Registry request: name=Core Worker 2897372;pid=2897372
[1705509053.666278] [4096.0] [pid=2823702] wproc: Registry request: name=Core Worker 2897374;pid=2897374
[1705509053.672590] [4096.0] [pid=2823702] wproc: Registry request: name=Core Worker 2897373;pid=2897373

omd status

[ssnomd@lasssnpr01nag02 ~]$ omd status
rrdcached:      running
npcd:           running
naemon:         stopped
apache:         running
xinetd:         running
crontab:        running
-----------------------
Overall state:  partially running

ps aux

[ssnomd@lasssnpr01nag02 ~]$ ps aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0  11928  2848 ?        Ss   Jan16   0:00 bash /root/start.sh
root         451  0.0  0.0  22964  2936 ?        Ss   Jan16   0:00 /usr/sbin/crond
root        1383  0.0  0.0  11928  2720 ?        S    Jan16   0:00 /bin/sh /usr/sbin/apachectl -D FOREGROUND
root        1387  0.0  0.0 267260 14672 ?        S    Jan16   0:05 /usr/sbin/httpd -D FOREGROUND
apache      1400  0.0  0.0 268696  8424 ?        S    Jan16   0:00 /usr/sbin/httpd -D FOREGROUND
apache      1401  0.0  0.0 270072  8760 ?        S    Jan16   0:01 /usr/sbin/httpd -D FOREGROUND
apache      1402  0.0  0.0 2704096 12792 ?       Sl   Jan16   0:36 /usr/sbin/httpd -D FOREGROUND
apache      1403  0.0  0.0 2572960 10416 ?       Sl   Jan16   0:35 /usr/sbin/httpd -D FOREGROUND
apache      1418  0.0  0.0 2572960 12984 ?       Sl   Jan16   0:35 /usr/sbin/httpd -D FOREGROUND
apache    390476  0.0  0.0 2507424 10444 ?       Sl   Jan16   0:28 /usr/sbin/httpd -D FOREGROUND
root     2711500  0.0  0.0  19248  3616 pts/1    Ss   14:44   0:00 bash
root     2711611  0.0  0.0 115204  7284 pts/1    S    14:44   0:00 sudo su - ssnomd
root     2711612  0.0  0.0 109184  6164 pts/1    S    14:44   0:00 su - ssnomd
ssnomd   2711613  0.0  0.0  22508  4400 pts/1    S    14:44   0:00 -bash
root     2739377  0.0  0.0  19248  3688 pts/2    Ss   14:55   0:00 bash
root     2739563  0.0  0.0 115204  7272 pts/2    S    14:55   0:00 sudo su - ssnomd
root     2739564  0.0  0.0 109184  6332 pts/2    S    14:55   0:00 su - ssnomd
ssnomd   2739566  0.0  0.0  22496  4428 pts/2    S    14:55   0:00 -bash
root     2766725  0.0  0.0  19248  3600 pts/0    Ss   15:06   0:00 bash
root     2766786  0.0  0.0 109184  6172 pts/0    S    15:06   0:00 su - ssnomd
ssnomd   2766787  0.0  0.0  22388  4436 pts/0    S+   15:06   0:00 -bash
root     2808475  0.0  0.0  19248  3716 pts/3    Ss   15:22   0:00 bash
root     2817334  0.0  0.0  19248  3684 pts/4    Ss   15:27   0:00 bash
root     2817353  0.0  0.0 115204  7492 pts/4    S    15:27   0:00 sudo su - ssnomd
root     2817354  0.0  0.0 109184  6184 pts/4    S    15:27   0:00 su - ssnomd
ssnomd   2817355  0.0  0.0  22492  4396 pts/4    S+   15:27   0:00 -bash
root     2819589  0.0  0.0 115204  7400 pts/3    S    15:40   0:00 sudo su - ssnomd
root     2819590  0.0  0.0 109184  6112 pts/3    S    15:40   0:00 su - ssnomd
ssnomd   2819591  0.0  0.0  22388  4416 pts/3    S+   15:40   0:00 -bash
ssnomd   2823608  0.0  0.0 317892  8072 ?        Ssl  16:01   0:01 /omd/sites/ssnomd/bin/rrdcached -t 4 -w 3600 -z 1800 -f 7200 -s ssnomd -m 660 -l unix:/omd/sites/ssnomd/tmp/run/rrdcached.sock -p /omd/sites/ssnomd/tmp/rrdcached.pid -j /omd/sites/ssnomd/var/rrdcached -o /omd/sites/ssnomd/var/log/rrd
ssnomd   2823640  0.1  0.0 156184   380 ?        S    16:01   0:03 /omd/sites/ssnomd/bin/npcd -d -f /omd/sites/ssnomd/etc/pnp4nagios/npcd.cfg
ssnomd   2823760  0.0  0.0 118732  5936 ?        Ss   16:01   0:00 /usr/sbin/httpd -f /omd/sites/ssnomd/etc/apache/apache.conf
ssnomd   2823762  0.0  0.1 105612 30812 ?        S    16:01   0:00 perl -x /omd/sites/ssnomd/share/thruk/script/thruk_auth
ssnomd   2823763  0.0  0.0 118884  2596 ?        S    16:01   0:00 /usr/sbin/httpd -f /omd/sites/ssnomd/etc/apache/apache.conf
ssnomd   2823764  0.0  0.0 119400  5728 ?        S    16:01   0:00 /usr/sbin/httpd -f /omd/sites/ssnomd/etc/apache/apache.conf
ssnomd   2823832  0.0  0.4 162832 76892 ?        S    16:01   0:02 perl /omd/sites/ssnomd/share/thruk/script/thruk_fastcgi.pl
ssnomd   2823839  0.0  0.0 119640  5752 ?        S    16:01   0:00 /usr/sbin/httpd -f /omd/sites/ssnomd/etc/apache/apache.conf
ssnomd   2823920  0.0  0.0  55064  3400 ?        Ss   16:01   0:00 /omd/sites/ssnomd/var/tmp/xinetd -pidfile /omd/sites/ssnomd/tmp/run/xinetd.pid -filelog /omd/sites/ssnomd/var/log/xinetd.log -f /omd/sites/ssnomd/etc/xinetd.conf
ssnomd   2852767  0.0  0.0 118868  4860 ?        S    16:13   0:00 /usr/sbin/httpd -f /omd/sites/ssnomd/etc/apache/apache.conf
ssnomd   2895736  0.0  0.0  26000  1540 pts/1    S+   16:30   0:00 /usr/bin/coreutils --coreutils-prog-shebang=tail /usr/bin/tail -n 500 -f var/log/naemon.log
ssnomd   2898643  200  0.0  47644  3848 pts/2    R+   16:54   0:00 ps aux

Any ideas for what could be going wrong here?

michaelsbach commented 7 months ago

I suspect there is an issue with the worker process creation, as that's the last thing the debug.log is showing and it hangs there and then the process dies

@sni what say you?

sni commented 7 months ago

looks like naemon crashed during reload, i saw that occasionally in the wild as well but it wasn't easily reproducible. Which omd / naemon version is that in this case? It also might help to enable core file, this might reveal the origin of the issue.

nook24 commented 7 months ago

From the output @Bryce-Souers provided this is [1705509053] Naemon 1.4.0 starting... (PID=2823702)

So this is a version before the worker re-spwan feature which got introduced into 1.4.2. Where there any changes whining livestatus? There is also an other open issue #428

Bryce-Souers commented 7 months ago

looks like naemon crashed during reload, i saw that occasionally in the wild as well but it wasn't easily reproducible. Which omd / naemon version is that in this case? It also might help to enable core file, this might reveal the origin of the issue.

@sni

What do you mean by enable core file?

I have debugging on max as far as I know.

Bryce-Souers commented 7 months ago

I'd also like to add that this is running using the v5 of the omd DockerHub image