pvdputte opened this issue 6 years ago
Forgot to mention that the host is a VMware guest with 6 vCpus.
Unfortunately I don't think there are any good ways of reloading Naemon. Usually we run "restart" instead of reload. Does Naemon actually handle SIGTERM and SIGHUP differently?
A better way to handle config reloading is definitely something we (OP5) are thinking about for the future. Ideally reloading the configuration should be done in a graceful way where Naemon can continue to operate, livestatus keeps working and so forth. This is probably a longer undertaking though, and not really much help for this specific issue.
I always reload all my Naemon systems, which run anywhere from 10 to 100k service checks. Most of the time this works without any issues. From time to time Naemon crashes on a reload, but systemd will revive it :)
I don't use livestatus or mod_gearman. Unfortunately my comment won't help you solve your issue; I just want to say that reloading Naemon is technically possible.
Sure it is possible to reload, but AFAIK, apart from actually starting a new process, there is not much difference between a restart and a reload.
I had a similar issue with Nagios Core while my broker module was loaded and exporting log entry data. Every time I tried to reload Nagios Core through systemd it took ages. The root cause was a bug in Nagios which was fixed in one of the latest versions. https://github.com/NagiosEnterprises/nagioscore/issues/442
However, maybe there is something similar in Naemon and livestatus? If possible, try to disable as many livestatus features as possible (I don't know whether features can be disabled, since I don't use livestatus at all) and check if the issue is gone. If yes, enable features step by step and check when the issue comes back.
I just compared reload with restart and the restart takes 12 seconds longer in our case (41s instead of 29). At the current rate of 30 restarts/day that means 20 minutes of 'being in the dark' instead of 15. So I'll try to stick with reloading for now :-)
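As a quick sanity check on those numbers (my own arithmetic, using the 29 s and 41 s measurements above; integer minutes, so 14 rather than the rounded 15):

```shell
# blind time per day = reloads(or restarts)/day * seconds each
n=30
echo "reload:  $(( n * 29 / 60 )) min/day"   # 870 s  -> 14 min
echo "restart: $(( n * 41 / 60 )) min/day"   # 1230 s -> 20 min
```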
service naemon restart:
[1536322522] wproc: Socket to worker Core Worker 20151 broken, removing
[1536322522] wproc: Socket to worker Core Worker 20150 broken, removing
[1536322522] wproc: Socket to worker Core Worker 20152 broken, removing
[1536322522] wproc: Socket to worker Core Worker 20148 broken, removing
[1536322522] Caught 'Terminated', shutting down...
...
[1536322563] Successfully launched command file worker with pid 20363
[1536322563] TIMEPERIOD TRANSITION: 1005-1010;-1;0
service naemon reload:
[1536322571] Caught 'Hangup', restarting...
...
[1536322600] TIMEPERIOD TRANSITION: 1005-1010;-1;0
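So the two excerpts do take different code paths: "Caught 'Terminated'" is SIGTERM (full shutdown plus a fresh process), while "Caught 'Hangup'" is SIGHUP (in-place restart). A minimal sketch of what the service wrapper boils down to (assumptions: the PID file path varies by distro, and the `sig_for` helper is my own):

```shell
# map a service action to the signal Naemon handles for it
sig_for() {
  case "$1" in
    reload)  echo HUP ;;   # logs "Caught 'Hangup', restarting..."
    restart) echo TERM ;;  # logs "Caught 'Terminated', shutting down..."
    *)       echo "unknown action: $1" >&2; return 1 ;;
  esac
}

# e.g. what "service naemon reload" amounts to (PID file path may differ):
#   kill -"$(sig_for reload)" "$(cat /var/run/naemon/naemon.pid)"
```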
I've seen issues like this before in setups with naemon-core and livestatus. It might be caused by long-running queries, such as log requests. AFAIK running queries are never terminated early.
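If you want to test the long-running-query theory by hand, a minimal sketch (assumptions: the standard livestatus `GET log` table and `Filter:` syntax; the socket path used in this setup; the `build_log_query` helper is my own):

```shell
# build a Livestatus Query Language request for log entries newer
# than the given unix timestamp (log queries scan the archive and
# can run for a long time on big installations)
build_log_query() {
  printf 'GET log\nFilter: time >= %s\n\n' "$1"
}

# time it by hand against the live socket, e.g.:
#   build_log_query "$(date -d '30 days ago' +%s)" \
#     | unixcat /var/cache/naemon/live | wc -l
build_log_query 1536918873 | head -n 2
```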
Some more info
*** naemon.log ***
[1536918869] Caught 'Hangup', restarting...
[1536918873] Event broker module 'NERD' deinitialized successfully.
[1536918873] Event broker module '/usr/lib/mod_gearman/mod_gearman_naemon.o' deinitialized successfully.
[1536918873] livestatus: deinitializing
*** livestatus.log ***
2018-09-14 09:15:01 Warning ignoring line containing only whitespace
2018-09-14 09:35:02 Warning ignoring line containing only whitespace
2018-09-14 10:35:01 Warning ignoring line containing only whitespace
2018-09-14 10:35:01 Warning ignoring line containing only whitespace
2018-09-14 10:35:01 Warning ignoring line containing only whitespace
2018-09-14 10:55:01 Warning ignoring line containing only whitespace
2018-09-14 11:15:01 Warning ignoring line containing only whitespace
2018-09-14 11:25:01 Warning ignoring line containing only whitespace
2018-09-14 11:40:01 Warning ignoring line containing only whitespace
2018-09-14 11:40:01 Warning ignoring line containing only whitespace
*** pstree ***
systemd,1
`-naemon,12247 -d /etc/naemon/naemon.cfg
|-naemon,12289 -d /etc/naemon/naemon.cfg
|-{naemon},2326
|-{naemon},2338
|-{naemon},2524
|-{naemon},2576
|-{naemon},2648
|-{naemon},2996
|-{naemon},3970
`-{naemon},4307
*** GDB ***
PID 12289
=========
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fef7fee5660 in __poll_nocancel () at ../sysdeps/unix/syscall-template.S:84
84 ../sysdeps/unix/syscall-template.S: No such file or directory.
#0 0x00007fef7fee5660 in __poll_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1 0x00007fef809de226 in launch_command_file_worker () from /usr/lib/naemon/libnaemon.so.0
#2 0x00005589ac3d371e in main ()
PID 12247
=========
[New LWP 2326]
[New LWP 2338]
[New LWP 2524]
[New LWP 2576]
[New LWP 2648]
[New LWP 2996]
[New LWP 3970]
[New LWP 4307]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fef7f97e6cd in pthread_join (threadid=140666635818752, thread_return=0x0) at pthread_join.c:90
90 pthread_join.c: No such file or directory.
#0 0x00007fef7f97e6cd in pthread_join (threadid=140666635818752, thread_return=0x0) at pthread_join.c:90
#1 0x00007fef7e3a1121 in nebmodule_deinit () from /usr/lib/naemon/naemon-livestatus/livestatus.so
#2 0x00007fef809f5118 in neb_unload_module () from /usr/lib/naemon/libnaemon.so.0
#3 0x00007fef809f5453 in neb_unload_all_modules () from /usr/lib/naemon/libnaemon.so.0
#4 0x00007fef80a0892a in cleanup () from /usr/lib/naemon/libnaemon.so.0
#5 0x00005589ac3d38a7 in main ()
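For the record, backtraces like the ones above can be captured non-interactively. A sketch that just builds the command line (the PIDs are from this incident; `gdb -p`, `-batch` and `-ex` are standard gdb flags; the `bt_cmd` helper is my own):

```shell
# build the batch-mode gdb invocation for a given PID
bt_cmd() {
  printf "gdb -p %s -batch -ex 'thread apply all bt'\n" "$1"
}

# run it as root while the daemon hangs, e.g.:
#   eval "$(bt_cmd 12247)" > /tmp/naemon-bt-12247.txt 2>&1
bt_cmd 12247
```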
I use Thruk's logfile cache https://www.thruk.org/documentation/logfile-cache.html I can't really think of other long-running queries.
Livestatus is accessed over TCP using xinetd. I.e.
# cat /etc/xinetd.d/naemon-livestatus-thruk
service livestatus
{
type = UNLISTED
port = 6557
socket_type = stream
protocol = tcp
wait = no
# limit to 100 connections per second. Disable 3 secs if above.
cps = 100 3
# Set the number of maximum allowed parallel instances of unixcat.
# Please make sure that this value is at least as high as
# the number of threads defined with num_client_threads in
# etc/mk-livestatus/nagios.cfg
instances = 500
# limit the maximum number of simultaneous connections from
# one source IP address
per_source = 250
# Disable TCP delay, makes connection more responsive
flags = NODELAY
user = naemon
server = /usr/bin/unixcat
server_args = /var/cache/naemon/live
# configure the IP address(es) of your Nagios server here:
only_from = 127.0.0.1 217.21.181.25
disable = no
}
(Old config file, hence the reference to mk-livestatus, which doesn't actually apply since Naemon ships its own livestatus implementation.)
I'm running this config multiple times on different ports so I can keep track of the different livestatus API consumers. I'll check netstat output the next time it happens and see if there is something lingering.
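To make that check easy to repeat per port, a small sketch that filters established connections by local port (assumptions: ports 6557 and up as in the xinetd config above; `ss` from iproute2 as the netstat replacement; the `filter_port` helper is my own):

```shell
# keep only lines whose local address (column 4 of `ss -tn` output)
# ends in :<port>
filter_port() { awk -v p=":$1\$" '$4 ~ p'; }

# usage while the hang is happening, once per consumer port:
#   ss -tn state established | filter_port 6557
```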
What happened earlier today is this:
Running these packages on Debian 9 stretch:
and also the 3.0.6 version of mod_gearman_naemon.o b/c of the memory issue.
Naemon log for the first restart:
Deinitializing livestatus took 918 seconds. Normally it finishes within the same log second (0 seconds elapsed according to the timestamps). On other problematic occasions it was around 150 seconds.
When reloading the config a few minutes later, it hung again:
At which point I started looking into the problem.
Nothing to see in the livestatus log.
Strace shows this. Futex contention?
Eventually I killed PID 866 and started naemon again.
Next time I'll try to attach gdb. I guess I'd better install these packages first?
Will that have a severe impact on performance?
This has happened at least four times in the past two months. On average we reload 30 times a day. Since I stopped doing the daily restart two days ago (cf. the gearman memory leak), I may run into this issue faster now.
I use "service naemon reload" to reload the config (triggered by cron on config updates). If there's a way to reload the config without deinitializing mod_gearman & livestatus I'd love to hear about it.
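One small mitigation on the cron side would be to verify the config before sending the reload, so a broken config never triggers the slow deinit/init cycle at all. A sketch (the `naemon -v <config>` pre-flight check is standard; the wrapper function and its injectable commands are my own, written that way so the logic can be exercised without a live daemon):

```shell
# reload only when the config verifies cleanly
reload_if_valid() {
  verify=$1   # e.g. "naemon -v /etc/naemon/naemon.cfg"
  reload=$2   # e.g. "service naemon reload"
  if $verify >/dev/null 2>&1; then
    $reload
  else
    echo "config check failed, skipping reload" >&2
    return 1
  fi
}

# cron usage on this box might be:
#   reload_if_valid "naemon -v /etc/naemon/naemon.cfg" "service naemon reload"
```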
Greetings, Pascal