naemon / naemon-core

Networks, Applications and Event Monitor
http://www.naemon.io/
GNU General Public License v2.0

naemon reload occasionally hangs on "livestatus: deinitializing" for many minutes #258

Open pvdputte opened 6 years ago

pvdputte commented 6 years ago

What happened earlier today is this:

Running these packages on Debian 9 stretch:

# dpkg -l | grep -e gearm -e naemon
ii  gearman-job-server                   0.33-6                         amd64        Job server for the Gearman distributed job queue
ii  gearman-tools                        0.33-6                         amd64        Tools for the Gearman distributed job queue
ii  libgearman7                          0.33-6                         amd64        Library providing Gearman client and worker functions
ii  libnaemon                            1.0.7                          amd64        Library for Naemon - common data files
ii  mod-gearman-module                   3.0.5                          amd64        Event broker module to distribute service checks.
ii  mod-gearman-tools                    3.0.5                          amd64        Tools for mod-gearman
ii  naemon-core                          1.0.7                          amd64        host/service/network monitoring and management system
ii  naemon-livestatus                    1.0.7                          amd64        contains the Naemon livestatus eventbroker module

plus the 3.0.6 build of mod_gearman_naemon.o because of the memory issue.

Naemon log for the first restart:

[1536314498] Caught 'Hangup', restarting...
[1536314502] Event broker module 'NERD' deinitialized successfully.
[1536314502] Event broker module '/usr/lib/mod_gearman/mod_gearman_naemon.o' deinitialized successfully.
[1536314502] livestatus: deinitializing
[1536315420] Event broker module '/usr/lib/naemon/naemon-livestatus/livestatus.so' deinitialized successfully.
[1536315420] Naemon 1.0.7 starting... (PID=866)

Deinitializing livestatus took 918 seconds. Normally this takes under a second according to the logs. On other problematic occasions it was around 150 seconds.

When reloading the config a few minutes later, it hung again:

[1536316297] Caught 'Hangup', restarting...
[1536316308] Event broker module 'NERD' deinitialized successfully.
[1536316308] Event broker module '/usr/lib/mod_gearman/mod_gearman_naemon.o' deinitialized successfully.
[1536316308] livestatus: deinitializing

At which point I started looking into the problem.

# top

top - 12:33:39 up 92 days, 12:54,  2 users,  load average: 28.94, 25.96, 14.33
Tasks: 447 total,   1 running, 446 sleeping,   0 stopped,   0 zombie
%Cpu(s): 99.3 us,  0.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem :  8179076 total,   660132 free,  1807072 used,  5711872 buff/cache
KiB Swap:  2097148 total,  2089896 free,     7252 used.  5979608 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  866 naemon    20   0 3815680 857792   7212 S 570.8 10.5   1086:57 naemon
 1400 naemon    20   0  564456 486404   1752 S   0.0  5.9   0:12.73 naemon
  465 gearman   20   0  552472 114784   3680 S   0.0  1.4 209176:34 gearmand
  273 root      20   0   92372  21168   9660 S   0.0  0.3  13:52.11 systemd-journal
[...]

Nothing to see in the livestatus log.

Strace shows this. Futex contention?

# strace -s200 -f -p 866
strace: Process 866 attached with 29 threads
strace: [ Process PID=12180 runs in x32 mode. ]
strace: [ Process PID=8547 runs in x32 mode. ]
[pid   866] futex(0x7f2f605549d0, FUTEX_WAIT, 6341, NULL^C

strace: Process 866 detached
 <detached ...>
strace: Process 6341 detached
strace: Process 6469 detached
strace: Process 6894 detached
strace: Process 8291 detached
strace: Process 8359 detached
strace: Process 8547 detached
strace: Process 9025 detached
strace: Process 10159 detached
strace: Process 10286 detached
strace: Process 10309 detached
strace: Process 10345 detached
strace: Process 10401 detached
strace: Process 10429 detached
strace: Process 10445 detached
strace: Process 10800 detached
strace: Process 10814 detached
strace: Process 10829 detached
strace: Process 10848 detached
strace: Process 10877 detached
strace: Process 10893 detached
strace: Process 10909 detached
strace: Process 10917 detached
strace: Process 10943 detached
strace: Process 10964 detached
strace: Process 11045 detached
strace: Process 11068 detached
strace: Process 11185 detached
strace: Process 12180 detached

Eventually I killed PID 866 and started naemon again.

Next time I'll try to attach gdb. I guess I had better install these packages first?

naemon-core-dbg/stable 1.0.8 amd64
naemon-livestatus-dbgsym/stable 1.0.8 amd64

Will that have a severe impact on performance?
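For what it's worth, the Debian dbg/dbgsym packages only install separate symbol files for gdb to read; they don't replace the running binary, so they shouldn't affect the daemon's performance. A sketch of grabbing backtraces from the hung process in batch mode (PID taken from the top output above, substitute the current one):

```shell
# Dump all thread backtraces from the hung naemon process, then detach.
# 866 is the PID from the `top` output above; it briefly pauses the process.
gdb -p 866 -batch -ex 'thread apply all bt' -ex detach
```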

This has happened at least four times in the past two months. On average we reload 30 times a day. Since I stopped doing a daily restart two days ago (cf. the gearman memory leak), I may run into this issue faster now.

I use "service naemon reload" to reload the config (triggered by cron on config updates). If there's a way to reload the config without deinitializing mod_gearman & livestatus I'd love to hear about it.

Greetings, Pascal

pvdputte commented 6 years ago

Forgot to mention that the host is a VMware guest with 6 vCpus.

jacobbaungard commented 6 years ago

Unfortunately I don't think there are any good ways of reloading Naemon. Usually we run "restart" instead of reload. Does Naemon actually handle SIGTERM and SIGHUP differently?

A better way to handle config reloading is definitely something we (OP5) are thinking about for the future. Ideally reloading the configuration should be done in a graceful way where Naemon can continue to operate, livestatus keeps working and so forth. This is probably a longer undertaking though, and not really much help for this specific issue.

nook24 commented 6 years ago

I always reload all my Naemon systems, which run from 10 to 100k service checks. Most of the time this works without any issues. From time to time Naemon crashes on a reload, but systemd will revive it :)

I don't use livestatus or mod_gearman, so unfortunately my comment will not help you solve your issue. I just want to say that reloading Naemon is technically possible.

jacobbaungard commented 6 years ago

Sure it is possible to reload, but afaik, except for actually starting a new process, there is not much difference between a restart and reload.

nook24 commented 6 years ago

I had a similar issue with Nagios Core while my broker module was loaded and exporting log entry data. Every time I tried to reload Nagios Core through systemd it took ages. The root cause was a bug in Nagios which was fixed in one of the latest versions: https://github.com/NagiosEnterprises/nagioscore/issues/442

However, maybe there is something similar in Naemon and livestatus? If possible, try to disable as many livestatus features as possible (I don't know whether features can be disabled; I don't use livestatus at all) and check if the issue is gone. If yes, enable features step by step and check when the issue comes back.

pvdputte commented 6 years ago

I just compared reload with restart and the restart takes 12 seconds longer in our case (41s instead of 29). At the current rate of 30 restarts/day that means 20 minutes of 'being in the dark' instead of 15. So I'll try to stick with reloading for now :-)

service naemon restart:

[1536322522] wproc: Socket to worker Core Worker 20151 broken, removing
[1536322522] wproc: Socket to worker Core Worker 20150 broken, removing
[1536322522] wproc: Socket to worker Core Worker 20152 broken, removing
[1536322522] wproc: Socket to worker Core Worker 20148 broken, removing
[1536322522] Caught 'Terminated', shutting down...
...
[1536322563] Successfully launched command file worker with pid 20363
[1536322563] TIMEPERIOD TRANSITION: 1005-1010;-1;0

service naemon reload:

[1536322571] Caught 'Hangup', restarting...
...
[1536322600] TIMEPERIOD TRANSITION: 1005-1010;-1;0

sni commented 6 years ago

I've seen such issues already in setups with naemon-core and livestatus. It might be some long-running queries, such as log requests. AFAIK, running queries are never terminated.
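For illustration (not from the thread: timestamp and socket path are examples, the path matching the xinetd config quoted later), a `log` table query over a wide time range is the kind of request that can run for minutes on a large log history:

```shell
# A livestatus 'log' table query spanning a long time range; on a big
# log history this can keep a client thread busy for minutes, which
# would then stall a pthread_join() during module deinit.
printf 'GET log\nFilter: time >= 1483225200\n\n' | unixcat /var/cache/naemon/live
```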

pvdputte commented 6 years ago

Some more info

*** naemon.log ***
[1536918869] Caught 'Hangup', restarting...
[1536918873] Event broker module 'NERD' deinitialized successfully.
[1536918873] Event broker module '/usr/lib/mod_gearman/mod_gearman_naemon.o' deinitialized successfully.
[1536918873] livestatus: deinitializing

*** livestatus.log ***
2018-09-14 09:15:01 Warning ignoring line containing only whitespace
2018-09-14 09:35:02 Warning ignoring line containing only whitespace
2018-09-14 10:35:01 Warning ignoring line containing only whitespace
2018-09-14 10:35:01 Warning ignoring line containing only whitespace
2018-09-14 10:35:01 Warning ignoring line containing only whitespace
2018-09-14 10:55:01 Warning ignoring line containing only whitespace
2018-09-14 11:15:01 Warning ignoring line containing only whitespace
2018-09-14 11:25:01 Warning ignoring line containing only whitespace
2018-09-14 11:40:01 Warning ignoring line containing only whitespace
2018-09-14 11:40:01 Warning ignoring line containing only whitespace

*** pstree ***
systemd,1
  `-naemon,12247 -d /etc/naemon/naemon.cfg
      |-naemon,12289 -d /etc/naemon/naemon.cfg
      |-{naemon},2326
      |-{naemon},2338
      |-{naemon},2524
      |-{naemon},2576
      |-{naemon},2648
      |-{naemon},2996
      |-{naemon},3970
      `-{naemon},4307

*** GDB ***
PID 12289
=========
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fef7fee5660 in __poll_nocancel () at ../sysdeps/unix/syscall-template.S:84
84      ../sysdeps/unix/syscall-template.S: No such file or directory.
#0  0x00007fef7fee5660 in __poll_nocancel () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007fef809de226 in launch_command_file_worker () from /usr/lib/naemon/libnaemon.so.0
#2  0x00005589ac3d371e in main ()

PID 12247
=========
[New LWP 2326]
[New LWP 2338]
[New LWP 2524]
[New LWP 2576]
[New LWP 2648]
[New LWP 2996]
[New LWP 3970]
[New LWP 4307]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fef7f97e6cd in pthread_join (threadid=140666635818752, thread_return=0x0) at pthread_join.c:90
90      pthread_join.c: No such file or directory.
#0  0x00007fef7f97e6cd in pthread_join (threadid=140666635818752, thread_return=0x0) at pthread_join.c:90
#1  0x00007fef7e3a1121 in nebmodule_deinit () from /usr/lib/naemon/naemon-livestatus/livestatus.so
#2  0x00007fef809f5118 in neb_unload_module () from /usr/lib/naemon/libnaemon.so.0
#3  0x00007fef809f5453 in neb_unload_all_modules () from /usr/lib/naemon/libnaemon.so.0
#4  0x00007fef80a0892a in cleanup () from /usr/lib/naemon/libnaemon.so.0
#5  0x00005589ac3d38a7 in main ()

I use Thruk's logfile cache (https://www.thruk.org/documentation/logfile-cache.html). I can't really think of other long-running queries.

Livestatus is accessed over TCP using xinetd. I.e.

# cat /etc/xinetd.d/naemon-livestatus-thruk 
service livestatus
{
    type                       = UNLISTED
    port                       = 6557
    socket_type                = stream
    protocol                   = tcp
    wait                       = no
    # limit to 100 connections per second. Disable 3 secs if above.
    cps                        = 100 3
    # Set the number of maximum allowed parallel instances of unixcat.
    # Please make sure that this value is at least as high as
    # the number of threads defined with num_client_threads in
    # etc/mk-livestatus/nagios.cfg
    instances                  = 500
    # limit the maximum number of simultaneous connections from
    # one source IP address
    per_source                 = 250
    # Disable TCP delay, makes connection more responsive
    flags                      = NODELAY
    user                       = naemon
    server                     = /usr/bin/unixcat
    server_args                = /var/cache/naemon/live
    # configure the IP address(es) of your Nagios server here:
    only_from                  = 127.0.0.1 217.21.181.25
    disable                    = no
}

(This is an old config file, hence the reference to mk-livestatus; that part does not actually apply, as Naemon has its own livestatus implementation.)

I'm running this config multiple times on different ports so I can keep track of the different livestatus API consumers. I'll check netstat output the next time it happens and see if there is something lingering.
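A quick sketch of that check during a hang (port and socket path taken from the xinetd config above; adjust per consumer port):

```shell
# Established TCP connections to one of the xinetd livestatus ports.
ss -tnp state established '( sport = :6557 )'
# Processes still holding the livestatus UNIX socket open.
fuser -v /var/cache/naemon/live
```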