opnsense / core

OPNsense GUI, API and systems backend
https://opnsense.org/
BSD 2-Clause "Simplified" License
3.36k stars 753 forks source link

logger spins, burning lots of CPU #6093

Closed ckuethe closed 1 year ago

ckuethe commented 2 years ago

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

Describe the bug

Occasionally logger spins and burns a lot of cpu. Also lots of aborts on signal 6.

Not sure when this started, I think I first noticed it in 22.7.2, got annoyed enough to wrap it with a ulimit in 22.7.4, and after updating to 22.7.6 decided it's maybe worth a bug report.

To Reproduce

No explicit action, the offending calls to logger are internally generated while renewing my external DHCP lease.

Expected behavior

logger logs a message, and then quietly exits.

Describe alternatives you considered

A workaround is to limit logger's cputime to just a couple of seconds. It shouldn't need any more than that to log a simple message.

--- /usr/local/etc/inc/interfaces.inc.orig  2022-10-11 23:57:35.000000000 -0700
+++ /usr/local/etc/inc/interfaces.inc   2022-10-17 09:49:58.025778000 -0700
@@ -2823,11 +2823,11 @@
     $dhcp6cscript = <<<EOF
 #!/bin/sh
 if [ -n "{$dhcp6_debug}" ]; then
-    /usr/bin/logger -t dhcp6c "dhcp6c \$REASON on {$wanif}"
+    (ulimit -t 2 ; /usr/bin/logger -t dhcp6c "dhcp6c \$REASON on {$wanif}")
 fi
 case \$REASON in
 INFOREQ|REQUEST)
-    /usr/bin/logger -t dhcp6c "dhcp6c \$REASON on {$wanif} - running newipv6"
+    (ulimit -t 2 ; /usr/bin/logger -t dhcp6c "dhcp6c \$REASON on {$wanif} - running newipv6")
     ARGS=
     for NAMESERVER in \${new_domain_name_servers}; do
         ARGS="\${ARGS} -a \${NAMESERVER}"
@@ -2842,7 +2842,7 @@
     /usr/local/sbin/configctl -d interface newipv6 {$wanif}
     ;;
 EXIT|RELEASE)
-    /usr/bin/logger -t dhcp6c "dhcp6c \$REASON on {$wanif} - running dns reload"
+    (ulimit -t 2 ; /usr/bin/logger -t dhcp6c "dhcp6c \$REASON on {$wanif} - running dns reload")
     /usr/local/sbin/ifctl -i ${wanif} -6nd
     /usr/local/sbin/ifctl -i ${wanif} -6sd
     /usr/local/sbin/ifctl -i ${wanif} -6pd
@@ -2910,10 +2910,10 @@
     fi
 fi
 if [ -f /var/run/dhcp6c.pid ]; then
-    /usr/bin/logger -t dhcp6c "RTSOLD script - Sending SIGHUP to dhcp6c"
+    (ulimit -t 2 ; /usr/bin/logger -t dhcp6c "RTSOLD script - Sending SIGHUP to dhcp6c")
     /bin/pkill -HUP -F /var/run/dhcp6c.pid
 else
-    /usr/bin/logger -t dhcp6c "RTSOLD script - Starting dhcp6 client"
+    (ulimit -t 2 ; /usr/bin/logger -t dhcp6c "RTSOLD script - Starting dhcp6 client")
     {$dhcp6ccommand}
 fi

Screenshots From top

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
86852 root          1 100    0    12M  2168K CPU0     0  20.9H  97.94% logger

From ps

root    86852 98.8  0.1  12796   2168  -  R    12:19   1239:44.63 logger: zygote (logger)
root    86850  0.0  0.1  12796   2308  -  IC   12:19      0:00.00 /usr/bin/logger -t dhcp6c dhcp6c RELEASE on igb0 - running dns reload
root    86851  0.0  0.1  12796   2172  -  I    12:19      0:00.00 /usr/bin/logger -t dhcp6c dhcp6c RELEASE on igb0 - running dns reload

Relevant log files

Lots of logger aborts in dmesg.

...
pid 82046 (logger), jid 0, uid 0: exited on signal 6 (core dumped)
pid 82079 (logger), jid 0, uid 0: exited on signal 6 (core dumped)
pid 82112 (logger), jid 0, uid 0: exited on signal 6 (core dumped)
pid 82113 (logger), jid 0, uid 0: exited on signal 6 (core dumped)
pid 82150 (logger), jid 0, uid 0: exited on signal 6 (core dumped)
pid 82151 (logger), jid 0, uid 0: exited on signal 6 (core dumped)

No other processes dumping core

root@fw:/ # dmesg | grep dumped | grep logger | wc -l
    1173
root@fw:/ # dmesg | grep dumped | grep -v logger | wc -l
       0

GDB isn't super helpful.

Program terminated with signal SIGBUS, Bus error.
Address not present.
#0  0x00000008014740dc in buf_recv () from /lib/libnv.so.0
(gdb) bt
#0  0x00000008014740dc in buf_recv () from /lib/libnv.so.0
#1  0x00000008014760c6 in nvlist_recv () from /lib/libnv.so.0
#2  0x000000080106d551 in service_message () from /lib/libcasper.so.1
#3  0x000000080106cf7e in casper_main_loop () from /lib/libcasper.so.1
#4  0x000000080106c4ef in cap_init () from /lib/libcasper.so.1
#5  0x0000000001023f23 in ?? ()
#6  0x0000000001023c9d in ?? ()
#7  0x0000000000000000 in ?? ()

Additional context

I have syslog set to forward all log to an internal log collector

Environment

Software version used and hardware type if relevant, e.g.: OPNsense 22.7.6-amd64 Protectli FW2B Intel(R) Celeron(R) CPU J3060 @ 1.60GHz (2 cores, 2 threads) 2x IGB: Intel(R) I211 (Copper)

AdSchellevis commented 2 years ago

zygote ? any non standard packages installed?

ckuethe commented 2 years ago

The only thing I didn't install from the "System: Firmware > plugins" was gdb, and that was just I could get that backtrace.

Zygote seems to be related to libcasper, eg.

ckuethe commented 2 years ago

On that note, these are the plugins I have installed.

Screenshot_2022-10-17_11-44-36

ckuethe commented 1 year ago

Still present on

OPNsense 22.7.8-amd64
FreeBSD 13.1-RELEASE-p3
OpenSSL 1.1.1s 1 Nov 2022
AdSchellevis commented 1 year ago

It's highly likely not related to something in core, you could always try to "truss" the process to see what it is trying to open or send, which might leave a pointer about which application is responsible here (truss -p <pid>)

OPNsense-bot commented 1 year ago

This issue has been automatically timed-out (after 180 days of inactivity).

For more information about the policies for this repository, please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.

If someone wants to step up and work on this issue, just let us know, so we can reopen the issue and assign an owner to it.