hamelg commented 3 years ago

Maintainer: Jo-Philipp Wich jo@mein.io, Hannu Nyman hannu.nyman@iki.fi

Environment: OpenWrt 19.07

Description: After a long uptime, I notice multiple instances of the sqm_collectd.sh script running. All duplicated instances are orphans (their PPID is 1).

My exec plugin configuration:

LoadPlugin exec
<Plugin exec>
        Exec "nobody:nogroup" "/usr/libexec/collectd/sqm_collectd.sh" "pppoe-wan" "ifb4pppoe-wan"
</Plugin>

The top output shows the duplicated instances:

  PID  PPID USER     STAT   VSZ %VSZ %CPU COMMAND
15253     1 nobody   SN    1260   1%   4% /bin/sh /usr/libexec/collectd/sqm_collectd.sh pppoe-wan ifb4pppoe-wan
 9008     1 root     S     3844   3%   1% /usr/sbin/snmpd -Lf /dev/null -f
 4836     1 root     S     1784   1%   1% /usr/sbin/hostapd -s -P /var/run/wifi-phy0.pid -B /var/run/hostapd-phy0.con
17655 17438 root     R     1208   1%   0% top
17437  2004 root     S     1144   1%   0% /usr/sbin/dropbear -F -P /var/run/dropbear.1.pid -p x.x.x.x:22 -p 2001:4
  397     2 root     RW       0   0%   0% [kworker/0:3]
    7     2 root     SW       0   0%   0% [ksoftirqd/0]
16927     2 root     IW       0   0%   0% [kworker/u2:0]
17582     2 root     IW       0   0%   0% [kworker/u2:2]
17619     1 root     SN    7228   6%   0% /usr/sbin/collectd -C /tmp/collectd.conf -f
 1968     1 root     S     4464   4%   0% /usr/sbin/openvpn --syslog openvpn(myvpn) --status /var/run/openvpn.myvpn.s
  980     1 root     S     2204   2%   0% /sbin/rpcd
 4487     1 root     S     1760   1%   0% /usr/sbin/hostapd -s -P /var/run/wifi-phy1.pid -B /var/run/hostapd-phy1.con
 1260     1 root     S     1744   1%   0% /sbin/netifd
    1     0 root     S     1564   1%   0% /sbin/procd
 1291     1 root     S     1444   1%   0% /usr/sbin/odhcpd
 8142     1 root     S     1436   1%   0% /bin/sh /usr/lib/ddns/dynamic_dns_updater.sh -v 0 -S ovh -- start
 5085     1 dnsmasq  S     1372   1%   0% /usr/sbin/dnsmasq -C /var/etc/dnsmasq.conf.cfg01411c -k -x /var/run/dnsmasq
 1643     1 root     S     1360   1%   0% /usr/sbin/uhttpd -f -h /www -r Lede -x /cgi-bin -t 60 -T 30 -k 20 -A 1 -n 3
23440     1 nobody   SN    1264   1%   0% /bin/sh /usr/libexec/collectd/sqm_collectd.sh pppoe-wan ifb4pppoe-wan
 9768     1 nobody   SN    1264   1%   0% /bin/sh /usr/libexec/collectd/sqm_collectd.sh pppoe-wan ifb4pppoe-wan
23127     1 nobody   SN    1264   1%   0% /bin/sh /usr/libexec/collectd/sqm_collectd.sh pppoe-wan ifb4pppoe-wan
19354     1 nobody   SN    1264   1%   0% /bin/sh /usr/libexec/collectd/sqm_collectd.sh pppoe-wan ifb4pppoe-wan
28074     1 nobody   SN    1264   1%   0% /bin/sh /usr/libexec/collectd/sqm_collectd.sh pppoe-wan ifb4pppoe-wan
17649 17619 nobody   SN    1260   1%   0% /bin/sh /usr/libexec/collectd/sqm_collectd.sh pppoe-wan ifb4pppoe-wan
  509     1 root     S     1252   1%   0% /sbin/ubusd
  948     1 root     S     1244   1%   0% /sbin/logd -S 64
 7406  1260 root     S     1232   1%   0% /usr/sbin/pppd nodetach ipparam wan ifname pppoe-wan lcp-echo-interval 5 lc
 1323     1 root     S     1212   1%   0% /usr/sbin/crond -f -c /etc/crontabs -l 5
 2195     1 root     S<    1212   1%   0% /usr/sbin/ntpd -n -N -l -S /usr/sbin/ntpd-hotplug -p 0.fr.pool.ntp.org -p 1
17438 17437 root     S     1212   1%   0% -ash
17561  8142 root     S     1212   1%   0% sleep 600
17566  9768 nobody   SN    1208   1%   0% sleep 60
17580     1 nobody   SN    1208   1%   0% sleep 60
17581 28074 nobody   SN    1208   1%   0% sleep 60
17587 19354 nobody   SN    1208   1%   0% sleep 60
17479 23440 nobody   SN    1208   1%   0% sleep 60
17654 17649 nobody   SN    1208   1%   0% sleep 60
17660 15253 nobody   SN    1208   1%   0% sleep 60
17571 23127 nobody   SN    1208   1%   0% sleep 60
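
Picking the orphans out of a long top listing by eye is error-prone, so here is a minimal sketch that lists them directly from /proc. It assumes a Linux /proc filesystem and treats "parent is PID 1" as the definition of orphaned, as in the output above. The function names ppid_of and list_orphans are made up for illustration and are not part of collectd-mod-sqm.

```shell
#!/bin/sh
# Sketch: list processes matching a pattern whose parent is PID 1 (orphans).
# ppid_of and list_orphans are illustrative names, not part of the package.

# Print the parent PID of process $1, read from /proc/<pid>/status.
ppid_of() {
    awk '$1 == "PPid:" {print $2}' "/proc/$1/status" 2>/dev/null
}

# Print the PID of every process whose command line contains $1
# and whose parent is PID 1.
list_orphans() {
    pattern="$1"
    for dir in /proc/[0-9]*; do
        pid="${dir#/proc/}"
        [ "$pid" = "$$" ] && continue      # skip this shell itself
        [ -r "$dir/cmdline" ] || continue  # process gone or not readable
        # cmdline is NUL-separated; turn the NULs into spaces before matching.
        cmd="$(tr '\0' ' ' 2>/dev/null < "$dir/cmdline")" || continue
        case "$cmd" in
            *"$pattern"*)
                if [ "$(ppid_of "$pid")" = "1" ]; then
                    echo "$pid"
                fi
                ;;
        esac
    done
}

list_orphans "/usr/libexec/collectd/sqm_collectd.sh"
```

On a system in the state shown above, this should print the PIDs of the instances with PPID 1 and leave out the one still attached to collectd.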

A possible workaround could be to break the infinite loop after a fixed number of iterations, so that an orphaned instance eventually exits on its own.

i=100   # remaining iterations before the script exits
while true ; do
        for ifc in "$@" ; do
                process_qdisc "$ifc"
        done
        sleep "${INTERVAL%%.*}"
        i=$((i - 1))
        [ "$i" -eq 0 ] && break
done

wulfy23 commented 3 years ago

was also seeing this... search the forum... implemented local fix...

I think it has something to do with manually restarting collectd or luci_statistics or something...

feckert commented 3 years ago

As @wulfy23 mentioned, this may be related to manually restarting collectd. I think we need to modify the script and install a signal handler that terminates the script on SIGTERM.

Have a look at this shell pattern to react to a signal after a sleep: https://github.com/openwrt/packages/blob/e36a65459a55e9bbf78d94a41ea93caa17f49779/net/mwan3/files/usr/sbin/mwan3track#L386-L389

Have a look at this shell pattern to install a signal handler: https://github.com/openwrt/packages/blob/e36a65459a55e9bbf78d94a41ea93caa17f49779/net/mwan3/files/usr/sbin/mwan3track#L209-L211
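
To make the two ideas concrete, here is a minimal sketch of a polling loop with a SIGTERM handler and an interruptible sleep. It illustrates the general shell technique and is not a copy of the linked mwan3track code. The process_qdisc body, the INTERVAL default, the names running, sleep_pid, on_term and main_loop, and the self-sent signal at the end are all assumptions made for the demo.

```shell
#!/bin/sh
# Sketch: a polling loop that exits promptly on SIGTERM.
# process_qdisc here is a stand-in for the real function in sqm_collectd.sh.
process_qdisc() {
    echo "polling $1"
}

INTERVAL="${INTERVAL:-60}"   # assumed default, matching the "sleep 60" seen in top
running=1
sleep_pid=""

# Signal handler: stop the loop and end the pending sleep early.
on_term() {
    running=0
    if [ -n "$sleep_pid" ]; then
        kill "$sleep_pid" 2>/dev/null || true
    fi
}
trap on_term TERM INT

main_loop() {
    while [ "$running" -eq 1 ]; do
        for ifc in "$@"; do
            process_qdisc "$ifc"
        done
        # Sleep in the background and wait for it: a trapped signal
        # interrupts "wait" at once, while a foreground sleep would delay
        # the handler until the sleep had finished.
        sleep "${INTERVAL%%.*}" &
        sleep_pid=$!
        wait "$sleep_pid" || true   # non-zero when interrupted by a signal
    done
}

# Demo only: send SIGTERM to this shell after one second, then start polling.
( sleep 1; kill -TERM "$$" ) &
main_loop "pppoe-wan"
echo "stopped"
```

Note that a handler like this only helps when the script actually receives a signal. It would not, by itself, cover the case where the parent is killed without signalling its children.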

wulfy23 commented 3 years ago

fwiw this is my hack... (reap other instances on call)

for mPID in $(pgrep -f '/usr/libexec/collectd/sqm_collectd.sh'); do
    [ "$mPID" = "$$" ] && continue
    kill -9 "$mPID"
done

hnyman commented 3 years ago

cc @ldir-EDB0

ldir-EDB0 commented 3 years ago

This is quite strange and I've not seen this behaviour on my system; however, I can replicate it by killing collectd forcefully, e.g. with kill -9. In that case collectd doesn't get a chance to signal its children, so the child process keeps running even though it's an orphan.

Breaking the infinite 'while true' loop seems a sensible thing to do, replacing it with something akin to 'while not an orphan ; do'. I'll think about that.

ldir-EDB0 commented 3 years ago

Instead of while true:

# while not orphaned
while [ "$(awk '$1 ~ "^PPid:" {print $2}' /proc/$$/status)" -ne 1 ] ; do
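
For context, here is a sketch of how that condition might sit in a complete loop, with the check moved into a helper function. The names is_orphaned and poll_until_orphaned, the process_qdisc stand-in, and the INTERVAL default are assumptions for illustration. The merged change may look different.

```shell
#!/bin/sh
# Sketch: poll until the script has been re-parented to PID 1.
# is_orphaned and poll_until_orphaned are illustrative names.

# Succeed (exit status 0) if this shell's parent is PID 1.
is_orphaned() {
    ppid="$(awk '$1 == "PPid:" {print $2}' "/proc/$$/status")"
    [ "$ppid" = "1" ]
}

# Stand-in for the real function in sqm_collectd.sh.
process_qdisc() {
    echo "polling $1"
}

INTERVAL="${INTERVAL:-60}"   # assumed default

poll_until_orphaned() {
    while ! is_orphaned; do
        for ifc in "$@"; do
            process_qdisc "$ifc"
        done
        sleep "${INTERVAL%%.*}"
    done
}

# One-shot check, so the sketch terminates when run on its own.
if is_orphaned; then
    echo "parent is PID 1"
else
    echo "parent is not PID 1"
fi
# In the real script the last line would instead be:
# poll_until_orphaned "$@"
```

Because the check runs only once per iteration, an orphaned instance can linger for up to one interval before it exits. The test also assumes orphans are re-parented to PID 1, which may not hold where a child subreaper is in use.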

wulfy23 commented 3 years ago

https://github.com/openwrt/packages/pull/16770

tested and functional

cheers

[ /usbstick 53°] ps w | grep collectd
17107 root      6076 SN   /usr/sbin/collectd -C /tmp/collectd.conf -f
17226 nobody    1416 SN   /bin/sh /usr/libexec/collectd/sqm_collectd.sh eth1
23370 root      1240 S    grep collectd

[ /usbstick 53°] kill -9 17107

[ /usbstick 54°] ps w | grep collectd
17226 nobody    1416 SN   /bin/sh /usr/libexec/collectd/sqm_collectd.sh eth1
23380 root      1240 S    grep collectd

[ /usbstick 54°] ps w | grep collectd
17226 nobody    1416 SN   /bin/sh /usr/libexec/collectd/sqm_collectd.sh eth1
23385 root      5988 SN   /usr/sbin/collectd -C /tmp/collectd.conf -f
23399 nobody    1316 SN   /bin/sh /usr/libexec/collectd/sqm_collectd.sh eth1
23407 root      1240 S    grep collectd

[ /usbstick 53°] ps w | grep collectd
23385 root      5988 SN   /usr/sbin/collectd -C /tmp/collectd.conf -f
23399 nobody    1316 SN   /bin/sh /usr/libexec/collectd/sqm_collectd.sh eth1
23414 root      1240 S    grep collectd

openwrt / packages

collectd-mod-sqm: many sqm_collectd.sh running after a long uptime #14302
