pmacct / pmacct

pmacct is a small set of multi-purpose passive network monitoring tools [NetFlow IPFIX sFlow libpcap BGP BMP RPKI IGP Streaming Telemetry].
http://www.pmacct.net

Core process does not exit even though child process crashes #510

Open toreanderson opened 3 years ago

toreanderson commented 3 years ago

Description

When a running sfacctd process attempts to reload a networks_file containing >131 repeated entries (cf. issue #509), the plugin child process crashes. However, the core process continues running as if nothing happened.

This runs counter to the documentation, which states that if all the plugins fail, the core process will exit as well:

https://github.com/pmacct/pmacct/blob/ba65dbe83db2a4c53302200a65dfd7c6c9311e2d/CONFIG-KEYS#L376-L382

Here's how to reproduce:

In one terminal, start sfacctd with an empty networks_file:

$ echo networks_file: /etc/pmacct/networks.lst > /etc/pmacct/sfacctd.conf
$ echo > /etc/pmacct/networks.lst
$ podman run --rm -it -v /etc/pmacct:/etc/pmacct docker.io/pmacct/sfacctd:bleeding-edge
WARN: [/etc/pmacct/sfacctd.conf] No plugin has been activated; defaulting to in-memory table.
INFO ( default/core ): sFlow Accounting Daemon, sfacctd 1.7.7-git (20210612-1 (ba65dbe8))
INFO ( default/core ):  '--enable-mysql' '--enable-pgsql' '--enable-sqlite3' '--enable-kafka' '--enable-geoipv2' '--enable-jansson' '--enable-rabbitmq' '--enable-nflog' '--enable-ndpi' '--enable-zmq' '--enable-avro' '--enable-serdes' '--enable-redis' '--enable-gnutls' 'AVRO_CFLAGS=-I/usr/local/avro/include' 'AVRO_LIBS=-L/usr/local/avro/lib -lavro' '--enable-l2' '--enable-traffic-bins' '--enable-bgp-bins' '--enable-bmp-bins' '--enable-st-bins'
INFO ( default/core ): Reading configuration file '/etc/pmacct/sfacctd.conf'.
WARN ( default_memory/memory ): defaulting to SRC HOST aggregation.
INFO ( default/core ): [/etc/pmacct/networks.lst] map successfully (re)loaded.
INFO ( default_memory/memory ): [/etc/pmacct/networks.lst] map successfully (re)loaded.
INFO ( default/core ): waiting for sFlow data on :::6343
OK ( default_memory/memory ): waiting for data on: '/tmp/collect.pipe'
(…)

It starts up fine and continues to run in the foreground.

In another terminal, we can inspect the process list to see that the sfacctd core process has spawned one child process:

$ ps xf | tail -n 3
3873816 ?        Ssl    0:00 /usr/bin/conmon --api-version 1 -c 95920fcadc9eef946eff51f533ac8b2c4d8f6219d6fee4da1059fc4b214b3796 -u 95920fcadc9eef946eff51f533ac8b2c4d8f6219d6fee4da1059fc4b214b3796 -r /usr/bin/crun -b /var/lib/containers/storage/overlay-containers/95920fcadc9eef946eff51f533ac8b2c4d8f6219d6fee4da1059fc4b214b3796/userdata -p /var/run/containers/storage/overlay-containers/95920fcadc9eef946eff51f533ac8b2c4d8f6219d6fee4da1059fc4b214b3796/userdata/pidfile -n cool_chebyshev --exit-dir /var/run/libpod/exits --socket-dir-path /var/run/libpod/socket -s -l k8s-file:/var/lib/containers/storage/overlay-containers/95920fcadc9eef946eff51f533ac8b2c4d8f6219d6fee4da1059fc4b214b3796/userdata/ctr.log --log-level error --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/var/run/containers/storage/overlay-containers/95920fcadc9eef946eff51f533ac8b2c4d8f6219d6fee4da1059fc4b214b3796/userdata/oci-log -t --conmon-pidfile /var/run/containers/storage/overlay-containers/95920fcadc9eef946eff51f533ac8b2c4d8f6219d6fee4da1059fc4b214b3796/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /var/run/containers/storage --exit-command-arg --log-level --exit-command-arg error --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /var/run/libpod --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev,metacopy=on --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg container --exit-command-arg cleanup --exit-command-arg --rm --exit-command-arg 95920fcadc9eef946eff51f533ac8b2c4d8f6219d6fee4da1059fc4b214b3796
3873821 pts/0    Ss+    0:00  \_ sfacctd: Core Process [default]
3873827 pts/0    S+     0:00      \_ sfacctd: IMT Plugin [default_memory]

Next, create a networks_file that reproduces issue #509 and make sfacctd attempt to reload it by sending it SIGUSR2 (signal 12):

$ for i in {1..132}; do echo 65000,192.0.2.1/32; done > /etc/pmacct/networks.lst
$ podman exec --latest killall5 12

Output from the sfacctd process in the first terminal confirms the reload signal was received. It continues to run in the foreground.

INFO ( default_memory/memory ): [/etc/pmacct/networks.lst] map successfully (re)loaded.

However, by checking the process list in the second terminal again, we can see that the child process has crashed and terminated (but it has not been reaped by the core process, so it is left in the defunct/"zombie" state):

$ ps xf | tail -n 3
3873816 ?        Ssl    0:00 /usr/bin/conmon --api-version 1 -c 95920fcadc9eef946eff51f533ac8b2c4d8f6219d6fee4da1059fc4b214b3796 -u 95920fcadc9eef946eff51f533ac8b2c4d8f6219d6fee4da1059fc4b214b3796 -r /usr/bin/crun -b /var/lib/containers/storage/overlay-containers/95920fcadc9eef946eff51f533ac8b2c4d8f6219d6fee4da1059fc4b214b3796/userdata -p /var/run/containers/storage/overlay-containers/95920fcadc9eef946eff51f533ac8b2c4d8f6219d6fee4da1059fc4b214b3796/userdata/pidfile -n cool_chebyshev --exit-dir /var/run/libpod/exits --socket-dir-path /var/run/libpod/socket -s -l k8s-file:/var/lib/containers/storage/overlay-containers/95920fcadc9eef946eff51f533ac8b2c4d8f6219d6fee4da1059fc4b214b3796/userdata/ctr.log --log-level error --runtime-arg --log-format=json --runtime-arg --log --runtime-arg=/var/run/containers/storage/overlay-containers/95920fcadc9eef946eff51f533ac8b2c4d8f6219d6fee4da1059fc4b214b3796/userdata/oci-log -t --conmon-pidfile /var/run/containers/storage/overlay-containers/95920fcadc9eef946eff51f533ac8b2c4d8f6219d6fee4da1059fc4b214b3796/userdata/conmon.pid --exit-command /usr/bin/podman --exit-command-arg --root --exit-command-arg /var/lib/containers/storage --exit-command-arg --runroot --exit-command-arg /var/run/containers/storage --exit-command-arg --log-level --exit-command-arg error --exit-command-arg --cgroup-manager --exit-command-arg systemd --exit-command-arg --tmpdir --exit-command-arg /var/run/libpod --exit-command-arg --runtime --exit-command-arg crun --exit-command-arg --storage-driver --exit-command-arg overlay --exit-command-arg --storage-opt --exit-command-arg overlay.mountopt=nodev,metacopy=on --exit-command-arg --events-backend --exit-command-arg journald --exit-command-arg container --exit-command-arg cleanup --exit-command-arg --rm --exit-command-arg 95920fcadc9eef946eff51f533ac8b2c4d8f6219d6fee4da1059fc4b214b3796
3873821 pts/0    Ss+    0:00  \_ sfacctd: Core Process [default]
3873827 pts/0    Z+     0:00      \_ [sfacctd] <defunct>

I have also tried reproducing this with plugin_exit_any: true, but that makes no difference (and should in any case not be necessary since I only have a single plugin).

Version

$ podman run --rm -it -v /etc/pmacct:/etc/pmacct docker.io/pmacct/sfacctd:bleeding-edge -V
sFlow Accounting Daemon, sfacctd 1.7.7-git [20210612-1 (ba65dbe8)]

Arguments:
 '--enable-mysql' '--enable-pgsql' '--enable-sqlite3' '--enable-kafka' '--enable-geoipv2' '--enable-jansson' '--enable-rabbitmq' '--enable-nflog' '--enable-ndpi' '--enable-zmq' '--enable-avro' '--enable-serdes' '--enable-redis' '--enable-gnutls' 'AVRO_CFLAGS=-I/usr/local/avro/include' 'AVRO_LIBS=-L/usr/local/avro/lib -lavro' '--enable-l2' '--enable-traffic-bins' '--enable-bgp-bins' '--enable-bmp-bins' '--enable-st-bins'

Libs:
cdada 0.3.4
libpcap version 1.8.1
MariaDB 10.3.27
PostgreSQL 110012
sqlite3 3.27.2
rabbimq-c 0.11.0
rdkafka 1.7.0
jansson 2.13.1
MaxmindDB 1.6.0
ZeroMQ 4.3.2
Redis 1.0.1
GnuTLS 3.6.7
avro-c
serdes
nDPI 3.4.0
netfilter_log

Plugins:
memory
print
nfprobe
sfprobe
tee
mysql
postgresql
sqlite
amqp
kafka

System:
Linux 5.8.18-100.fc31.x86_64 #1 SMP Mon Nov 2 20:32:55 UTC 2020 x86_64

Compiler:
gcc 8.3.0

For suggestions, critics, bugs, contact me: Paolo Lucente <paolo@pmacct.net>.
paololucente commented 3 years ago

Hi Tore ( @toreanderson ),

May I ask whether you tried this in a lab scenario with no flow traffic passing? In that scenario it behaves as per your report; otherwise it works just as documented. Reasoning: signals - like the SIGCHLD flagging that a plugin process died - are blocked in critical sections of the code (which include listening for new packets) and are checked and handled, if needed, where appropriate. For this mechanism to work well, it assumes that the daemon has business to do (maybe not the best assumption, but also not a bad one, right? A la: nothing to do, who cares).
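To make the mechanism described above concrete, here is a minimal illustrative sketch (not pmacct's actual code) of a collector loop that blocks SIGCHLD around the packet-receive path and only reacts to it afterwards. With zero incoming traffic, recvfrom() never returns, so the pending SIGCHLD is never acted upon:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>

static volatile sig_atomic_t child_died = 0;

static void sigchld_handler(int sig)
{
  (void)sig;
  child_died = 1;
}

static void collector_loop(int sock)
{
  sigset_t block, oldmask;
  char buf[65536];

  sigemptyset(&block);
  sigaddset(&block, SIGCHLD);

  for (;;) {
    /* critical section: defer SIGCHLD delivery */
    sigprocmask(SIG_BLOCK, &block, &oldmask);

    /* blocks indefinitely if no packet ever arrives */
    ssize_t len = recvfrom(sock, buf, sizeof(buf), 0, NULL, NULL);
    (void)len;

    /* end of critical section: a pending SIGCHLD is delivered here */
    sigprocmask(SIG_SETMASK, &oldmask, NULL);

    if (child_died) {
      /* reap the dead plugin and let the core exit */
      while (waitpid(-1, NULL, WNOHANG) > 0);
      fprintf(stderr, "plugin process died, exiting\n");
      exit(1);
    }

    /* ...process the received datagram... */
  }
}

int main(void)
{
  struct sigaction sa;
  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = sigchld_handler;
  sigaction(SIGCHLD, &sa, NULL);

  int sock = socket(AF_INET, SOCK_DGRAM, 0); /* bind to the sFlow port omitted */
  collector_loop(sock);
  return 0;
}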

Should you find the above status quo unacceptable, a perhaps valid and easy fix would be to introduce some alarm-style timeout on the recv() (listening for incoming packets) so that, in case of zero incoming traffic, some routine checks (including signal checks) can still be performed. Let me know.
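For illustration, a minimal sketch (again, not actual pmacct code) of that kind of workaround: giving the socket a receive timeout via SO_RCVTIMEO so that housekeeping, such as reaping a dead plugin process and exiting, still runs when no traffic arrives:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>

static void collector_loop_with_timeout(int sock)
{
  char buf[65536];
  struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };

  /* make recvfrom() give up after 5 seconds of silence */
  setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

  for (;;) {
    ssize_t len = recvfrom(sock, buf, sizeof(buf), 0, NULL, NULL);

    if (len < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
      /* no traffic: do the routine checks anyway */
      pid_t pid = waitpid(-1, NULL, WNOHANG);
      if (pid > 0) {
        fprintf(stderr, "plugin %ld died, exiting\n", (long) pid);
        exit(1);
      }
      continue;
    }

    if (len < 0) continue; /* other transient errors */

    /* ...process the received sFlow datagram... */
  }
}

int main(void)
{
  int sock = socket(AF_INET, SOCK_DGRAM, 0); /* bind to the sFlow port omitted */
  collector_loop_with_timeout(sock);
  return 0;
}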

Paolo

toreanderson commented 3 years ago

The steps to reproduce above were indeed performed using a minimal config in a lab setting without any incoming sFlow traffic, using the then-current Git master.

That said, I did experience it first in production, in a more elaborate setup running v1.7.6. This is the config in question:

aggregate[in]: dst_as
aggregate[out]: src_as
networks_file: /etc/pmacct/networks.lst
plugins: sqlite3[in], sqlite3[out]
pre_tag_filter[in]: 1
pre_tag_filter[out]: 2
pre_tag_map: /etc/pmacct/internet_uplinks.map
sfacctd_as: longest
sfacctd_net: longest
sfacctd_port: 6344
sfacctd_renormalize: true
sql_db[in]: /var/lib/pmacct/in.db
sql_db[out]: /var/lib/pmacct/out.db
sql_history[in]: 1M
sql_history[out]: 1M
sql_history_roundoff[in]: M
sql_history_roundoff[out]: M
sql_optimize_clauses[in]: true
sql_optimize_clauses[out]: true
sql_refresh_time[in]: 60
sql_refresh_time[out]: 60
sql_table_schema[in]: /etc/pmacct/in.schema
sql_table_schema[out]: /etc/pmacct/out.schema
sql_table_version[in]: 9
sql_table_version[out]: 9

It was running fine. Issue #509 was then triggered by a NetBox import script run amok. This is how that incident looked in the logs:

2021-06-11T23:03:01.018778572+02:00 INFO ( out/sqlite3 ): *** Purging cache - START (PID: 20584) ***
2021-06-11T23:03:01.020009620+02:00 INFO ( in/sqlite3 ): *** Purging cache - START (PID: 20583) ***
2021-06-11T23:03:01.043585428+02:00 INFO ( out/sqlite3 ): *** Purging cache - END (PID: 20584, QN: 34/34, ET: 0) ***
2021-06-11T23:03:01.048145072+02:00 INFO ( in/sqlite3 ): *** Purging cache - END (PID: 20583, QN: 32/32, ET: 0) ***
2021-06-11T23:04:01.058988206+02:00 INFO ( in/sqlite3 ): *** Purging cache - START (PID: 20586) ***
2021-06-11T23:04:01.060416533+02:00 INFO ( out/sqlite3 ): *** Purging cache - START (PID: 20585) ***
2021-06-11T23:04:01.082198891+02:00 INFO ( in/sqlite3 ): *** Purging cache - END (PID: 20586, QN: 33/33, ET: 0) ***
2021-06-11T23:04:01.086447026+02:00 INFO ( out/sqlite3 ): *** Purging cache - END (PID: 20585, QN: 34/34, ET: 0) ***
2021-06-11T23:05:01.007407240+02:00 INFO ( out/sqlite3 ): *** Purging cache - START (PID: 20587) ***
2021-06-11T23:05:01.022361491+02:00 INFO ( out/sqlite3 ): *** Purging cache - END (PID: 20587, QN: 32/32, ET: 0) ***
2021-06-11T23:05:01.046064246+02:00 INFO ( in/sqlite3 ): *** Purging cache - START (PID: 20588) ***
2021-06-11T23:05:01.076810907+02:00 INFO ( in/sqlite3 ): *** Purging cache - END (PID: 20588, QN: 31/31, ET: 0) ***
2021-06-11T23:06:01.070376197+02:00 INFO ( out/sqlite3 ): *** Purging cache - START (PID: 20595) ***
2021-06-11T23:06:01.079636796+02:00 INFO ( in/sqlite3 ): *** Purging cache - START (PID: 20596) ***
2021-06-11T23:06:01.086366015+02:00 INFO ( out/sqlite3 ): [/etc/pmacct/networks.lst] map successfully (re)loaded.
2021-06-11T23:06:01.087576463+02:00 INFO ( in/sqlite3 ): [/etc/pmacct/networks.lst] map successfully (re)loaded.
2021-06-11T23:06:01.121216266+02:00 INFO ( in/sqlite3 ): *** Purging cache - END (PID: 20596, QN: 38/38, ET: 0) ***
2021-06-11T23:06:01.139476414+02:00 INFO ( out/sqlite3 ): *** Purging cache - END (PID: 20595, QN: 40/40, ET: 0) ***
2021-06-11T23:06:01.347761895+02:00 WARN ( default/core ): connection lost to 'in-sqlite3'; closing connection.
2021-06-14T08:24:49.673373133+02:00 INFO ( default/core ): OK, Exiting ...
2021-06-14T08:24:50.331224458+02:00 INFO ( default/core ): sFlow Accounting Daemon, sfacctd 1.7.6-git (20210216-0 (241b5dba))

So, as you can see, there is constant chatter from both the in and out plugins. That is normal. Then the broken networks.lst file is reloaded (by both plugins), after which the core process reports having lost connectivity to the in plugin (but not the out plugin, for some reason). In any case, the core process stayed alive until days later, when I noticed that data collection had stopped and restarted it manually.

I can easily reproduce this in production too.

I start out with the following processes:

root     1320210  0.6  0.5  68668 29232 ?        Ss   16:29   0:00  \_ sfacctd: Core Process [default]
root     1320221  0.1  0.5  72356 27224 ?        S    16:29   0:00      \_ sfacctd: SQLite3 Plugin [in]
root     1320222  0.2  0.5  72356 27972 ?        S    16:29   0:00      \_ sfacctd: SQLite3 Plugin [out]

Then I add a bunch of duplicate entries to the networks_file and reload it:

for i in {1..134}; do echo 216,10.27.0.0/24; done | tee -a networks.lst
podman exec sfacctd killall5 12

Sure enough, the container logs the following:

2021-06-17T16:30:01.011147829+02:00 INFO ( in/sqlite3 ): *** Purging cache - END (PID: 8, QN: 0/31, ET: 0) ***
2021-06-17T16:30:01.014072417+02:00 INFO ( out/sqlite3 ): *** Purging cache - END (PID: 9, QN: 0/28, ET: 0) ***
2021-06-17T16:31:01.056481433+02:00 INFO ( out/sqlite3 ): *** Purging cache - START (PID: 16) ***
2021-06-17T16:31:01.065098937+02:00 INFO ( out/sqlite3 ): [/etc/pmacct/networks.lst] map successfully (re)loaded.
2021-06-17T16:31:01.076877760+02:00 INFO ( in/sqlite3 ): *** Purging cache - START (PID: 17) ***
2021-06-17T16:31:01.083972996+02:00 INFO ( in/sqlite3 ): [/etc/pmacct/networks.lst] map successfully (re)loaded.
2021-06-17T16:31:01.131600648+02:00 INFO ( out/sqlite3 ): *** Purging cache - END (PID: 16, QN: 41/41, ET: 0) ***
2021-06-17T16:31:01.137036885+02:00 INFO ( in/sqlite3 ): *** Purging cache - END (PID: 17, QN: 43/43, ET: 0) ***
2021-06-17T16:31:01.317873005+02:00 WARN ( default/core ): connection lost to 'in-sqlite3'; closing connection.

…and stops working.

Process list now looks like this:

root     1320210  0.4  0.5  68668 29232 ?        Ss   16:29   0:00  \_ sfacctd: Core Process [default]
root     1320222  0.1  0.0      0     0 ?        Z    16:29   0:00      \_ [sfacctd] <defunct>
root     1320317  0.0  0.0      0     0 ?        Z    16:31   0:00      \_ [sfacctd] <defunct>
root     1320320  0.0  0.0      0     0 ?        Z    16:31   0:00      \_ [sfacctd] <defunct>

However, it does not always happen this way. Sometimes it logs that it lost the connection to both plugins, and when it does, the core process exits too (as expected). So it might be a race condition of some sort.

paololucente commented 3 years ago

Hi Tore ( @toreanderson ),

I am sorry for the long time it took to pick this up again. You said "the steps to reproduce above were indeed performed using a minimal config in a lab setting without any incoming sFlow traffic, using the then-current Git master", and I reckon I indeed need something basic in order to reproduce the issue and make progress with the troubleshooting. Can you share the minimal config you mention here?

Paolo

toreanderson commented 3 years ago

Hi,

That «minimal config in a lab setting» was the config described in https://github.com/pmacct/pmacct/issues/510#issue-920128107. This lab setup did not have any inbound sFlow traffic, however.

The production config where I experienced the problem is shown in https://github.com/pmacct/pmacct/issues/510#issuecomment-863294458. While it is a little bit more elaborate than the lab config, it should be easy enough to recreate in a lab setting with inbound sFlow packets, I would assume.

alexdarktrace commented 3 years ago

I've also seen the core process not exiting when a plugin fails, with this config:

plugins: nfprobe,print
#make the whole process exit if a plugin fails (or it keeps running doing no flow exports)
plugin_exit_any: true
nfprobe_receiver: 10.1.2.3:2055
nfprobe_maxflows: 166667
plugin_pipe_size: 10240000 #10mb

#template this out so each worker gets a different ID - or multiple workers on same host can show errors on collector
nfprobe_engine: 0

The config isn't valid - it logs ERROR ( default_nfprobe/nfprobe ): parse_engine(): source_id is only supported on NetFlow v9/IPFIX exports. (The fix is to add nfprobe_version: 9.)

But it keeps running regardless. I would have thought the Core process would exit here as the nfprobe plugin exited. (I don't see pmacctd: Netflow Probe Plugin [default_nfprobe] in ps)
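For reference, a sketch of what the corrected configuration would look like, the only change being the nfprobe_version: 9 line the error message asks for:

plugins: nfprobe,print
#make the whole process exit if a plugin fails (or it keeps running doing no flow exports)
plugin_exit_any: true
nfprobe_receiver: 10.1.2.3:2055
nfprobe_version: 9 #NetFlow v9 export, so that nfprobe_engine (source_id) is valid
nfprobe_maxflows: 166667
plugin_pipe_size: 10240000 #10mb

#template this out so each worker gets a different ID - or multiple workers on same host can show errors on collector
nfprobe_engine: 0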

Using pmacct 1.7.2-3build1

paololucente commented 3 years ago

Hi Alex ( @alexdarktrace ),

Version 1.7.2 is definitely too old; plugin_exit_any, for example, was only introduced in 1.7.3.

Paolo