toreanderson opened this issue 3 years ago (status: Open)
Hi Tore (@toreanderson),
May I ask whether you tried this in a lab scenario with no flow traffic passing? In that scenario it behaves as per your report; otherwise it works just as documented. The reasoning: signals - like the SIGCHLD flagging that a plugin process died - are blocked in critical sections of the code (which also include listening for new packets) and are checked and handled, if needed, where appropriate. For this mechanism to work well, it assumes the daemon has business to do (maybe not the best assumption, but also not a bad one, right? A-la: nothing to do, who cares).
Should you find the above status quo unacceptable, a valid and easy fix would be to introduce an alarm-style timeout on the recv() listening for incoming packets, so that even with zero incoming traffic some routine checks (including signal checks) can be performed. Let me know.
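For illustration, a rough sketch of that idea, assuming a plain bound UDP socket and a SIGCHLD handler that only sets a flag (collector_fd and child_died are placeholder names, not actual pmacct symbols):

#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

extern int collector_fd;                  /* placeholder: bound UDP collector socket */
extern volatile sig_atomic_t child_died;  /* placeholder: set by the SIGCHLD handler */

void receive_loop(void)
{
    struct timeval tv = { 1, 0 };  /* wake up at least once per second */
    char buf[65536];

    /* With SO_RCVTIMEO set, recvfrom() returns EAGAIN/EWOULDBLOCK after a
     * second of silence instead of blocking forever. */
    setsockopt(collector_fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    for (;;) {
        ssize_t len = recvfrom(collector_fd, buf, sizeof(buf), 0, NULL, NULL);

        if (len > 0) {
            /* ... process the received sFlow datagram ... */
        } else if (len < 0 && errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR) {
            break;  /* unexpected socket error */
        }

        /* Routine checks now run even when no packets arrive. */
        if (child_died) {
            child_died = 0;
            /* ... reap the plugin and decide whether the core should exit ... */
        }
    }
}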
Paolo
The steps to reproduce above were indeed reproduced using a minimal config in a lab setting without any incoming sFlow traffic, using then-current Git master.
That said, I did experience it first in production, in a more elaborate setup running v1.7.6. This is the config in question:
aggregate[in]: dst_as
aggregate[out]: src_as
networks_file: /etc/pmacct/networks.lst
plugins: sqlite3[in], sqlite3[out]
pre_tag_filter[in]: 1
pre_tag_filter[out]: 2
pre_tag_map: /etc/pmacct/internet_uplinks.map
sfacctd_as: longest
sfacctd_net: longest
sfacctd_port: 6344
sfacctd_renormalize: true
sql_db[in]: /var/lib/pmacct/in.db
sql_db[out]: /var/lib/pmacct/out.db
sql_history[in]: 1M
sql_history[out]: 1M
sql_history_roundoff[in]: M
sql_history_roundoff[out]: M
sql_optimize_clauses[in]: true
sql_optimize_clauses[out]: true
sql_refresh_time[in]: 60
sql_refresh_time[out]: 60
sql_table_schema[in]: /etc/pmacct/in.schema
sql_table_schema[out]: /etc/pmacct/out.schema
sql_table_version[in]: 9
sql_table_version[out]: 9
It was running fine. Then issue #509 hit, due to a NetBox import script running amok. This is how that incident looked in the logs:
2021-06-11T23:03:01.018778572+02:00 INFO ( out/sqlite3 ): *** Purging cache - START (PID: 20584) ***
2021-06-11T23:03:01.020009620+02:00 INFO ( in/sqlite3 ): *** Purging cache - START (PID: 20583) ***
2021-06-11T23:03:01.043585428+02:00 INFO ( out/sqlite3 ): *** Purging cache - END (PID: 20584, QN: 34/34, ET: 0) ***
2021-06-11T23:03:01.048145072+02:00 INFO ( in/sqlite3 ): *** Purging cache - END (PID: 20583, QN: 32/32, ET: 0) ***
2021-06-11T23:04:01.058988206+02:00 INFO ( in/sqlite3 ): *** Purging cache - START (PID: 20586) ***
2021-06-11T23:04:01.060416533+02:00 INFO ( out/sqlite3 ): *** Purging cache - START (PID: 20585) ***
2021-06-11T23:04:01.082198891+02:00 INFO ( in/sqlite3 ): *** Purging cache - END (PID: 20586, QN: 33/33, ET: 0) ***
2021-06-11T23:04:01.086447026+02:00 INFO ( out/sqlite3 ): *** Purging cache - END (PID: 20585, QN: 34/34, ET: 0) ***
2021-06-11T23:05:01.007407240+02:00 INFO ( out/sqlite3 ): *** Purging cache - START (PID: 20587) ***
2021-06-11T23:05:01.022361491+02:00 INFO ( out/sqlite3 ): *** Purging cache - END (PID: 20587, QN: 32/32, ET: 0) ***
2021-06-11T23:05:01.046064246+02:00 INFO ( in/sqlite3 ): *** Purging cache - START (PID: 20588) ***
2021-06-11T23:05:01.076810907+02:00 INFO ( in/sqlite3 ): *** Purging cache - END (PID: 20588, QN: 31/31, ET: 0) ***
2021-06-11T23:06:01.070376197+02:00 INFO ( out/sqlite3 ): *** Purging cache - START (PID: 20595) ***
2021-06-11T23:06:01.079636796+02:00 INFO ( in/sqlite3 ): *** Purging cache - START (PID: 20596) ***
2021-06-11T23:06:01.086366015+02:00 INFO ( out/sqlite3 ): [/etc/pmacct/networks.lst] map successfully (re)loaded.
2021-06-11T23:06:01.087576463+02:00 INFO ( in/sqlite3 ): [/etc/pmacct/networks.lst] map successfully (re)loaded.
2021-06-11T23:06:01.121216266+02:00 INFO ( in/sqlite3 ): *** Purging cache - END (PID: 20596, QN: 38/38, ET: 0) ***
2021-06-11T23:06:01.139476414+02:00 INFO ( out/sqlite3 ): *** Purging cache - END (PID: 20595, QN: 40/40, ET: 0) ***
2021-06-11T23:06:01.347761895+02:00 WARN ( default/core ): connection lost to 'in-sqlite3'; closing connection.
2021-06-14T08:24:49.673373133+02:00 INFO ( default/core ): OK, Exiting ...
2021-06-14T08:24:50.331224458+02:00 INFO ( default/core ): sFlow Accounting Daemon, sfacctd 1.7.6-git (20210216-0 (241b5dba))
So as you can see there is constant chatter from both the in and out plugins. That is normal. Then the broken networks.lst file is reloaded (by both plugins), after which the core process reports having lost connectivity to the in plugin (but not the out plugin, for some reason). Anyway, the core process stayed alive until days later, when I noticed the data collection had stopped and I manually restarted it.
I can easily reproduce this in production too.
I start out with the following processes:
root 1320210 0.6 0.5 68668 29232 ? Ss 16:29 0:00 \_ sfacctd: Core Process [default]
root 1320221 0.1 0.5 72356 27224 ? S 16:29 0:00 \_ sfacctd: SQLite3 Plugin [in]
root 1320222 0.2 0.5 72356 27972 ? S 16:29 0:00 \_ sfacctd: SQLite3 Plugin [out]
Then I add a bunch of duplicate entries to the networks_file and reload it by sending SIGUSR2 (signal 12) to the processes in the container:
for i in {1..134}; do echo 216,10.27.0.0/24; done | tee -a networks.lst
podman exec sfacctd killall5 12
Sure enough, the container logs the following:
2021-06-17T16:30:01.011147829+02:00 INFO ( in/sqlite3 ): *** Purging cache - END (PID: 8, QN: 0/31, ET: 0) ***
2021-06-17T16:30:01.014072417+02:00 INFO ( out/sqlite3 ): *** Purging cache - END (PID: 9, QN: 0/28, ET: 0) ***
2021-06-17T16:31:01.056481433+02:00 INFO ( out/sqlite3 ): *** Purging cache - START (PID: 16) ***
2021-06-17T16:31:01.065098937+02:00 INFO ( out/sqlite3 ): [/etc/pmacct/networks.lst] map successfully (re)loaded.
2021-06-17T16:31:01.076877760+02:00 INFO ( in/sqlite3 ): *** Purging cache - START (PID: 17) ***
2021-06-17T16:31:01.083972996+02:00 INFO ( in/sqlite3 ): [/etc/pmacct/networks.lst] map successfully (re)loaded.
2021-06-17T16:31:01.131600648+02:00 INFO ( out/sqlite3 ): *** Purging cache - END (PID: 16, QN: 41/41, ET: 0) ***
2021-06-17T16:31:01.137036885+02:00 INFO ( in/sqlite3 ): *** Purging cache - END (PID: 17, QN: 43/43, ET: 0) ***
2021-06-17T16:31:01.317873005+02:00 WARN ( default/core ): connection lost to 'in-sqlite3'; closing connection.
…and stops working.
Process list now looks like this:
root 1320210 0.4 0.5 68668 29232 ? Ss 16:29 0:00 \_ sfacctd: Core Process [default]
root 1320222 0.1 0.0 0 0 ? Z 16:29 0:00 \_ [sfacctd] <defunct>
root 1320317 0.0 0.0 0 0 ? Z 16:31 0:00 \_ [sfacctd] <defunct>
root 1320320 0.0 0.0 0 0 ? Z 16:31 0:00 \_ [sfacctd] <defunct>
However it does not always happen this way. Sometimes it will log that it loses connection to both plugins, and if it does, the core process will exit too (as expected). So it might be a race condition of some sort.
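To illustrate the kind of race I have in mind (a generic POSIX sketch, not a claim about pmacct's internals): standard signals do not queue, so two SIGCHLDs delivered while the signal is blocked collapse into a single pending SIGCHLD, and only a waitpid() loop with WNOHANG reliably accounts for every dead child:

#include <sys/types.h>
#include <sys/wait.h>

int reap_dead_children(void)
{
    int reaped = 0, status;
    pid_t pid;

    /* WNOHANG: keep collecting exited children without blocking, so a
     * single coalesced SIGCHLD still accounts for every dead plugin. */
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        reaped++;
        /* ... mark the plugin with this pid as gone ... */
    }

    return reaped;  /* caller can exit once no live plugins remain */
}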
Hi Tore (@toreanderson),
I am sorry for taking so long to pick this up again. You said: "The steps to reproduce above were indeed reproduced using a minimal config in a lab setting without any incoming sFlow traffic, using then-current Git master."
I reckon I indeed need something basic in order to reproduce the issue and make progress with the troubleshooting. Can you share the minimal config you mention here?
Paolo
Hi,
That «minimal config in a lab setting» was the config described in https://github.com/pmacct/pmacct/issues/510#issue-920128107. This lab setup did not have any inbound sFlow traffic, however.
The production config where I experienced the problem is shown in https://github.com/pmacct/pmacct/issues/510#issuecomment-863294458. While it is a little bit more elaborate than the lab config, it should be easy enough to recreate in a lab setting with inbound sFlow packets, I would assume.
I've also seen the daemon not exiting when a plugin fails, with this config:
plugins: nfprobe,print
#make the whole process exit if a plugin fails (or it keeps running doing no flow exports)
plugin_exit_any: true
nfprobe_receiver: 10.1.2.3:2055
nfprobe_maxflows: 166667
plugin_pipe_size: 10240000 #10mb
#template this out so each worker gets a different ID - or multiple workers on same host can show errors on collector
nfprobe_engine: 0
The config isn't valid - it logs:
ERROR ( default_nfprobe/nfprobe ): parse_engine(): source_id is only supported on NetFlow v9/IPFIX exports
(The fix is to add nfprobe_version: 9.)
But it keeps running regardless. I would have thought the core process would exit here, as the nfprobe plugin exited. (I don't see "pmacctd: Netflow Probe Plugin [default_nfprobe]" in the ps output.)
Using pmacct 1.7.2-3build1
Hi Alex (@alexdarktrace),
Version 1.7.2 is definitely too old; plugin_exit_any, for example, was only introduced in 1.7.3.
Paolo
Description
When a running sfacctd process attempts to reload a networks_file containing >131 repeated entries (cf. issue #509), the plugin child process crashes. However, the core process continues running as if nothing happened. This runs counter to the documentation, which states that if all the plugins fail, the core process will as well:
https://github.com/pmacct/pmacct/blob/ba65dbe83db2a4c53302200a65dfd7c6c9311e2d/CONFIG-KEYS#L376-L382
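To make explicit the behaviour I expected (a hypothetical sketch of that documented contract, not pmacct's actual code): the core should exit once its plugins are gone, or on the first plugin death when plugin_exit_any is set:

#include <stdbool.h>
#include <stdlib.h>

struct core_state {
    int live_plugins;      /* plugin children still running            */
    int total_plugins;     /* plugin children started at daemon launch */
    bool plugin_exit_any;  /* mirrors the plugin_exit_any config key   */
};

void handle_plugin_death(struct core_state *st)
{
    if (st->live_plugins > 0)
        st->live_plugins--;

    /* Exit when every plugin is gone, or when any one is gone and
     * plugin_exit_any is enabled. */
    if (st->live_plugins == 0 ||
        (st->plugin_exit_any && st->live_plugins < st->total_plugins))
        exit(EXIT_FAILURE);
}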
Here's how to reproduce:
In one terminal, start sfacctd with an empty networks_file. It starts up fine and continues to run in the foreground.
In another terminal, we can inspect the process list to see that the sfacctd core process has spawned one child process.
Next, create a networks_file that reproduces issue #509 and make sfacctd attempt to reload it by sending it USR2.
Output from the sfacctd process in the first terminal confirms the reload signal was received. It continues to run in the foreground. However, by again checking the process list in the second terminal, we can see that the child process has crashed and terminated (but it has not been reaped by the core process, so it is in the defunct/"zombie" state).
I have also tried reproducing this with plugin_exit_any: true, but that makes no difference (and should in any case not be necessary, since I only have a single plugin).
Version