[Closed] jkingcis closed this issue 6 years ago
Hi @jkingcis, thank you for the detailed information you provided. From your description, mainly because of the timing, it seems this is related to some traffic pattern (e.g. fragments), as you said. I am trying to isolate the code affected by this and I have a couple of requests:
Hi @cardigliano, thanks for your fast answer. The exact pf_ring revisions we are using on the two hosts are:
[root@hostA ~]# cat /proc/net/pf_ring/info
PF_RING Version : 6.6.0 (6.6.0-stable:c3961efb76c70754d7c0a17f8496b8c2c35e3fb4)
Total rings : 21
Standard (non ZC) Options
Ring slots : 4096
Slot version : 16
Capture TX : Yes [RX+TX]
IP Defragment : No
Socket Mode : Standard
Cluster Fragment Queue : 10558
Cluster Fragment Discard : 0
[root@hostB ~]# cat /proc/net/pf_ring/info
PF_RING Version : 7.0.0 (7.0.0-stable:83595f31baa6a4daa556801c2f1e261e144618a8)
Total rings : 20
Standard (non ZC) Options
Ring slots : 4096
Slot version : 16
Capture TX : Yes [RX+TX]
IP Defragment : No
Socket Mode : Standard
Cluster Fragment Queue : 0
Cluster Fragment Discard : 0
As for the kernel clustering term, I'm unsure what it refers to. In the suricata.yaml config file, I have this configuration regarding clusters (it is the same on both hosts):
# PF_RING configuration. for use with native PF_RING support
# for more info see http://www.ntop.org/products/pf_ring/
pfring:
  - interface: p2p1
    # Number of receive threads (>1 will enable experimental flow pinned
    # runmode)
    threads: 10
    # Default clusterid. PF_RING will load balance packets based on flow.
    # All threads/processes that will participate need to have the same
    # clusterid.
    cluster-id: 99
    # Default PF_RING cluster type. PF_RING can load balance per flow.
    # Possible values are cluster_flow or cluster_round_robin.
    cluster-type: cluster_flow
    # bpf filter for this interface
    #bpf-filter: tcp
    # Choose checksum verification mode for the interface. At the moment
    # of the capture, some packets may be with an invalid checksum due to
    # offloading to the network card of the checksum computation.
    # Possible values are:
    #  - rxonly: only compute checksum for packets received by network card.
    #  - yes: checksum validation is forced
    #  - no: checksum validation is disabled
    #  - auto: suricata uses a statistical approach to detect when
    #    checksum off-loading is used. (default)
    # Warning: 'checksum-validation' must be set to yes to have any validation
    #checksum-checks: auto
  # Second interface
  - interface: p2p2
    threads: 10
    cluster-id: 93
    cluster-type: cluster_flow
Should I check anywhere else the kernel clustering setup?
Thank you for your time
@jkingcis did you install pf_ring from packages or from source code? (i.e. do you have the PF_RING folder with the source code, used to compile pf_ring, in your home?) As for clustering, it seems to be enabled in Suricata; you can double-check with: cat /proc/net/pf_ring/* | grep "Cluster Id"
@cardigliano I did install PF_RING from the ntop repo, but Suricata was compiled from source with the --enable-pfring flag during the configure process:
# PF_RING installation
wget http://packages.ntop.org/centos-stable/ntop.repo -O /etc/yum.repos.d/ntop.repo
yum install pfring pfring-dkms
Yes, it seems clustering is enabled in Suricata:
[root@hostA ~]# cat /proc/net/pf_ring/* | grep "Cluster Id"
Cluster Id : 99
Cluster Id : 99
Cluster Id : 99
Cluster Id : 99
Cluster Id : 99
Cluster Id : 99
Cluster Id : 99
Cluster Id : 99
Cluster Id : 99
Cluster Id : 99
Cluster Id : 93
Cluster Id : 93
Cluster Id : 93
Cluster Id : 93
Cluster Id : 93
Cluster Id : 93
Cluster Id : 93
Cluster Id : 93
Cluster Id : 93
Cluster Id : 93
[root@hostB ~]# cat /proc/net/pf_ring/* | grep "Cluster Id"
Cluster Id : 99
Cluster Id : 99
Cluster Id : 99
Cluster Id : 99
Cluster Id : 99
Cluster Id : 99
Cluster Id : 99
Cluster Id : 99
Cluster Id : 99
Cluster Id : 99
Cluster Id : 93
Cluster Id : 93
Cluster Id : 93
Cluster Id : 93
Cluster Id : 93
Cluster Id : 93
Cluster Id : 93
Cluster Id : 93
Cluster Id : 93
Cluster Id : 93
@jkingcis, after static code analysis, driven by your input, I found something that seems to be related and I pushed a patch. Non-IP packets could occasionally be recognized as IP fragments during hash computation when the fragment cache was enabled (which is the default). I recommend you upgrade to the latest 7.2 stable (new packages will be available in <1h). Thank you.
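A quick way to check whether a sensor already runs a release carrying the fix is to compare the loaded PF_RING version against 7.2.0. The helper below is only an illustration (the function name and the /proc parsing are mine, not an ntop tool); it uses `sort -V` for the version comparison:

```shell
# Illustrative helper: succeeds when the given version string is at
# least 7.2.0, the release containing the fragment-hash fix.
pfring_ver_ok() {
  min=7.2.0
  # sort -V orders version strings numerically; if the minimum sorts
  # first (or is equal), the supplied version is new enough.
  [ "$(printf '%s\n%s\n' "$min" "$1" | sort -V | head -n1)" = "$min" ]
}

# On a live sensor the version can be read from /proc, e.g.:
#   ver=$(sed -n 's/^PF_RING Version *: *\([0-9.]*\).*/\1/p' /proc/net/pf_ring/info)
pfring_ver_ok 6.6.0 || echo "6.6.0: upgrade needed"
pfring_ver_ok 7.2.0 && echo "7.2.0: fix present"
```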
@cardigliano Yesterday I yum-updated Host B, including pfring, pfring-dkms, the kernel and glibc. After reinstalling the pf_ring kernel module once the system had booted with the new kernel (yum reinstall pfring-dkms after the reboot), everything seems to be working fine. The current setup is:
Host B:
[root@hostB ~]# cat /proc/net/pf_ring/info
PF_RING Version : 7.2.0 (7.2.0-stable:758f764cf6243c81d61d7b6d7cd1ccee04aaa1b6)
Total rings : 20
Standard (non ZC) Options
Ring slots : 4096
Slot version : 17
Capture TX : Yes [RX+TX]
IP Defragment : No
Socket Mode : Standard
Cluster Fragment Queue : 647
Cluster Fragment Discard : 0
[root@hostB ~]# modinfo pf_ring
filename: /lib/modules/3.10.0-862.11.6.el7.x86_64/extra/pf_ring.ko.xz
alias: net-pf-27
version: 7.2.0
description: Packet capture acceleration and analysis
author: ntop.org
license: GPL
retpoline: Y
rhelversion: 7.5
srcversion: 0B168ED2BB628A0718E64F9
depends:
vermagic: 3.10.0-862.11.6.el7.x86_64 SMP mod_unload modversions
parm: min_num_slots:Min number of ring slots (uint)
parm: perfect_rules_hash_size:Perfect rules hash size (uint)
parm: enable_tx_capture:Set to 1 to capture outgoing packets (uint)
parm: enable_frag_coherence:Set to 1 to handle fragments (flow coherence) in clusters (uint)
parm: enable_ip_defrag:Set to 1 to enable IP defragmentation(only rx traffic is defragmentead) (uint)
parm: quick_mode:Set to 1 to run at full speed but with upto one socket per interface (uint)
parm: force_ring_lock:Set to 1 to force ring locking (automatically enable with rss) (uint)
parm: enable_debug:Set to 1 to enable PF_RING debug tracing into the syslog, 2 for more verbosity (uint)
parm: transparent_mode:(deprecated) (uint)
[root@hostB ~]# systemctl status pf_ring
● pf_ring.service - Start/stop pfring service
Loaded: loaded (/usr/lib/systemd/system/pf_ring.service; enabled; vendor preset: disabled)
Active: active (exited) since Thu 2018-08-23 10:01:41 CEST; 22h ago
Process: 15912 ExecStop=/usr/local/bin/pf_ringctl stop (code=exited, status=0/SUCCESS)
Process: 15933 ExecStart=/usr/local/bin/pf_ringctl start (code=exited, status=0/SUCCESS)
Main PID: 15933 (code=exited, status=0/SUCCESS)
CGroup: /system.slice/pf_ring.service
Aug 23 10:01:32 hostB systemd[1]: Starting Start/stop pfring service...
Aug 23 10:01:41 hostB pf_ringctl[15933]: Starting PF_RING module: [ OK ]
Aug 23 10:01:41 hostB systemd[1]: Started Start/stop pfring service.
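Since the dkms module is built per kernel, the reinstall step above matters after every kernel update. A small sketch (my own helper, not part of the pfring packages) that checks whether a pf_ring module exists for a given kernel, using the `/lib/modules/<kernel>/extra` path shown in the modinfo output above:

```shell
# Illustrative helper: report whether a pf_ring dkms module exists for
# a given kernel version under /lib/modules/<kernel>/extra.
pfring_mod_status() {
  if ls "/lib/modules/$1/extra"/pf_ring.ko* >/dev/null 2>&1; then
    echo "pf_ring module present for kernel $1"
  else
    echo "no pf_ring module for kernel $1 - run: yum reinstall pfring-dkms"
  fi
}

# Check the kernel we are actually running:
pfring_mod_status "$(uname -r)"
```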
I'm considering waiting about two weeks before updating Host A, so that if the traffic pattern appears again we can test whether Host A panics while Host B does not. What do you think?
Many, MANY thanks for your time @cardigliano
Hi @jkingcis, sounds great, please keep us posted.
Hi @cardigliano, I just yum-upgraded hostA and yum-reinstalled pfring-dkms after rebooting with the new kernel, but Suricata won't start, reporting "/usr/local/bin/suricata: error while loading shared libraries: libpfring.so: cannot open shared object file: No such file or directory". I have tried reinstalling the pfring package again, but that didn't work either. Do you know what's going on?
[root@hostA ~]# systemctl status suricata
● suricata.service - Suricata Intrusion Detection Service
Loaded: loaded (/usr/lib/systemd/system/suricata.service; enabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2018-08-28 17:37:19 CEST; 5min ago
Process: 4620 ExecStartPost=/usr/bin/chmod g+w /var/run/suricata/suricata-command.socket (code=exited, status=1/FAILURE)
Process: 4456 ExecStartPost=/usr/bin/sleep 120 (code=exited, status=0/SUCCESS)
Process: 4455 ExecStart=/usr/local/bin/suricata -c /etc/suricata/suricata.yaml $OPTIONS (code=exited, status=127)
Process: 4451 ExecStartPre=/usr/bin/chown suricata.suricata /var/run/suricata (code=exited, status=0/SUCCESS)
Process: 4449 ExecStartPre=/usr/bin/mkdir /var/run/suricata (code=exited, status=1/FAILURE)
Main PID: 4455 (code=exited, status=127)
Aug 28 17:35:19 hostA systemd[1]: Starting Suricata Intrusion Detection Service...
Aug 28 17:35:19 hostA mkdir[4449]: /usr/bin/mkdir: cannot create directory '/var/run/suricata': File exists
Aug 28 17:35:19 hostA suricata[4455]: /usr/local/bin/suricata: error while loading shared libraries: libpfring.so: cannot open shared object file: No such file or directory
Aug 28 17:35:19 hostA systemd[1]: suricata.service: main process exited, code=exited, status=127/n/a
Aug 28 17:37:19 hostA chmod[4620]: /usr/bin/chmod: cannot access '/var/run/suricata/suricata-command.socket': No such file or directory
Aug 28 17:37:19 hostA systemd[1]: suricata.service: control process exited, code=exited status=1
Aug 28 17:37:19 hostA systemd[1]: Failed to start Suricata Intrusion Detection Service.
Aug 28 17:37:19 hostA systemd[1]: Unit suricata.service entered failed state.
Aug 28 17:37:19 hostA systemd[1]: suricata.service failed.
[root@hostA ~]# systemctl status pf_ring
● pf_ring.service - Start/stop pfring service
Loaded: loaded (/usr/lib/systemd/system/pf_ring.service; enabled; vendor preset: disabled)
Active: active (exited) since Tue 2018-08-28 17:34:56 CEST; 7min ago
Process: 4028 ExecStop=/usr/local/bin/pf_ringctl stop (code=exited, status=0/SUCCESS)
Process: 4097 ExecStart=/usr/local/bin/pf_ringctl start (code=exited, status=0/SUCCESS)
Main PID: 4097 (code=exited, status=0/SUCCESS)
CGroup: /system.slice/pf_ring.service
Aug 28 17:34:51 hostA systemd[1]: Starting Start/stop pfring service...
Aug 28 17:34:56 hostA pf_ringctl[4097]: Starting PF_RING module: [ OK ]
Aug 28 17:34:56 hostA systemd[1]: Started Start/stop pfring service.
[root@hostA ~]# modinfo pf_ring
filename: /lib/modules/3.10.0-862.11.6.el7.x86_64/extra/pf_ring.ko.xz
alias: net-pf-27
version: 7.2.0
description: Packet capture acceleration and analysis
author: ntop.org
license: GPL
retpoline: Y
rhelversion: 7.5
srcversion: 5FB5FFB600582E9D7D045AF
depends:
vermagic: 3.10.0-862.11.6.el7.x86_64 SMP mod_unload modversions
parm: min_num_slots:Min number of ring slots (uint)
parm: perfect_rules_hash_size:Perfect rules hash size (uint)
parm: enable_tx_capture:Set to 1 to capture outgoing packets (uint)
parm: enable_frag_coherence:Set to 1 to handle fragments (flow coherence) in clusters (uint)
parm: enable_ip_defrag:Set to 1 to enable IP defragmentation(only rx traffic is defragmentead) (uint)
parm: quick_mode:Set to 1 to run at full speed but with upto one socket per interface (uint)
parm: force_ring_lock:Set to 1 to force ring locking (automatically enable with rss) (uint)
parm: enable_debug:Set to 1 to enable PF_RING debug tracing into the syslog, 2 for more verbosity (uint)
parm: transparent_mode:(deprecated) (uint)
[root@hostA ~]# cat /proc/net/pf_ring/info
PF_RING Version : 7.2.0 (7.2.0-stable:e0a75608922f751f7280bb7ae03d3be41226955b)
Total rings : 0
Standard (non ZC) Options
Ring slots : 4096
Slot version : 17
Capture TX : Yes [RX+TX]
IP Defragment : No
Socket Mode : Standard
Cluster Fragment Queue : 0
Cluster Fragment Discard : 0
Many thanks
I have tried comparing the libpfring.so libraries from Host A and Host B and they share the same MD5 hash, so it seems to be the same library on both hosts, but I don't understand why Host A won't start Suricata.
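One way to narrow this down is to ask the dynamic linker directly which libraries the binary fails to resolve. The `missing_libs` helper below is just an illustration using standard glibc tooling, and the `ld.so.conf.d` fix in the comment assumes libpfring.so was installed under /usr/local/lib:

```shell
# Illustrative diagnostic: list the shared libraries a binary cannot
# resolve (the condition behind "cannot open shared object file").
missing_libs() {
  ldd "$1" 2>/dev/null | awk '/not found/ {print $1}'
}

# On hostA this would be run against the failing binary:
#   missing_libs /usr/local/bin/suricata
# If libpfring.so shows up and the file actually lives in /usr/local/lib,
# registering that directory with the linker usually fixes it:
#   echo /usr/local/lib > /etc/ld.so.conf.d/pfring.conf && ldconfig

missing_libs /bin/sh   # a healthy binary prints nothing
```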
Hi @cardigliano, I worked around the problem by copying the Suricata binary from Host B to Host A, and now it seems to be working fine. My guess at the root cause is that Host A was upgraded directly from PF_RING 6.6.0 to 7.2.0 (across a major release), while Host B only went from 7.0.0 to 7.2.0, so there was some problem with the library paths or something related. I have added a recompilation of Suricata on every PF_RING major release to my update procedure documentation, so it seems to be solved :)
Hi @jkingcis, great, please reopen in case you experience a similar issue again. Thank you.
Hello,
After using PF_RING for some years, I'm facing a strange issue. The short story is:
Host A:
Host B:
Kernel dump of Host A:
I can see the assembly instruction that was running:
I also see the kernel is tainted (G), which means: 4096 - an out-of-tree module has been loaded; 8192 - an unsigned module has been loaded in a kernel supporting module signatures.
The backtrace, where I again see the RIP and packet_rcv just before the crash:
This portion of code is found in pfring.c here:
The PF_RING module version loaded is:
Kernel dump of Host B:
I can see the assembly instruction that was running:
I also see the kernel is tainted (G), which means: 4096 - an out-of-tree module has been loaded; 8192 - an unsigned module has been loaded in a kernel supporting module signatures.
The backtrace, where I again see the RIP and packet_rcv just before the crash:
This portion of code is found in pfring.c here:
The PF_RING module version loaded is:
I'm a bit afraid of upgrading the system and/or PF_RING in case it breaks something and the system stops working. What do you think? Both hosts monitor the same SPAN traffic, about 5 Gbps on average each. I'm thinking it may be a bug that occurs when a certain traffic pattern shows up.
Many thanks in advance,
Best regards