ntop / PF_RING

High-speed packet processing framework
http://www.ntop.org
GNU Lesser General Public License v2.1

Consistent deadlock (?) after several days of correct operation #451

Closed thelamb closed 5 years ago

thelamb commented 5 years ago

This has occurred in two separate installations. After about 2-3 days of successful operation, PF_RING appears to deadlock (no call to pfring_recv returns even though packets are hitting the interface). Restarting the client process resumes normal operation for 2-3 days but the problem consistently returns.

How do I know the deadlock is not in our code? All threads are waiting for 'poll' to return, and the pf_ring stats are not increasing even though ifconfig stats are.
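For illustration, here is a minimal sketch (not the actual NIDS code) of how this symptom shows up from inside a process using the standard PF_RING C API: pfring_recv() with wait_for_packet set blocks in poll(), while pfring_stats() exposes the same per-ring counters reported under /proc/net/pf_ring, which stay flat in the hung state even though the NIC keeps receiving.

```c
#include <pfring.h>
#include <stdio.h>

/* Print the per-ring counters (the same numbers /proc/net/pf_ring reports).
 * In the hung state, recv stops increasing while interface stats keep growing. */
void dump_ring_stats(pfring *ring) {
  pfring_stat st;
  if (pfring_stats(ring, &st) == 0)
    printf("recv=%llu drop=%llu\n",
           (unsigned long long)st.recv, (unsigned long long)st.drop);
}

/* Typical blocking receive: with wait_for_packet = 1 the thread parks in
 * poll() until the kernel flags a new packet; this is where all threads
 * were observed to be stuck. */
void capture_loop(pfring *ring) {
  struct pfring_pkthdr hdr;
  u_char *pkt;
  while (pfring_recv(ring, &pkt, 0, &hdr, 1) >= 0) {
    /* process pkt ... */
  }
}
```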

pfring/info

PF_RING Version          : 7.2.0 (unknown)
Total rings              : 15

Standard (non ZC) Options
Ring slots               : 4096
Slot version             : 17
Capture TX               : Yes [RX+TX]
IP Defragment            : No
Socket Mode              : Standard
Cluster Fragment Queue   : 0
Cluster Fragment Discard : 0

Note that the other installation where this occurred is running 7.0.0.

pf_ring stats vs ifconfig

Executed cat /proc/net/pf_ring/*-* | grep "Tot P" and ifconfig ens1f0 twice, with a few seconds between the commands. Output is in the comments below.

Note that the stats returned from pf_ring are not increasing while the RX bytes returned from ifconfig are.

Next

We currently have the process running in this deadlock state, so I'm able to get more information from the system as necessary. We'd appreciate your help to get to the bottom of this.

I understand that a new version was recently released, but the changelog does not mention anything that seems to address this issue.

Thanks for your time,

Chris

cardigliano commented 5 years ago

@thelamb what application are you using? Does it implement multithreaded capture? Unfortunately I am not able to download the files you provided. Could you try reproducing the same with pfcount?

thelamb commented 5 years ago

We actually have two applications showing the same behavior: our own NIDS (we have over 9 years of experience with PF_RING and hundreds of successful deployments) and a tool that dumps traffic to disk, based on pfdump. The behavior occurs both when only the NIDS is running and when the NIDS and the disk-dump tool run alongside each other.

We are not using ZC, BTW.

I will ask the customer to run pfcount in a tmux session and monitor it over the next few days.

The attached files contain the following:

pf_ring_packets.txt

Tot Packets : 23718   Tot Pkt Lost : 0
Tot Packets : 7935    Tot Pkt Lost : 0
Tot Packets : 8587    Tot Pkt Lost : 0
Tot Packets : 21133   Tot Pkt Lost : 0
Tot Packets : 10764   Tot Pkt Lost : 0
Tot Packets : 38644   Tot Pkt Lost : 0
Tot Packets : 21841   Tot Pkt Lost : 0
Tot Packets : 9139    Tot Pkt Lost : 0
Tot Packets : 22077   Tot Pkt Lost : 0
Tot Packets : 10679   Tot Pkt Lost : 0
Tot Packets : 7382    Tot Pkt Lost : 0
Tot Packets : 10629   Tot Pkt Lost : 0
Tot Packets : 7125    Tot Pkt Lost : 0
Tot Packets : 8788    Tot Pkt Lost : 0
Tot Packets : 33134   Tot Pkt Lost : 0

Second run

Tot Packets : 23718   Tot Pkt Lost : 0
Tot Packets : 7935    Tot Pkt Lost : 0
Tot Packets : 8587    Tot Pkt Lost : 0
Tot Packets : 21133   Tot Pkt Lost : 0
Tot Packets : 10764   Tot Pkt Lost : 0
Tot Packets : 38644   Tot Pkt Lost : 0
Tot Packets : 21841   Tot Pkt Lost : 0
Tot Packets : 9139    Tot Pkt Lost : 0
Tot Packets : 22077   Tot Pkt Lost : 0
Tot Packets : 10679   Tot Pkt Lost : 0
Tot Packets : 7382    Tot Pkt Lost : 0
Tot Packets : 10629   Tot Pkt Lost : 0
Tot Packets : 7125    Tot Pkt Lost : 0
Tot Packets : 8788    Tot Pkt Lost : 0
Tot Packets : 33134   Tot Pkt Lost : 0

ifconfig_ens1f0.txt

ens1f0    Link encap:Ethernet  HWaddr XXXXXXXXX
          UP BROADCAST RUNNING PROMISC MULTICAST  MTU:9000  Metric:1
          RX packets:59536753823 errors:0 dropped:1630116 overruns:590 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0
          txqueuelen:1000
          RX bytes:29364560181158 (29.3 TB)  TX bytes:0 (0.0 B)
          Memory:de700000-de7fffff

Second run

ens1f0    Link encap:Ethernet  HWaddr XXXXXXXX
          UP BROADCAST RUNNING PROMISC MULTICAST  MTU:9000  Metric:1
          RX packets:59537228472 errors:0 dropped:1630128 overruns:590 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0
          txqueuelen:1000
          RX bytes:29364736380776 (29.3 TB)  TX bytes:0 (0.0 B)
          Memory:de700000-de7fffff

cardigliano commented 5 years ago

@thelamb got it, thank you for the information. Let's see what happens with pfcount; however, it seems to be somehow related to the NIDS. Does it implement multithreaded capture? Btw, what card/driver model are you using?

thelamb commented 5 years ago

The capture in the NIDS is multithreaded; in the pfdump-based tool it is single-threaded.

however it seems to be somehow related to the NIDS

Can you elaborate on what you base this on?

Btw, what card/driver model are you using?

It's an I350 card, but I don't know the exact model. We have already ruled out a hardware fault by replacing the entire server.

cardigliano commented 5 years ago

Do you have some locking when calling pfring_recv()? Or are you passing the PF_RING_REENTRANT flag to pfring_open()? Please note that without PF_RING_REENTRANT, concurrent calls to pfring_recv on the same handle might lead to corruption.
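A minimal sketch of the shared-handle case this refers to (PF_RING_REENTRANT and pfring_open are from the public pfring.h API; the device name and snaplen are placeholders): when a single handle is polled by several threads, the flag makes pfring_recv() on that handle safe to call concurrently.

```c
#include <pfring.h>

/* One handle intended to be shared by multiple capture threads:
 * PF_RING_REENTRANT adds internal locking around pfring_recv(). */
pfring *open_shared_ring(const char *dev) {
  return pfring_open(dev, 1600 /* snaplen */,
                     PF_RING_PROMISC | PF_RING_REENTRANT);
}
```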

thelamb commented 5 years ago

There is no lock around pfring_recv, but each thread calls pfring_open itself and passes its unique handle to pfring_recv. So pfring_recv is never called from different threads with the same handle.
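A rough sketch of this per-thread pattern (the worker function and its device argument are illustrative, not the NIDS code): each thread opens and owns its own ring, so no handle is ever shared and PF_RING_REENTRANT is not required.

```c
#include <pfring.h>

/* Each capture thread runs this with its own device/queue argument and is
 * the only caller of pfring_recv() on the handle it opened. */
static void *capture_worker(void *arg) {
  pfring *ring = pfring_open((const char *)arg, 1600, PF_RING_PROMISC);
  if (ring == NULL) return NULL;
  pfring_enable_ring(ring);

  struct pfring_pkthdr hdr;
  u_char *pkt;
  while (pfring_recv(ring, &pkt, 0, &hdr, 1) >= 0) {
    /* process pkt ... */
  }

  pfring_close(ring);
  return NULL;
}
```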

Not sure why we did it this way (a choice made 9 years ago that hasn't changed since).

cardigliano commented 5 years ago

@thelamb I guess you are using multiple sockets with kernel clustering configured; that makes sense. Please let me know about the pfcount test. If a single pfcount instance works, it would be interesting to check with multiple pfcount instances with clustering (-c <cluster id>). Thank you.
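A sketch of the multi-socket kernel clustering setup being referred to, and of what the equivalent pfcount test exercises (API names are from pfring.h; the cluster id and balancing policy are only illustrative):

```c
#include <pfring.h>

/* Each application socket joins the same cluster id; the kernel then
 * balances packets across all cluster members (here per flow). Running
 * several pfcount instances with the same -c value exercises the same path. */
pfring *open_cluster_member(const char *dev, u_int cluster_id) {
  pfring *ring = pfring_open(dev, 1600, PF_RING_PROMISC);
  if (ring == NULL) return NULL;

  if (pfring_set_cluster(ring, cluster_id, cluster_per_flow) != 0) {
    pfring_close(ring);
    return NULL;
  }

  pfring_enable_ring(ring);
  return ring;
}
```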

cardigliano commented 5 years ago

@thelamb any update about the test with pfcount? Thank you.

cardigliano commented 5 years ago

Closing due to inactivity and inability to reproduce; please reopen if needed.