vel21ripn / nDPI

Open Source Deep Packet Inspection Software Toolkit
http://www.ntop.org
GNU Lesser General Public License v3.0

iptables 1.8.10 causes ksoftirqd 100% CPU #188

Closed: netcons closed this issue 7 months ago

netcons commented 7 months ago

Since upgrading to iptables-nft 1.8.10, we observe CPU utilization by the ksoftirqd process growing gradually to 100%. Depending on load and hardware this takes about 30 minutes; throughput is then severely degraded until iptables-restore is run.

Tested on CentOS Stream 9 x86_64 with kernel versions 6.1.78, 6.1.58, 6.1.12 and nDPI commits d5029c5, 90514cb, 9a6412b.

Downgrading to iptables-nft 1.8.8, 1.8.7, or 1.8.1 resolves the issue.

Module loaded with: modprobe xt_ndpi ndpi_enable_flow=1 ndpi_flow_opt=SCFVR

vel21ripn commented 7 months ago

I am currently using iptables version 1.8.9 and do not experience this problem. I'll try to update iptables to version 1.8.10. I'm using iptables with my own patches, so updating is neither quick nor easy.

vel21ripn commented 7 months ago

Could you run "perf record -F 99 -a -g --kernel-callchains -- sleep 20" once the CPU load exceeds 50% and send the output of "perf report"?

What hardware platform are you using nDPI on? How much traffic (approximately) goes through nDPI?

Can you check if there is a problem in iptables-1.8.10-legacy?
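
One way to tell the two variants apart: as far as I know, iptables 1.8.x prints the backend in parentheses in its version line, for example "iptables v1.8.10 (nf_tables)" or "iptables v1.8.10 (legacy)". A minimal sketch that classifies such a line; the function name iptables_backend is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: classify an `iptables -V` version line by backend.
# Assumes the 1.8.x format, e.g. "iptables v1.8.10 (nf_tables)".
iptables_backend() {
    case "$1" in
        *"(nf_tables)"*) echo nf_tables ;;
        *"(legacy)"*)    echo legacy ;;
        *)               echo unknown ;;   # older or unexpected format
    esac
}

# Usage on a live system:
#   iptables_backend "$(iptables -V)"
```

If this reports nf_tables, the same rule set would have to be loaded with the legacy binary to make the comparison.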

netcons commented 7 months ago

Thanks for your thoughts on this, will gather the information.

netcons commented 7 months ago

Apologies, we are unable to replicate this. Will record the perf info if we experience it again.

We had this on 300+ nodes with various hardware for 3 days. Reverting from iptables 1.8.10 to 1.8.8 resolved the issue instantly.

All LAN to WAN traffic goes through nDPI.

Probably something completely unrelated, thanks again!

P.S. Last week Wednesday's MS updates coincide with the timeline.

netcons commented 7 months ago

It looks like the cause was a third-party tool that creates rules with no jump action (no -j target) to account for LAN-to-WAN traffic per port. Removing this tool or downgrading to 1.8.8 resolves the issue.

Snippet:

    elif [ $Proto = "tcp" ] || [ $Proto = "udp" ] &&\
         [ $Type = "incoming" ]
    then
        TxRule="$OutChain -o $WanIf -p $Proto -m $Proto --sport $Port"
        RxRule="$InChain -i $WanIf -p $Proto -m $Proto --dport $Port"
    elif [ $Proto = "tcp" ] || [ $Proto = "udp" ] &&\
         [ $Type = "outgoing" ]
    then
        TxRule="$OutChain -o $WanIf -p $Proto -m $Proto --dport $Port"
        RxRule="$InChain -i $WanIf -p $Proto -m $Proto --sport $Port"
    else
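
For anyone who wants to check whether their own rule set contains counting rules like these, here is a rough sketch that filters iptables-save output for rules with no -j or -g target. It assumes the usual "-A CHAIN ..." line format; the function name list_targetless_rules is made up for this example, and a rule whose comment text happens to contain "-j" as a separate word would be missed.

```shell
#!/bin/sh
# Hypothetical helper: read iptables-save output on stdin and print every
# appended rule ("-A ...") that has no jump or goto target.
list_targetless_rules() {
    awk '
        /^-A / {
            has_target = 0
            # Scan the rule word by word for a target option.
            for (i = 1; i <= NF; i++) {
                if ($i == "-j" || $i == "--jump" || $i == "-g" || $i == "--goto") {
                    has_target = 1
                    break
                }
            }
            if (!has_target) print
        }
    '
}

# Usage on a live system:
#   iptables-save | list_targetless_rules
```

This only lists candidate rules; it does not show that they are what drives the ksoftirqd load.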