ntop / ntopng

Web-based Traffic and Security Network Traffic Monitoring
http://www.ntop.org
GNU General Public License v3.0

NtopNG ZC interface queue goes to 100% CPU and starts dropping all packets #8196

Closed: DerRealKeyser closed this issue 8 months ago

DerRealKeyser commented 9 months ago

Environment:

What happened: I have a 10-core Xeon server with two 10GbE interfaces used for receiving a mirror of an MC-LAG in an Aruba VSX core stack: one interface receives one switch's traffic on the cross-chassis MC-LAG, and the other interface receives the other switch's traffic.

The two interfaces are both Intel ixgbe running Zero Copy with RSS=4. Starting NtopNG with 8 virtual interface queues (one for each RSS queue on each 10GbE port), plus a view:all interface to actually see everything, works beautifully for a while: everything behaves exactly as expected, and server load is very low. Here's the config:

NTOPNG Config:
root@ntopng:/var/log# cat /etc/ntopng/ntopng.conf
-x=200000
-X=500000
-G=/var/run/ntopng.pid
-i=zc:ens1f0@0
-i=zc:ens1f0@1
-i=zc:ens1f0@2
-i=zc:ens1f0@3
-i=zc:ens1f1@0
-i=zc:ens1f1@1
-i=zc:ens1f1@2
-i=zc:ens1f1@3
-i=view:all
-m=10.0.0.0/8,172.16.0.0/12,192.168.200.0/24,192.168.204.0/24
--ignore-vlans
--ignore-macs
--capture-direction 1
-F="clickhouse;127.0.0.1;ntopng;default;xxxxxxxxxxxxx"
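For context, a minimal sketch of how eight ZC queues like these are typically produced with the PF_RING ZC-aware ixgbe driver. The file paths follow the layout used by the ntop packages; treat the exact paths and values as assumptions about this install:

# Request 4 RSS queues per port from the ZC-aware ixgbe driver
# (one comma-separated value per port)
echo "RSS=4,4" > /etc/pf_ring/zc/ixgbe/ixgbe.conf

# Reserve hugepages for the ZC buffers (NUMA node 0 in this sketch)
echo "node=0 hugepagenumber=1024" > /etc/pf_ring/hugepages.conf

# Reload pf_ring so the ZC driver comes up with the new settings;
# each port then exposes zc:ens1f0@0..3 and zc:ens1f1@0..3
systemctl restart pf_ring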

But after between 2 and 15 minutes I start losing all packets on usually one, but sometimes two, of the virtual interfaces, and htop shows one CPU core persistently at 100%. If two interfaces start dropping traffic, two CPU cores go to 100%. After that, NtopNG keeps running (and seems to work), except it is missing traffic from one or two queues and is thus not reporting/logging everything. When an interface starts dropping all packets, NtopNG also almost stops logging an otherwise frequently recurring error in ntopng.log:

For example, here is the log from startup until 12:03:28, where I lose two interfaces (two threads at 100%). Only one more Modbus line is logged in the next 40 minutes, until I shut down NtopNG with sudo systemctl stop ntopng. Also notice it takes a long time to shut down NtopNG, and in the log there are no summaries or statistics for the two interface queues that dropped all traffic; it's like they went missing (in this example a third one actually went away as well). Oddly enough, it always seems to be the same (two) interface queues that hang and disappear: -i=zc:ens1f1@1 and -i=zc:ens1f1@2

NTOPNG.LOG:
Jan 31 11:52:07 ntopng ntopng[9473]: 31/Jan/2024 11:52:07 [startup.lua:242] Completed startup.lua
Jan 31 11:52:07 ntopng ntopng[9473]: 31/Jan/2024 11:52:07 [PeriodicActivities.cpp:167] Found 10 activities
Jan 31 11:52:07 ntopng ntopng: 31/Jan/2024 11:52:07 [FlowChecksLoader.cpp:290] WARNING: Unable to find flow check 'udp_unidirectional': skipping it
Jan 31 11:52:07 ntopng ntopng[9473]: 31/Jan/2024 11:52:07 [FlowChecksLoader.cpp:290] WARNING: Unable to find flow check 'udp_unidirectional': skipping it
Jan 31 11:52:07 ntopng ntopng: 31/Jan/2024 11:52:07 [FlowChecksLoader.cpp:290] WARNING: Unable to find flow check 'tcp_issues_generic': skipping it
Jan 31 11:52:07 ntopng ntopng[9473]: 31/Jan/2024 11:52:07 [FlowChecksLoader.cpp:290] WARNING: Unable to find flow check 'tcp_issues_generic': skipping it
Jan 31 11:52:07 ntopng ntopng[9473]: 31/Jan/2024 11:52:07 [NetworkInterface.cpp:3719] Started packet polling on interface zc:ens1f0@0 [id: 33]...
Jan 31 11:52:07 ntopng ntopng[9473]: 31/Jan/2024 11:52:07 [NetworkInterface.cpp:3719] Started packet polling on interface zc:ens1f0@1 [id: 34]...
Jan 31 11:52:07 ntopng ntopng[9473]: 31/Jan/2024 11:52:07 [NetworkInterface.cpp:3719] Started packet polling on interface zc:ens1f0@2 [id: 35]...
Jan 31 11:52:07 ntopng ntopng[9473]: 31/Jan/2024 11:52:07 [NetworkInterface.cpp:3719] Started packet polling on interface zc:ens1f0@3 [id: 36]...
Jan 31 11:52:07 ntopng ntopng[9473]: 31/Jan/2024 11:52:07 [NetworkInterface.cpp:3719] Started packet polling on interface zc:ens1f1@0 [id: 37]...
Jan 31 11:52:07 ntopng ntopng[9473]: 31/Jan/2024 11:52:07 [NetworkInterface.cpp:3719] Started packet polling on interface zc:ens1f1@1 [id: 38]...
Jan 31 11:52:07 ntopng ntopng[9473]: 31/Jan/2024 11:52:07 [NetworkInterface.cpp:3719] Started packet polling on interface zc:ens1f1@2 [id: 39]...
Jan 31 11:52:07 ntopng ntopng[9473]: 31/Jan/2024 11:52:07 [NetworkInterface.cpp:3719] Started packet polling on interface zc:ens1f1@3 [id: 40]...
Jan 31 11:52:07 ntopng ntopng[9473]: 31/Jan/2024 11:52:07 [NetworkInterface.cpp:3719] Started packet polling on interface view:all [id: 13]...
Jan 31 11:52:25 ntopng ntopng[9473]: 31/Jan/2024 11:52:25 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 2]
Jan 31 11:52:34 ntopng ntopng[9473]: 31/Jan/2024 11:52:34 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 4]
Jan 31 11:53:26 ntopng ntopng[9473]: 31/Jan/2024 11:53:26 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 6]
Jan 31 11:53:36 ntopng ntopng[9473]: 31/Jan/2024 11:53:36 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 8]
Jan 31 11:54:28 ntopng ntopng[9473]: 31/Jan/2024 11:54:28 [ModbusStats.cpp:101] [Modbus/TCP] Invalid packet [30 vs 35]
Jan 31 11:54:28 ntopng ntopng[9473]: 31/Jan/2024 11:54:28 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 10]
Jan 31 11:54:36 ntopng ntopng[9473]: 31/Jan/2024 11:54:36 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 12]
Jan 31 11:55:28 ntopng ntopng[9473]: 31/Jan/2024 11:55:28 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 14]
Jan 31 11:55:37 ntopng ntopng[9473]: 31/Jan/2024 11:55:37 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 16]
Jan 31 11:56:39 ntopng ntopng[9473]: 31/Jan/2024 11:56:39 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 18]
Jan 31 11:57:02 ntopng ntopng[9473]: 31/Jan/2024 11:57:02 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 20]
Jan 31 11:57:41 ntopng ntopng[9473]: 31/Jan/2024 11:57:41 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 22]
Jan 31 11:58:02 ntopng ntopng[9473]: 31/Jan/2024 11:58:02 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 24]
Jan 31 11:58:43 ntopng ntopng[9473]: 31/Jan/2024 11:58:43 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 26]
Jan 31 11:59:03 ntopng ntopng[9473]: 31/Jan/2024 11:59:03 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 28]
Jan 31 11:59:50 ntopng ntopng[9473]: 31/Jan/2024 11:59:50 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 30]
Jan 31 12:00:06 ntopng ntopng[9473]: 31/Jan/2024 12:00:06 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 32]
Jan 31 12:00:52 ntopng ntopng[9473]: 31/Jan/2024 12:00:52 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 34]
Jan 31 12:01:08 ntopng ntopng[9473]: 31/Jan/2024 12:01:08 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 36]
Jan 31 12:01:56 ntopng ntopng[9473]: 31/Jan/2024 12:01:56 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 38]
Jan 31 12:02:08 ntopng ntopng[9473]: 31/Jan/2024 12:02:08 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 40]
Jan 31 12:02:59 ntopng ntopng[9473]: 31/Jan/2024 12:02:59 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 42]
Jan 31 12:03:09 ntopng ntopng[9473]: 31/Jan/2024 12:03:09 [ModbusStats.cpp:313] [Modbus/TCP] 4 bytes leftover [num_pkts: 43]
Jan 31 12:03:09 ntopng ntopng[9473]: 31/Jan/2024 12:03:09 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 44]
Jan 31 12:03:19 ntopng ntopng[9473]: 31/Jan/2024 12:03:19 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 54]
Jan 31 12:03:19 ntopng ntopng[9473]: 31/Jan/2024 12:03:19 [ModbusStats.cpp:313] [Modbus/TCP] 1 bytes leftover [num_pkts: 59]
Jan 31 12:03:19 ntopng ntopng[9473]: 31/Jan/2024 12:03:19 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 63]
Jan 31 12:03:26 ntopng ntopng[9473]: 31/Jan/2024 12:03:26 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 66]
Jan 31 12:03:26 ntopng ntopng[9473]: 31/Jan/2024 12:03:26 [ModbusStats.cpp:313] [Modbus/TCP] 11 bytes leftover [num_pkts: 68]
Jan 31 12:03:26 ntopng ntopng[9473]: 31/Jan/2024 12:03:26 [ModbusStats.cpp:313] [Modbus/TCP] 11 bytes leftover [num_pkts: 70]
Jan 31 12:03:26 ntopng ntopng[9473]: 31/Jan/2024 12:03:26 [ModbusStats.cpp:313] [Modbus/TCP] 11 bytes leftover [num_pkts: 72]
Jan 31 12:33:46 ntopng ntopng[9473]: 31/Jan/2024 12:33:46 [ModbusStats.cpp:313] [Modbus/TCP] 3 bytes leftover [num_pkts: 104]
Jan 31 12:38:06 ntopng ntopng[9473]: 31/Jan/2024 12:38:06 [main.cpp:49] Shutting down...
Jan 31 12:38:06 ntopng ntopng[9473]: 31/Jan/2024 12:38:06 [PF_RINGInterface.cpp:303] Terminated packet polling for zc:ens1f0@2
Jan 31 12:38:06 ntopng ntopng[9473]: 31/Jan/2024 12:38:06 [ViewInterface.cpp:787] Flow dump thread completed for view:all
Jan 31 12:38:06 ntopng ntopng[9473]: 31/Jan/2024 12:38:06 [PF_RINGInterface.cpp:303] Terminated packet polling for zc:ens1f0@1
Jan 31 12:38:06 ntopng ntopng[9473]: 31/Jan/2024 12:38:06 [PF_RINGInterface.cpp:303] Terminated packet polling for zc:ens1f1@0
Jan 31 12:38:06 ntopng ntopng[9473]: 31/Jan/2024 12:38:06 [PF_RINGInterface.cpp:303] Terminated packet polling for zc:ens1f0@0
Jan 31 12:38:06 ntopng ntopng[9473]: 31/Jan/2024 12:38:06 [PF_RINGInterface.cpp:303] Terminated packet polling for zc:ens1f1@3
Jan 31 12:38:06 ntopng ntopng[9473]: 31/Jan/2024 12:38:06 [PF_RINGInterface.cpp:303] Terminated packet polling for zc:ens1f0@3
Jan 31 12:38:06 ntopng ntopng[9473]: 31/Jan/2024 12:38:06 [main.cpp:478] Terminating...
Jan 31 12:38:06 ntopng ntopng[9473]: 31/Jan/2024 12:38:06 [PeriodicActivities.cpp:98] Terminated periodic activites...
Jan 31 12:38:10 ntopng ntopng[9473]: 31/Jan/2024 12:38:10 [ProtoStats.cpp:34] [IPv4] 263.63 GB/302.78 M Packets
Jan 31 12:38:10 ntopng ntopng[9473]: 31/Jan/2024 12:38:10 [ProtoStats.cpp:34] [IPv6] 0 B/0.00 Packets
Jan 31 12:38:10 ntopng ntopng[9473]: 31/Jan/2024 12:38:10 [ProtoStats.cpp:34] [ARP] 0 B/0.00 Packets
Jan 31 12:38:10 ntopng ntopng[9473]: 31/Jan/2024 12:38:10 [ProtoStats.cpp:34] [MPLS] 0 B/0.00 Packets
Jan 31 12:38:10 ntopng ntopng[9473]: 31/Jan/2024 12:38:10 [ProtoStats.cpp:34] [Other] 0 B/0.00 Packets
Jan 31 12:38:11 ntopng ntopng[9473]: 31/Jan/2024 12:38:10 [NetworkInterface.cpp:3516] Flow alerts dump thread terminated for view:all
Jan 31 12:38:11 ntopng ntopng[9473]: 31/Jan/2024 12:38:10 [NetworkInterface.cpp:3579] Host alerts dump thread terminated for view:all
Jan 31 12:38:11 ntopng ntopng[9473]: 31/Jan/2024 12:38:10 [Ntop.cpp:3006] Polling shut down [interface: view:all]
Jan 31 12:38:11 ntopng ntopng[9473]: 31/Jan/2024 12:38:10 [ProtoStats.cpp:34] [IPv4] 34.59 GB/34.20 M Packets
Jan 31 12:38:11 ntopng ntopng[9473]: 31/Jan/2024 12:38:10 [ProtoStats.cpp:34] [IPv6] 0 B/0.00 Packets
Jan 31 12:38:11 ntopng ntopng[9473]: 31/Jan/2024 12:38:10 [ProtoStats.cpp:34] [ARP] 18.70 KB/300.00 Packets
Jan 31 12:38:11 ntopng ntopng[9473]: 31/Jan/2024 12:38:10 [ProtoStats.cpp:34] [MPLS] 0 B/0.00 Packets
Jan 31 12:38:11 ntopng ntopng[9473]: 31/Jan/2024 12:38:10 [ProtoStats.cpp:34] [Other] 12.40 KB/92.00 Packets
Jan 31 12:38:12 ntopng ntopng[9473]: 31/Jan/2024 12:38:11 [NetworkInterface.cpp:3516] Flow alerts dump thread terminated for zc:ens1f0@0
Jan 31 12:38:12 ntopng ntopng[9473]: 31/Jan/2024 12:38:11 [NetworkInterface.cpp:3579] Host alerts dump thread terminated for zc:ens1f0@0
Jan 31 12:38:12 ntopng ntopng[9473]: 31/Jan/2024 12:38:11 [Ntop.cpp:3019] Polling shut down [interface: zc:ens1f0@0]
Jan 31 12:38:12 ntopng ntopng[9473]: 31/Jan/2024 12:38:11 [ProtoStats.cpp:34] [IPv4] 38.29 GB/35.85 M Packets
Jan 31 12:38:12 ntopng ntopng[9473]: 31/Jan/2024 12:38:11 [ProtoStats.cpp:34] [IPv6] 0 B/0.00 Packets
Jan 31 12:38:12 ntopng ntopng[9473]: 31/Jan/2024 12:38:11 [ProtoStats.cpp:34] [ARP] 0 B/0.00 Packets
Jan 31 12:38:12 ntopng ntopng[9473]: 31/Jan/2024 12:38:11 [ProtoStats.cpp:34] [MPLS] 0 B/0.00 Packets
Jan 31 12:38:12 ntopng ntopng[9473]: 31/Jan/2024 12:38:11 [ProtoStats.cpp:34] [Other] 91 B/1.00 Packets
Jan 31 12:38:13 ntopng ntopng[9473]: 31/Jan/2024 12:38:12 [NetworkInterface.cpp:3579] Host alerts dump thread terminated for zc:ens1f0@1
Jan 31 12:38:13 ntopng ntopng[9473]: 31/Jan/2024 12:38:12 [NetworkInterface.cpp:3516] Flow alerts dump thread terminated for zc:ens1f0@1
Jan 31 12:38:13 ntopng ntopng[9473]: 31/Jan/2024 12:38:12 [Ntop.cpp:3019] Polling shut down [interface: zc:ens1f0@1]
Jan 31 12:38:13 ntopng ntopng[9473]: 31/Jan/2024 12:38:12 [ProtoStats.cpp:34] [IPv4] 38.07 GB/36.12 M Packets
Jan 31 12:38:13 ntopng ntopng[9473]: 31/Jan/2024 12:38:12 [ProtoStats.cpp:34] [IPv6] 0 B/0.00 Packets
Jan 31 12:38:13 ntopng ntopng[9473]: 31/Jan/2024 12:38:12 [ProtoStats.cpp:34] [ARP] 0 B/0.00 Packets
Jan 31 12:38:13 ntopng ntopng[9473]: 31/Jan/2024 12:38:12 [ProtoStats.cpp:34] [MPLS] 0 B/0.00 Packets
Jan 31 12:38:13 ntopng ntopng[9473]: 31/Jan/2024 12:38:12 [ProtoStats.cpp:34] [Other] 0 B/0.00 Packets
Jan 31 12:38:14 ntopng ntopng[9473]: 31/Jan/2024 12:38:13 [NetworkInterface.cpp:3516] Flow alerts dump thread terminated for zc:ens1f0@2
Jan 31 12:38:14 ntopng ntopng[9473]: 31/Jan/2024 12:38:13 [NetworkInterface.cpp:3579] Host alerts dump thread terminated for zc:ens1f0@2
Jan 31 12:38:14 ntopng ntopng[9473]: 31/Jan/2024 12:38:13 [Ntop.cpp:3019] Polling shut down [interface: zc:ens1f0@2]
Jan 31 12:38:14 ntopng ntopng[9473]: 31/Jan/2024 12:38:13 [ProtoStats.cpp:34] [IPv4] 34.95 GB/33.31 M Packets
Jan 31 12:38:14 ntopng ntopng[9473]: 31/Jan/2024 12:38:13 [ProtoStats.cpp:34] [IPv6] 0 B/0.00 Packets
Jan 31 12:38:14 ntopng ntopng[9473]: 31/Jan/2024 12:38:13 [ProtoStats.cpp:34] [ARP] 0 B/0.00 Packets
Jan 31 12:38:14 ntopng ntopng[9473]: 31/Jan/2024 12:38:13 [ProtoStats.cpp:34] [MPLS] 0 B/0.00 Packets
Jan 31 12:38:14 ntopng ntopng[9473]: 31/Jan/2024 12:38:13 [ProtoStats.cpp:34] [Other] 0 B/0.00 Packets
Jan 31 12:38:15 ntopng ntopng[9473]: 31/Jan/2024 12:38:14 [NetworkInterface.cpp:3579] Host alerts dump thread terminated for zc:ens1f0@3
Jan 31 12:38:15 ntopng ntopng[9473]: 31/Jan/2024 12:38:14 [NetworkInterface.cpp:3516] Flow alerts dump thread terminated for zc:ens1f0@3
Jan 31 12:38:15 ntopng ntopng[9473]: 31/Jan/2024 12:38:14 [Ntop.cpp:3019] Polling shut down [interface: zc:ens1f0@3]
Jan 31 12:38:15 ntopng ntopng[9473]: 31/Jan/2024 12:38:14 [ProtoStats.cpp:34] [IPv4] 49.11 GB/72.32 M Packets
Jan 31 12:38:15 ntopng ntopng[9473]: 31/Jan/2024 12:38:14 [ProtoStats.cpp:34] [IPv6] 0 B/0.00 Packets
Jan 31 12:38:15 ntopng ntopng[9473]: 31/Jan/2024 12:38:14 [ProtoStats.cpp:34] [ARP] 26.60 KB/426.00 Packets
Jan 31 12:38:15 ntopng ntopng[9473]: 31/Jan/2024 12:38:14 [ProtoStats.cpp:34] [MPLS] 0 B/0.00 Packets
Jan 31 12:38:15 ntopng ntopng[9473]: 31/Jan/2024 12:38:14 [ProtoStats.cpp:34] [Other] 12.40 KB/92.00 Packets
Jan 31 12:38:16 ntopng ntopng[9473]: 31/Jan/2024 12:38:15 [NetworkInterface.cpp:3579] Host alerts dump thread terminated for zc:ens1f1@0
Jan 31 12:38:16 ntopng ntopng[9473]: 31/Jan/2024 12:38:15 [NetworkInterface.cpp:3516] Flow alerts dump thread terminated for zc:ens1f1@0
Jan 31 12:38:16 ntopng ntopng[9473]: 31/Jan/2024 12:38:15 [Ntop.cpp:3019] Polling shut down [interface: zc:ens1f1@0]
Jan 31 12:38:16 ntopng ntopng[9473]: 31/Jan/2024 12:38:15 [ProtoStats.cpp:34] [IPv4] 8.67 GB/11.77 M Packets
Jan 31 12:38:16 ntopng ntopng[9473]: 31/Jan/2024 12:38:15 [ProtoStats.cpp:34] [IPv6] 0 B/0.00 Packets
Jan 31 12:38:16 ntopng ntopng[9473]: 31/Jan/2024 12:38:15 [ProtoStats.cpp:34] [ARP] 0 B/0.00 Packets
Jan 31 12:38:16 ntopng ntopng[9473]: 31/Jan/2024 12:38:15 [ProtoStats.cpp:34] [MPLS] 0 B/0.00 Packets
Jan 31 12:38:16 ntopng ntopng[9473]: 31/Jan/2024 12:38:15 [ProtoStats.cpp:34] [Other] 0 B/0.00 Packets
root@ntopng:/var/log#

How did you reproduce it?

The same thing happens at every restart, but the timing varies a lot: sometimes it takes a minute, sometimes 15, before it happens. Everything is up to date; I just ran sudo apt update followed by sudo apt upgrade to make sure everything is at the latest version.

I did notice that once it starts happening, the drops appear on the affected interfaces (and on view:all), and shortly after that the charts only show the alert counters and flow counters, but not the local and remote host counters. Could this be related?

Any idea what might be causing this?

DerRealKeyser commented 9 months ago

[screenshot]

Perhaps I should mention that throughput/load does not seem to influence when it happens; it happens at night under no load/throughput as well.

DerRealKeyser commented 9 months ago

Hmm, further to this case: it seems it might actually not be NtopNG-related, but rather pf_ring or hardware.

If I let a continuous pfcount -i zc:ens1f1@1 run, it briefly shows a little packet loss at about the same random 2-15 minute intervals. But unlike NtopNG, it continues receiving packets and showing their stats; it does not drop everything from then on the way NtopNG does.
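For reference, a minimal sketch of watching every queue at once with one pfcount per queue (pfcount is the PF_RING demo tool used above; this assumes ntopng is stopped so the ZC queues are free to open):

# Start one pfcount per RSS queue on ens1f1, each logging to its own file
for q in 0 1 2 3; do
  pfcount -i zc:ens1f1@$q > /tmp/pfcount_ens1f1_$q.log 2>&1 &
done
# pfcount periodically prints received/dropped counters; comparing the
# logs afterwards shows which queue stalls first
wait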

Can this be a hardware issue?

cardigliano commented 9 months ago

@DerRealKeyser please provide:

DerRealKeyser commented 9 months ago

ntopngadmin@ntopng:/var/log$ cat /proc/net/pf_ring/dev/ens1f1/info
Name:          ens1f1
Index:         9
Address:       48:DF:37:1E:47:ED
Polling Mode:  NAPI/ZC
Promisc:       Enabled
Type:          Ethernet
Family:        Intel ixgbe 82599
TX Queues:     4
RX Queues:     4
Num RX Slots:  8192
Num TX Slots:  8192
RX Slot Size:  1536
TX Slot Size:  1536

ntopngadmin@ntopng:/var/log$ ethtool -i ens1f1
driver: ixgbe
version: 5.19.6
firmware-version: 0x800009e1, 1.3299.0
expansion-rom-version:
bus-info: 0000:08:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
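A quick way to separate NIC-level drops from pf_ring-level drops is to compare the driver's own counters with what PF_RING reports. A sketch; the exact ixgbe counter names vary between driver versions, so the grep pattern is an assumption:

# Adapter/driver error and drop counters straight from the NIC
ethtool -S ens1f1 | grep -Ei 'miss|drop|error'

# PF_RING's view of the same device (same file as shown above)
cat /proc/net/pf_ring/dev/ens1f1/info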

DerRealKeyser commented 9 months ago

I tried updating the NIC adapter firmware to the latest Intel firmware 1.32.xxx. That didn't help at all. Now all the interfaces only show a small percentage of the actual packets on each interface. When 1 Gbit is going through, NtopNG and the interfaces report about 50 Mbit combined.

This is just too buggy...

cardigliano commented 9 months ago

When you run pfcount on (all the queues of) the same interface, you are able to receive 1 Gbps (with the little packet loss you mentioned), right? If possible, I suggest running a different configuration in order to identify what is causing this:

DerRealKeyser commented 9 months ago

I have now done a host of tests.

1: Since my NIC adapter firmware update, neither pfcount nor NtopNG reports anything remotely like the actual number of packets received on the interfaces (combined across queues). When the physical interfaces combined are receiving between 1 and 3 Gbit, pfcount and NtopNG report around 250 Mbit combined, fairly evenly distributed across the 8 queues (30-40 Mbit each). And very interestingly: no packets lost (until the 2-15 minute event happens). So it is simply missing most of the packets altogether now.

2: I have tried reconfiguring the ZC drivers for RSS=1, RSS=2 and RSS=4, and the combined numbers are the same - a small percentage of the actual packets. The behavior is also the same: after 2-15 minutes, a queue, or THE queue (when running RSS=1), on ens1f1 starts dropping all packets permanently in NtopNG (CPU thread at 100%). pfcount shows a slight packet loss but carries on reporting newly received packets (it doesn't drop everything like NtopNG). In all of my tests so far it has always been a queue on ens1f1 that stops. Could it be related to the 10GbE DAC cable used? I have tested with and without flow control; that makes no difference.

3: Before my NIC firmware update, when it still reported true traffic/packet numbers, I was easily able to handle and analyze 4-5 Gbit of traffic on the two physical interfaces combined in NtopNG. That was without any CPU core going to 100% or any packets being reported as dropped - right up until the "fallout" event after 2-15 minutes :-)

4: Before going ZC, when I was still using the regular kernel drivers, I had the same issue with a queue or the whole ens1f1 interface starting to report all packets dropped after a while in NtopNG. That's why I'm thinking it is perhaps a hardware, link, or link config issue? (A couple of link-level checks are sketched below.)
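The link-level checks referenced in point 4 above, as a sketch (both are standard Linux tools, nothing ntopng-specific):

# Pause/flow-control settings on the capture port
ethtool -a ens1f1

# Kernel per-interface RX/TX error and drop counters
ip -s link show ens1f1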

DerRealKeyser commented 9 months ago

As a follow-up: I went ZC to increase performance, because in regular kernel mode I saw pretty high CPU usage on some cores while we were far from peak throughput, so I suspected the interface drop-out was related to overload rather than an actual problem.

cardigliano commented 9 months ago

Definitely a strange behaviour. Are you saying that pfcount receives all traffic for the first 2-15 minutes, and then reports 250 Mbit combined after that? Not sure I got this right.

DerRealKeyser commented 9 months ago

Not quite... Before my firmware update, pfcount and NtopNG both reported the actual "true" packet count for the two interfaces combined (across 8 queues). Even when the "fallout" occurred and some queues started dropping all packets, the packet count was still correct (the dropped packets were counted correctly). Now, with the new firmware, the packet count and throughput numbers are at a percentage of the actual counts. Most packets are simply not seen/counted - not even as dropped. When the "fallout" occurs, the problem remains the same: packets are counted as dropped, but the count is just a percentage of the actual amount.

Here's how the dashboard looks after a night. The three queues with drops are all dropping ALL packets (reporting no throughput). The actual packet numbers are much higher than this dashboard leads you to believe.

[screenshot]

DerRealKeyser commented 9 months ago

Here's one of the queues a long time after its "fallout". It has a CPU thread spinning at 100%:

[screenshot]

And as you can see, I now have three cores spinning at 100%, because 3 queues have fallen out: [screenshot]

DerRealKeyser commented 9 months ago

A little follow-up on this ticket: I'm now quite sure this is not hardware, but software (pf_ring/ntopng).

My first test was to change the uplink transceivers from each NIC port to my redundant switch stack. That didn't change anything. My second test was to swap the link cables between the ens1f0 and ens1f1 NIC ports, to see if it was the NIC. That moved the dropped-packets issue from ens1f1 to ens1f0 instead - behaving exactly as before, just on the other NIC port.

So it would seem it's some of the packets received from my switches that cause pf_ring/ntopng to malfunction (pf_ring by briefly dropping some packets, NtopNG by stopping to receive packets altogether on that queue and sending the thread to 100% CPU).

Any idea what that might be caused by? The switches are two Aruba 8360s in a VSX availability stack, and the frames forwarded onto each ntopng link are the packets each switch sends/receives on its respective link in a shared multi-chassis LAG group (a mirror of its respective port in the MC-LAG). Since the problem seems to "follow the sending switch", could it be related to how the primary or secondary switch in a VSX stack handles its half of an MC-LAG?
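In hindsight, one way to isolate a triggering flow like this is to capture a sample of the mirrored traffic and inspect it offline. A sketch, assuming the port is temporarily released from ZC so the kernel driver sees it:

# Capture full packets from the suspect port for offline analysis
tcpdump -i ens1f1 -s 0 -w /tmp/ens1f1_sample.pcap

# Given the eventual root cause, filtering on Modbus/TCP (TCP port 502)
# would have narrowed it down quickly
tcpdump -i ens1f1 -s 0 -w /tmp/modbus.pcap 'tcp port 502'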

cardigliano commented 9 months ago

@DerRealKeyser I think I should provide a debug build and attach to the ntopng process as soon as this is reproduced. If that's OK with you, please drop me an email to arrange this (cardigliano at ntop.org).

DerRealKeyser commented 8 months ago

Alfredo identified this issue as being caused by a particular Modbus traffic flow, and he then created a patch/fix and updated the public build to include it - all within a couple of days! Everything works perfectly now, and I can only say I have never been so impressed with the speed or professionalism of any support organisation before (looking at you, MS :-( )
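For anyone landing here later: once the patched build is published, picking it up is a normal package upgrade. A sketch, assuming ntopng and PF_RING were installed from the ntop package repository as in this setup (the pfring package name is an assumption):

sudo apt update
# Upgrade only the ntop packages to the build containing the fix
sudo apt install --only-upgrade ntopng pfring
sudo systemctl restart ntopng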

DerRealKeyser commented 8 months ago

Ticket Closed