opnsense / core

OPNsense GUI, API and systems backend
https://opnsense.org/
BSD 2-Clause "Simplified" License

Increased CPU load limits 10G throughput on DEC850 since upgrade to 24.7 #7661

Closed: svengrun closed this issue 1 month ago

svengrun commented 1 month ago


Describe the bug
I see a noticeable performance impact on 10G routing after upgrading my DEC850 to 24.7; going back to 24.1 solves the issue. On prior releases (including 24.1) I was able to get line-rate performance with relatively low CPU load across the DEC850's 8 cores. Since upgrading to 24.7 the network performance is inconsistent and no longer at line rate, and CPU cores spike to 100% during transfer tests.

To Reproduce
Using a device connected to the LAN (ax1), run a moderate network load against a WAN (ax0) target that can provide consistent performance close to 10G line speed. For my example I used:

iperf3 -c speedtest.sea11.us.leaseweb.net -p 5201-5210 -P10

which normally results in about 9.3 to 9.4 Gbits/sec and a well-distributed load across the CPU cores.

Running the same test on 24.7 results in only 8.8 Gbits/sec down and ~5 Gbits/sec up while maxing out most CPU cores.

Expected behavior
The DEC850 should maintain the same 10G routing performance after upgrading to 24.7, and CPU cores should not spike to 100% system load.

Describe alternatives you considered
Downgrading to 24.1 or 24.4 resolves the issue.

Environment
I am on a 10G internet connection and use a DEC850. ax0 is the WAN port and ax1 is used as LAN; ax1 carries several VLANs. No other ports are in use.

The recommended tuning parameters to get 10G performance are set:

dev.ax.0.iflib.override_nrxds = 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048
dev.ax.0.iflib.override_ntxds = 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048
dev.ax.0.rss_enabled = 1

Same settings for ax.1 as well. All CPU bug mitigations are turned off.
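For reference, the ax1 side mirrors the ax0 values above; written out (same values, only the device index changes), the tunables look like this:

dev.ax.1.iflib.override_nrxds = 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048
dev.ax.1.iflib.override_ntxds = 2048, 2048, 2048, 2048, 2048, 2048, 2048, 2048
dev.ax.1.rss_enabled = 1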

OPNsense 24.7 (amd64), DEC850 (stock 8-core configuration).

AdSchellevis commented 1 month ago

Can you share the output of a detailed top (at least from 24.7, but both versions would be even better)?

top -CHIPSn -d2 9999

Some netstat statistics would be useful to have as well:

netstat -Q
netstat -m
netstat -s

I'm also a bit curious whether the results vary when changing the number of flows in iperf3 (e.g. -P 4, -P 6, -P 8).
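Something along these lines would make the comparison easy to repeat (just a sketch; <server> stands for an iperf3 endpoint that can sustain close to 10G):

# hypothetical sweep over parallel stream counts, 30 seconds each
for p in 4 6 8 10; do
  echo "=== -P $p ==="
  iperf3 -c <server> -P "$p" -t 30
done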

A lot has changed upstream in FreeBSD 14.1, so some variation is not unusual.

svengrun commented 1 month ago

This is an interesting bug. I created a lab environment using a fresh install on a DEC850 to better debug this issue and to remove any WAN speed variability or other factors. The WAN side is connected to a server running speedtest -s, and the LAN side is connected to an iPad Pro M4 on 10G via a Thunderbolt NIC. This combination gives me rock-solid transfer speeds at the expected line rate when the issue is not happening.

I also created a sample configuration file for the DEC850 that I am using for all tests, which should make it easy to reproduce the issue. It uses the default password: test-config.xml.txt

My first tests showed full performance and made me doubt that there was an issue. However, after repeated tests the CPU load increased and speeds dropped with the exact same speedtest parameters.

I captured the requested details for various speedtest "P" settings while the issue was present:

-P 1: P1.txt (top -CHIPSn -d2 9999), NM1.txt (netstat -m), NQ1.txt (netstat -Q), NS1.txt (netstat -s)
-P 5: P5.txt (top -CHIPSn -d2 9999), NM5.txt (netstat -m), NQ5.txt (netstat -Q), NS5.txt (netstat -s)
-P 10: P10.txt (top -CHIPSn -d2 9999), NM10.txt (netstat -m), NQ10.txt (netstat -Q), NS10.txt (netstat -s)
-P 20: P20.txt (top -CHIPSn -d2 9999), NM20.txt (netstat -m), NQ20.txt (netstat -Q), NS20.txt (netstat -s)

Rebooted the router and got the speed back temporarily before it slowed down again. The logs at expected speed are here:

-P 20 after reboot, at the expected 9.41 Gbits/sec: R-P20.txt (top -CHIPSn -d2 9999), R-NM20.txt (netstat -m), R-NQ20.txt (netstat -Q), R-NS20.txt (netstat -s)

Note: I am not sure that a reboot is the only way to "cure" this. I had a series of low-speed test results associated with high CPU loads, but lowering the P parameter to 3 for one run did, at least once, bring the subsequent -P 20 run back up to speed.

Let me know what other tests you want me to run.

svengrun commented 1 month ago

Looking at the Px.txt files, I just noticed that the load they show does not match what top -P showed during the run; during the test runs the cores were at 100%. Is there a better way to capture that? The run after the reboot shows the correct core loads.
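Unless there is a better option, next time I could let top sample repeatedly in batch mode for the whole transfer and redirect it to a file, e.g. one sample per second for 60 displays:

# capture thread/CPU state once per second while the iperf3 run is in flight
top -CHIPSn -d 60 -s 1 9999 > top-during-run.txt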

AdSchellevis commented 1 month ago

To be honest, with iperf3 tests I've always seen variations during testing, partly caused by congestion and partly by misalignment of flows (ideally, on an 8-core system 8 flows would be distributed evenly, but in reality that's not always the case, for various reasons).

When I'm at the office I'll put one of our machines in between our standard testbed hosts to see if there's anything unusual. Just one question: your comments seem to suggest CPU consumption increases over time without a reboot; is that indeed the case?

svengrun commented 1 month ago

Yes, CPU consumption is lower right after a reboot, and I also see the expected speed then. After a while it slows down. Here are test runs with -P 8.

In the "fault" state is only get 8.64 Gbits/sec and CPU cores spike to 100%:

last pid: 87053;  load averages: 3.11, 1.04, 0.50  up 1+01:11:40  00:06:22
349 threads:   21 running, 294 sleeping, 34 waiting
CPU 0:  0.0% user,  0.0% nice,  100% system,  0.0% interrupt,  0.0% idle
CPU 1:  0.4% user,  0.0% nice, 11.4% system,  5.1% interrupt, 83.1% idle
CPU 2:  0.0% user,  0.0% nice,  100% system,  0.0% interrupt,  0.0% idle
CPU 3:  0.0% user,  0.0% nice, 82.0% system,  0.0% interrupt, 18.0% idle
CPU 4:  0.0% user,  0.0% nice,  100% system,  0.0% interrupt,  0.0% idle
CPU 5:  0.0% user,  0.0% nice,  100% system,  0.0% interrupt,  0.0% idle
CPU 6:  0.0% user,  0.0% nice,  100% system,  0.0% interrupt,  0.0% idle
CPU 7:  0.0% user,  0.0% nice,  100% system,  0.0% interrupt,  0.0% idle
Mem: 80M Active, 216M Inact, 1693M Wired, 56K Buf, 13G Free
ARC: 1068M Total, 91M MFU, 833M MRU, 1684K Anon, 20M Header, 120M Other
     840M Compressed, 3846M Uncompressed, 4.58:1 Ratio
Swap: 8192M Total, 8192M Free

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME     CPU COMMAND
    0 root        -60    -     0B  2144K CPU4     4   1:50 100.00% kernel{if_io
    0 root        -60    -     0B  2144K CPU7     7   1:21 100.00% kernel{if_io
    0 root        -60    -     0B  2144K CPU2     2   1:20 100.00% kernel{if_io
    0 root        -60    -     0B  2144K CPU5     5   1:04 100.00% kernel{if_io
    0 root        -60    -     0B  2144K CPU0     0   0:57 100.00% kernel{if_io
    0 root        -60    -     0B  2144K CPU6     6   0:46 100.00% kernel{if_io
   11 root        187 ki31     0B   128K RUN      1  25.1H  86.06% idle{idle: c
    0 root        -60    -     0B  2144K CPU3     3   1:38  82.33% kernel{if_io

After a reboot, running the exact same test again gives me higher speeds (9.4 Gbits/sec) and lower CPU load:

last pid: 79445;  load averages: 1.22, 0.34, 0.13  up 0+00:01:26  00:11:18
347 threads:   10 running, 298 sleeping, 39 waiting
CPU 0:  0.0% user,  0.0% nice,  7.1% system,  0.0% interrupt, 92.9% idle
CPU 1:  0.0% user,  0.0% nice, 76.0% system,  0.0% interrupt, 24.0% idle
CPU 2:  0.0% user,  0.0% nice, 69.3% system,  0.0% interrupt, 30.7% idle
CPU 3:  0.0% user,  0.0% nice,  0.4% system,  2.4% interrupt, 97.2% idle
CPU 4:  0.0% user,  0.0% nice, 38.6% system,  0.0% interrupt, 61.4% idle
CPU 5:  0.0% user,  0.0% nice, 39.8% system,  0.0% interrupt, 60.2% idle
CPU 6:  0.0% user,  0.0% nice, 41.7% system,  0.0% interrupt, 58.3% idle
CPU 7:  0.0% user,  0.0% nice, 35.0% system,  0.0% interrupt, 65.0% idle
Mem: 141M Active, 125M Inact, 581M Wired, 56K Buf, 15G Free
ARC: 199M Total, 45M MFU, 108M MRU, 11M Anon, 1371K Header, 34M Other
     113M Compressed, 235M Uncompressed, 2.09:1 Ratio
Swap: 8192M Total, 8192M Free

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME     CPU COMMAND
   11 root        187 ki31     0B   128K RUN      3   1:24  98.17% idle{idle: c
   11 root        187 ki31     0B   128K CPU0     0   1:24  94.64% idle{idle: c
    0 root        -60    -     0B  2144K -        1   0:11  73.00% kernel{if_io
    0 root        -60    -     0B  2144K CPU2     2   0:10  69.28% kernel{if_io
   11 root        187 ki31     0B   128K RUN      4   1:13  65.14% idle{idle: c
   11 root        187 ki31     0B   128K CPU7     7   1:18  64.77% idle{idle: c
   11 root        187 ki31     0B   128K CPU5     5   1:18  61.57% idle{idle: c
   11 root        187 ki31     0B   128K RUN      6   1:17  57.16% idle{idle: c

svengrun commented 1 month ago

... and just for fun, here is a run in the "non-fault" state with -P 100, pushing 9.40 Gbits/sec with relatively low CPU load. This does not look like a congestion issue to me; at least it runs fine when the router has not hit this bug, and the performance below is the same as I saw on prior releases.

last pid: 15071;  load averages: 1.86, 1.09, 0.57  up 0+00:10:47  00:20:40
346 threads:   12 running, 295 sleeping, 39 waiting
CPU 0:  0.0% user,  0.0% nice, 51.2% system,  0.4% interrupt, 48.4% idle
CPU 1:  0.0% user,  0.0% nice, 44.1% system,  0.0% interrupt, 55.9% idle
CPU 2:  0.0% user,  0.0% nice, 47.6% system,  0.0% interrupt, 52.4% idle
CPU 3:  0.0% user,  0.0% nice, 44.9% system,  0.0% interrupt, 55.1% idle
CPU 4:  0.0% user,  0.0% nice, 43.7% system,  0.0% interrupt, 56.3% idle
CPU 5:  0.0% user,  0.0% nice, 38.6% system,  0.0% interrupt, 61.4% idle
CPU 6:  0.4% user,  0.0% nice, 42.5% system,  0.0% interrupt, 57.1% idle
CPU 7:  0.0% user,  0.0% nice, 42.9% system,  0.0% interrupt, 57.1% idle
Mem: 139M Active, 126M Inact, 635M Wired, 56K Buf, 14G Free
ARC: 189M Total, 41M MFU, 112M MRU, 516K Anon, 1363K Header, 34M Other
     116M Compressed, 256M Uncompressed, 2.20:1 Ratio
Swap: 8192M Total, 8192M Free
ax1: Link is UP - 10Gbps/Full - flow control off
ax1: Link is DOWN

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME     CPU COMMAND
   11 root        187 ki31     0B   128K RUN      6   9:48  58.93% idle{idle: c
   11 root        187 ki31     0B   128K RUN      3   9:49  58.48% idle{idle: c
   11 root        187 ki31     0B   128K RUN      4   9:38  58.40% idle{idle: c
   11 root        187 ki31     0B   128K CPU5     5   9:27  57.62% idle{idle: c
   11 root        187 ki31     0B   128K CPU7     7   9:37  57.44% idle{idle: c
   11 root        187 ki31     0B   128K CPU1     1   9:37  55.85% idle{idle: c
   11 root        187 ki31     0B   128K CPU2     2   9:36  54.03% idle{idle: c
   11 root        187 ki31     0B   128K RUN      0   9:40  51.61% idle{idle: c

AdSchellevis commented 1 month ago

On my end, with the same hardware, it looks rather normal after a clean boot:

[  5]  33.00-33.62  sec  87.0 MBytes  1.17 Gbits/sec    0    506 KBytes       
[  7]  33.00-33.62  sec  86.9 MBytes  1.17 Gbits/sec    0    493 KBytes       
[  9]  33.00-33.62  sec  87.1 MBytes  1.17 Gbits/sec    0    476 KBytes       
[ 11]  33.00-33.62  sec  87.5 MBytes  1.17 Gbits/sec    0    839 KBytes       
[ 13]  33.00-33.62  sec  87.5 MBytes  1.17 Gbits/sec    0    819 KBytes       
[ 15]  33.00-33.62  sec  87.5 MBytes  1.17 Gbits/sec    0    472 KBytes       
[ 17]  33.00-33.62  sec  88.1 MBytes  1.18 Gbits/sec    0    638 KBytes       
[ 19]  33.00-33.62  sec  86.9 MBytes  1.17 Gbits/sec    0    615 KBytes       
[SUM]  33.00-33.62  sec   698 MBytes  9.38 Gbits/sec    0             

Usually I test with slightly smaller packets using the following iperf3 command, with the firewall positioned between two Linux hosts:

iperf3 -c  x.x.x.x -P 8 -Z -t 5000 -M 1350
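For context, the lower MSS (-M 1350) means more packets per second for the same bitrate, which stresses per-packet forwarding a bit harder than full-size frames. The other Linux host just runs a plain listener, presumably something like:

iperf3 -s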

Load distribution varies between tests (when stopping/starting), but that's to be expected. Some examples during a run:

352 threads:   15 running, 314 sleeping, 23 waiting
CPU 0:  0.0% user,  0.0% nice, 44.7% system,  0.0% interrupt, 55.3% idle
CPU 1:  0.0% user,  0.0% nice, 37.0% system,  0.0% interrupt, 63.0% idle
CPU 2:  0.0% user,  0.0% nice,  100% system,  0.0% interrupt,  0.0% idle
CPU 3:  0.4% user,  0.0% nice,  5.7% system,  0.0% interrupt, 93.9% idle
CPU 4:  0.0% user,  0.0% nice, 90.8% system,  0.0% interrupt,  9.2% idle
CPU 5:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU 6:  0.0% user,  0.0% nice, 96.6% system,  0.0% interrupt,  3.4% idle
CPU 7:  0.4% user,  0.0% nice,  4.6% system,  0.0% interrupt, 95.0% idle
353 threads:   13 running, 316 sleeping, 24 waiting
CPU 0:  0.6% user,  0.0% nice, 34.6% system,  0.0% interrupt, 64.8% idle
CPU 1:  0.6% user,  0.0% nice, 34.6% system,  0.0% interrupt, 64.8% idle
CPU 2:  0.5% user,  0.0% nice, 71.5% system,  0.0% interrupt, 28.1% idle
CPU 3:  0.9% user,  0.0% nice, 17.7% system,  0.0% interrupt, 81.3% idle
CPU 4:  0.6% user,  0.0% nice, 55.8% system,  0.0% interrupt, 43.6% idle
CPU 5:  1.2% user,  0.0% nice,  3.9% system,  0.0% interrupt, 95.0% idle
CPU 6:  0.4% user,  0.0% nice, 56.3% system,  0.0% interrupt, 43.3% idle
CPU 7:  0.9% user,  0.0% nice,  9.9% system,  0.0% interrupt, 89.2% idle

I'll keep it running for a while to see if there's anything unusual, but at first glance it doesn't look very different on my end between versions.

svengrun commented 1 month ago

Your results match my tests when the system is in the "normal" rather than the "fault" state. I am trying to find repro steps that trigger the condition but have not found them yet. I just had the router running "normal" for hours; then I did a reboot and now cores spike to 100% running the same tests.

svengrun commented 1 month ago

I did more testing today and found that lower -P values show a significant difference in both performance and CPU load between 24.1 and 24.7. Load spikes to 100% on several cores on 24.7 and the throughput is noticeably slower as well. At about -P 6 performance recovered and was more in line with 24.1. This was the case even in what I consider the "no fault" state, where runs with more than 6 streams perform normally.

@AdSchellevis Can you try to repro this? It is very easy to do by using my test config, booting from a USB key and using the configuration importer. Love that feature... :-)

AdSchellevis commented 1 month ago

We have our own tests; load might vary a bit between versions, but that's not really something we can fix, I'm afraid, unless we can put our finger on a clear bug somewhere. Over the years we have seen more variation; overall, the consistency of the performance does seem to improve, to be honest.

There is, however, a logical explanation for why fewer flows result in a perceived higher load on some cores: the chance that two or more of them end up on the same core is simply rather high.
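A rough back-of-the-envelope check illustrates this (a simplification that assumes each flow hashes independently and uniformly onto one of the 8 cores):

# chance that N flows land on N distinct cores out of 8
echo "scale=4; (8*7*6*5)/(8^4)" | bc -l              # 4 flows: ~0.41
echo "scale=4; (8*7*6*5*4*3*2*1)/(8^8)" | bc -l      # 8 flows: ~0.0024

So with 8 flows some doubling up on a core is almost guaranteed, and with only a handful of flows a single overloaded core can easily become the bottleneck.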

svengrun commented 1 month ago

I understand that this is a complex issue. I ran more tests, and the best way to describe the "bug" is that the load distribution of if_io_tqg is different between 24.1 and 24.7. In 24.7 the new FreeBSD version is very inconsistent in how network load scales across cores. Running with a few streams shows the issue better; e.g., I ran -P 4 -R and got results between 5.5 and 9 Gbits/sec. Looking at top -CHIPS at the same time shows that the higher-speed results were associated with a better distribution across CPU cores, while the lowest-speed runs had 100% of the load on a single core.

svengrun commented 1 month ago

I finally found the issue and if you agree, we can close this bug.

My configuration (also the example attached) does have RSS enabled for the ax0 and ax1 NICs (dev.ax.X.rss_enabled = 1) but was missing net.inet.rss.enabled = 1. This had no noticeable performance impact on 24.1, but it certainly does on 24.7.
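Written out as tunables, the working combination now looks like this (the net.inet.rss.bits line is optional and just my assumption for an 8-core box; the rest is straight from the config above):

# per-NIC RSS in the axgbe driver (was already set)
dev.ax.0.rss_enabled = 1
dev.ax.1.rss_enabled = 1
# kernel-wide RSS so netisr spreads work across cores (this was the missing piece)
net.inet.rss.enabled = 1
# optional assumption: 2^3 = 8 RSS buckets to match the 8 cores
net.inet.rss.bits = 3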

Thinking more about how loads are distributed, I finally noticed that netisr did not show up in any of the top -CHIPS outputs. Adding the above tunable turned it on, and I can now see it doing what it is supposed to do. :-)

top -CHIPS with net.inet.rss.enabled = 1:

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME     CPU COMMAND
   11 root        187 ki31     0B   128K CPU0     0   8:33  82.10% idle{idle: cpu0}
   11 root        187 ki31     0B   128K RUN      4   8:52  82.04% idle{idle: cpu4}
   11 root        187 ki31     0B   128K CPU5     5   8:27  80.28% idle{idle: cpu5}
   12 root        -60    -     0B   496K CPU7     7   1:43  75.98% intr{swi1: netisr 7}
   12 root        -60    -     0B   496K CPU2     2   1:25  75.76% intr{swi1: netisr 2}
   12 root        -60    -     0B   496K WAIT     6   0:56  61.61% intr{swi1: netisr 6}
   11 root        187 ki31     0B   128K RUN      1   7:33  51.29% idle{idle: cpu1}
   11 root        187 ki31     0B   128K RUN      3   9:19  48.78% idle{idle: cpu3}
   12 root        -60    -     0B   496K WAIT     1   2:38  36.39% intr{swi1: netisr 1}
   12 root        -60    -     0B   496K WAIT     3   0:45  34.77% intr{swi1: netisr 3}
    0 root        -60    -     0B  2144K -        6   0:57  34.42% kernel{if_io_tqg_6}
   11 root        187 ki31     0B   128K RUN      2   8:32  21.39% idle{idle: cpu2}
    0 root        -60    -     0B  2144K CPU4     4   1:03  16.90% kernel{if_io_tqg_4}
    0 root        -60    -     0B  2144K -        0   0:40  16.76% kernel{if_io_tqg_0}
    0 root        -60    -     0B  2144K -        3   0:27  16.13% kernel{if_io_tqg_3}
    0 root        -60    -     0B  2144K RUN      7   0:32  14.63% kernel{if_io_tqg_7}
    0 root        -60    -     0B  2144K -        5   0:35  12.23% kernel{if_io_tqg_5}
    0 root        -60    -     0B  2144K -        1   0:24  11.91% kernel{if_io_tqg_1}
   11 root        187 ki31     0B   128K RUN      7   8:18   9.35% idle{idle: cpu7}
   12 root        -60    -     0B   496K WAIT     5   1:34   6.67% intr{swi1: netisr 5}
   11 root        187 ki31     0B   128K RUN      6   8:31   3.97% idle{idle: cpu6}
    0 root        -60    -     0B  2144K -        2   0:38   2.74% kernel{if_io_tqg_2}

AdSchellevis commented 1 month ago

Results vary with RSS: the driver itself uses RSS when enabled (which is always advisable and enabled by default), while net.inet.rss.enabled balances netisr as well, which in some cases offers better results. We usually test without it on our end, but load distribution is certainly different with it.
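For anyone landing here later, a quick way to check which mode a box is running in (a sketch using standard FreeBSD tools):

# confirm whether kernel-wide RSS is active and how netisr is configured
sysctl net.inet.rss.enabled net.inet.rss.bits
netstat -Q | head -20        # netisr dispatch policy and per-CPU workstreams
# during a transfer, netisr threads should show up next to the if_io_tqg kernel threads
top -CHIPSn -d 2 9999 | egrep 'netisr|if_io_tqg'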

I'll close the issue as suggested.