CONFIG_OPTIMIZE_INLINING reduces network performance on 5.4 stable

dcrawford1 commented 2 years ago

I discovered that if I revert ac7c3e4ff401b30 ("compiler: enable CONFIG_OPTIMIZE_INLINING forcibly") and disable CONFIG_OPTIMIZE_INLINING, network performance improves, at least on our slow MPC8248. The following perf captures were taken on kernel 5.4.189 built with gcc 7.5.0.

Flamegraph of iperf3 running with CONFIG_OPTIMIZE_INLINING enabled (the mainline default): iperf3 bandwidth 85 Mb/s, 70 Mb/s with perf running (image: perf-inline-optimization).

Flamegraph of iperf3 running with CONFIG_OPTIMIZE_INLINING disabled: iperf3 bandwidth 95 Mb/s, 80 Mb/s with perf running (image: perf-no-inline-optimization).

When CONFIG_OPTIMIZE_INLINING is disabled, the CPU usage of ksoftirqd is much lower.

chleroy commented 2 years ago

Can you provide your .config?

Do you build it optimised for speed or optimised for size?

Can you provide a text version of your perf reports, both with and without call graph?

Did you try with the latest mainline kernel? I have fixed several inlining issues recently, mainly around checksum calculations.

dcrawford1 commented 2 years ago

This attachment contains the .config and perf text versions. mpc8248-inline-optimization-test.tar.gz

The last time I tried the 5.14 kernel it was too big to fit into our mtd partition. I can try disabling some unrelated parts to make it fit. Are there any commits I could try to cherry-pick on top of the 5.4.x branch?

dcrawford1 commented 2 years ago

All my previous tests were built with optimize for size. I ran a few more iperf3 tests with different options:

- optimize_inlining=y, optimize_size: kernel size 4158019, iperf3 bandwidth 84 Mb/s
- optimize_inlining=y, optimize_speed: kernel size 519840392, iperf3 bandwidth 92 Mb/s
- optimize_inlining=n, optimize_size: kernel size 4297283, iperf3 bandwidth 93 Mb/s
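
To put the two optimize_size builds side by side, the relative change in bandwidth and image size can be worked out as below. This is only a small arithmetic sketch: the helper name `pct_change` is made up for illustration, the inputs are the single-run figures quoted above, and kernel size is assumed to be in bytes.

```c
#include <stdio.h>

/* Hypothetical helper: relative change from `before` to `after`, in percent. */
static double pct_change(double before, double after)
{
	return (after - before) / before * 100.0;
}

/* Prints the deltas between optimize_inlining=y and =n (both optimize_size). */
static void print_inlining_deltas(void)
{
	/* Figures copied from the list above; single measurements, not averages. */
	const double bw_inlining_y = 84.0;        /* Mb/s */
	const double bw_inlining_n = 93.0;        /* Mb/s */
	const double size_inlining_y = 4158019.0; /* assumed bytes */
	const double size_inlining_n = 4297283.0; /* assumed bytes */

	printf("bandwidth change:   %+.1f%%\n",
	       pct_change(bw_inlining_y, bw_inlining_n));
	printf("kernel size change: %+.1f%%\n",
	       pct_change(size_inlining_y, size_inlining_n));
}
```

Calling `print_inlining_deltas()` from a `main()` shows roughly a tenth more bandwidth for a few percent more image size, which matters here because the image has to fit in the mtd partition.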

dcrawford1 commented 2 years ago

A few more interesting things. With CONFIG_OPTIMIZE_INLINING disabled, my kernel hangs right after devtmpfs is mounted, before "Freeing unused kernel memory" is printed. A normal boot shows:

[    0.956207] devtmpfs: mounted
[    0.961513] Freeing unused kernel memory: 124K

But if I disable memory control groups, the problem goes away and the kernel starts fine (that is how I ran the previous perf tests with CONFIG_OPTIMIZE_INLINING disabled).

Strangely, I can also eliminate the problem with memory control groups still enabled if I simply enable page allocation debugging (CONFIG_DEBUG_PAGEALLOC and CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT). This debugging never reports any problems, but simply having the feature enabled is enough to get past the hang right before "Freeing unused kernel memory".

I encounter the same hang regardless of whether CONFIG_PPC_KUEP, CONFIG_PPC_KUAP, or CONFIG_PPC_KUAP_DEBUG is enabled or disabled.

This hang right before "Freeing unused kernel memory" is the same symptom that I saw during my kernel 5.2 testing in issue #258. The fix for that was supposedly the commit "powerpc/32s: Fix bad_kuap_fault()". I verified that this commit is in the 5.4.189 kernel I am testing, so it appears that issue is not completely resolved.

chleroy commented 2 years ago

Related patches:

- 328e7e487a46 powerpc: force inlining of csum_partial() to avoid multiple csum_partial() with GCC10
- 4423eff71ca6 powerpc: Force inlining of csum_add()
- 5486f5bf790b net: Force inlining of checksum functions in net/checksum.h

In fact, you can simply change every static inline to static __always_inline in include/net/checksum.h and arch/powerpc/include/asm/checksum.h.
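
As a rough illustration of that change, here is a userspace sketch and not the kernel's actual header code. `ALWAYS_INLINE` is a hypothetical stand-in for the kernel's `__always_inline` macro, the function names are made up, and plain `uint32_t` is used where the kernel has its own checksum types. The two functions compute the same thing; they differ only in whether GCC is allowed to decide against inlining.

```c
#include <stdint.h>

/* Hypothetical userspace stand-in for the kernel's __always_inline macro. */
#define ALWAYS_INLINE inline __attribute__((__always_inline__))

/*
 * Plain "static inline" is only a hint: the compiler may still emit an
 * out-of-line copy and call it, which costs a function call in a hot path.
 */
static inline uint32_t csum_add_hint(uint32_t csum, uint32_t addend)
{
	uint32_t res = csum + addend;

	/* Fold the carry back in (ones' complement addition). */
	return res + (res < addend);
}

/*
 * With the always_inline attribute, GCC inlines the body at every call
 * site instead of applying its own size/speed heuristics.
 */
static ALWAYS_INLINE uint32_t csum_add_forced(uint32_t csum, uint32_t addend)
{
	uint32_t res = csum + addend;

	/* Same carry fold as above; only the inlining policy differs. */
	return res + (res < addend);
}
```

For example, `csum_add_forced(0xFFFFFFFFu, 1u)` wraps around and folds the carry back in, giving 1. Whether the plain `static inline` version actually ends up out of line depends on the compiler version and optimisation level, so comparing the disassembly of both builds is the way to confirm it.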

chleroy commented 2 years ago

> This attachment contains the .config and perf text versions. mpc8248-inline-optimization-test.tar.gz
>
> The last time I tried the 5.14 kernel it was too big to fit into our mtd partition. I can try disabling some unrelated parts to make it fit. Are there any commits I could try to cherry-pick on top of the 5.4.x branch?

How do you boot your target? Do you use U-Boot? Is your kernel compressed? If it's just gzipped, can you use lzma instead?

Otherwise, are you able to download the kernel at boot through tftp?

dcrawford1 commented 2 years ago

I cherry-picked these commits onto 5.4.189 (5486f5bf790b was already in 5.4.189):

328e7e487a46 powerpc: force inlining of csum_partial() to avoid multiple csum_partial() with GCC10

4423eff71ca6 powerpc: Force inlining of csum_add()

The iperf3 result improved slightly, from ~85 Mb/s to ~87 Mb/s, and the ksoftirqd CPU usage stayed the same at about 10%.

I was able to boot 5.15.35 (using a compressed kernel), and the iperf3 performance was also ~87 Mb/s. But the ksoftirqd CPU usage was huge, ~25%.

I tried the latest linux master, but could not boot due to this error:

ERROR: Failed to allocate 0x100 bytes below 0x800000.
ERROR with allocation of cmdline

At this point, I think it is best to stick with the 4.19 kernel, which gives ~95 Mb/s bandwidth and 1% ksoftirqd CPU usage.

CONFIG_OPTIMIZE_INLINING reduces network performance on 5.4 stable #406