Open dcrawford1 opened 2 years ago
Can you provide your .config?
Do you build it optimised for speed or optimised for size?
Can you provide a text version of your perf reports, both with and without call graph?
Did you try with the latest mainline kernel? I have fixed several inlining issues recently, mainly around checksum calculations.
This attachment contains the .config and perf text versions. mpc8248-inline-optimization-test.tar.gz
The last time I tried the 5.14 kernel it was too big to fit into our mtd partition. I can try disabling some unrelated parts to make it fit. Are there any commits I could try to cherry-pick on top of the 5.4.x branch?
All my tests previously were with optimize for size. I ran a few more iperf tests with different options:
A few more interesting things. With CONFIG_OPTIMIZE_INLINING disabled, my kernel hangs right after devtmpfs is mounted but before "Freeing unused kernel memory" is printed:
[ 0.956207] devtmpfs: mounted
[ 0.961513] Freeing unused kernel memory: 124K
But if I disable memory control groups, the problem goes away and the kernel starts fine (that is how I ran the previous perf tests with CONFIG_OPTIMIZE_INLINING disabled).
Strangely, I can also eliminate the problem with memory control groups still enabled if I simply enable the page allocation debugging options CONFIG_DEBUG_PAGEALLOC and CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT. This debugging never reports any problems, but simply having this kernel feature enabled is enough to get past the hang right before "Freeing unused kernel memory".
I encounter the same hang regardless of whether CONFIG_PPC_KUEP, CONFIG_PPC_KUAP, or CONFIG_PPC_KUAP_DEBUG is enabled or disabled.
This hang right before "Freeing unused kernel memory" is the same symptom that I saw during my kernel 5.2 testing in issue #258. The fix for that was supposedly commit "powerpc/32s: Fix bad_kuap_fault()". It appears that issue is not completely resolved; I verified that this commit is in the 5.4.189 kernel I am testing.
Related patches:
- 328e7e487a46 powerpc: force inlining of csum_partial() to avoid multiple csum_partial() with GCC10
- 4423eff71ca6 powerpc: Force inlining of csum_add()
- 5486f5bf790b net: Force inlining of checksum functions in net/checksum.h
In fact, what you can do is simply replace every static inline with static __always_inline in include/net/checksum.h and arch/powerpc/include/asm/checksum.h
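As a rough illustration of that substitution, here is a minimal userspace sketch. It is not the kernel source: `csum_add_hint` and `csum_add_forced` are hypothetical names, plain `uint32_t` stands in for the kernel's `__wsum`, and the `__always_inline` fallback definition assumes GCC or Clang.

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's __always_inline (assumes GCC/Clang).
 * Some C libraries already define it, hence the guard. */
#ifndef __always_inline
#define __always_inline inline __attribute__((__always_inline__))
#endif

/* Before: plain "static inline" is only a hint. The compiler, especially
 * when optimising for size, may emit an out-of-line copy and call it. */
static inline uint32_t csum_add_hint(uint32_t csum, uint32_t addend)
{
	uint32_t res = csum + addend;

	/* Ones' complement addition: fold the carry back in. */
	return res + (res < addend);
}

/* After: "static __always_inline" asks the compiler to inline the body
 * at every call site. The arithmetic is identical. */
static __always_inline uint32_t csum_add_forced(uint32_t csum, uint32_t addend)
{
	uint32_t res = csum + addend;

	return res + (res < addend);
}
```

Both functions return the same values; only the inlining request differs, so any performance effect has to be checked in the generated code or with perf.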
How do you boot your target? Do you use U-Boot? Is your kernel compressed? If it's just gzipped, can you use lzma instead?
Otherwise, are you able to download the kernel at boot through tftp?
So, I cherry-picked these commits onto 5.4.189 (5486f5bf790b was already in 5.4.189):
- 328e7e487a46 powerpc: force inlining of csum_partial() to avoid multiple csum_partial() with GCC10
- 4423eff71ca6 powerpc: Force inlining of csum_add()
The iperf3 test improved slightly from ~85 Mb/s to ~87 Mb/s, and the ksoftirqd CPU usage stayed the same at about 10%.
I was able to boot 5.15.35 (using a compressed kernel), and the iperf3 performance was also ~87 Mb/s. But the ksoftirqd CPU usage was huge, ~25%.
I tried the latest linux master, but could not boot due to this error:
ERROR: Failed to allocate 0x100 bytes below 0x800000.
ERROR with allocation of cmdline
At this point, I think it is best to stick with the 4.19 kernel, which gives ~95 Mb/s bandwidth and 1% ksoftirqd CPU usage.
I discovered that if I revert
ac7c3e4ff401b30 compiler: enable CONFIG_OPTIMIZE_INLINING forcibly
and disable CONFIG_OPTIMIZE_INLINING, network performance improves, at least on our slow MPC8248. The following perf captures were taken on kernel 5.4.189 built with gcc 7.5.0.
Flamegraph of iperf3 running with CONFIG_OPTIMIZE_INLINING enabled (this is the mainline default): iperf3 bandwidth 85 Mb/s, 70 Mb/s (with perf running)
Flamegraph of iperf3 running with CONFIG_OPTIMIZE_INLINING disabled: iperf3 bandwidth 95 Mb/s, 80 Mb/s (with perf running)
When CONFIG_OPTIMIZE_INLINING is disabled, the CPU usage for ksoftirqd is much lower.
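For readers unfamiliar with the option, the sketch below shows roughly how it works in kernels that predate the reverted commit: with the option off, the `inline` keyword is redefined to force inlining, and with it on, inlining is left to the compiler. This is a simplified userspace approximation, not the kernel header. `SKETCH_OPTIMIZE_INLINING`, `sketch_inline` and `sketch_fold` are hypothetical names, and the attribute assumes GCC or Clang.

```c
#include <stdint.h>

/* Hypothetical stand-in for CONFIG_OPTIMIZE_INLINING.
 * Uncomment to let the compiler decide whether to inline. */
/* #define SKETCH_OPTIMIZE_INLINING 1 */

#ifdef SKETCH_OPTIMIZE_INLINING
/* Option enabled: "inline" is a hint only. */
#define sketch_inline inline
#else
/* Option disabled: every "inline" function is force-inlined. */
#define sketch_inline inline __attribute__((__always_inline__))
#endif

/* Fold a 32-bit ones' complement sum down to 16 bits. Small helpers like
 * this sit on the per-packet path, where an out-of-line call can cost
 * more than the body itself on a slow core. */
static sketch_inline uint16_t sketch_fold(uint32_t sum)
{
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}
```

The result of `sketch_fold` is the same in both configurations. What changes is whether call sites contain the body or a branch to a separate copy, which is what the two flamegraphs above compare at kernel scale.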