namniav opened this issue 2 years ago
@llvm/issue-subscribers-backend-aarch64
So IIUC the issue is with upstream Clang and Apple Clang doesn't have the runtime difference? Could you share the assembly generated by both, plus the Apple Clang version?
assembly-by-upstream-Clang.txt assembly-by-Apple-Clang.txt
❯ /usr/bin/clang --version
Apple clang version 13.0.0 (clang-1300.0.29.3)
Target: arm64-apple-darwin21.4.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Output of the program generated by Apple Clang:
loop <2> checksum=4255083000 time=75ms
unroll<2> checksum=4255083000 time=48ms
loop <3> checksum=4255083000 time=55ms
unroll<3> checksum=4255083000 time=55ms
So IIUC the issue is with upstream Clang and Apple Clang doesn't have the runtime difference?
@fhahn I think the main issue here is that I am not comparing upstream Clang with Apple Clang, but comparing upstream Clang with itself.
I'm not that surprised that it makes a bad decision; the scheduling model it uses is for the Apple A7 (Cyclone) from 2013, which probably has very different characteristics. Hopefully Apple can upstream a more accurate model. We see similar cases, but in the opposite direction, in Julia: for some loops it only does a 2x unroll where an 8x unroll is almost 4x faster.
Can you try after this change? https://reviews.llvm.org/D119788 (you might have already)
Yeah, it was with that. I can also see a difference when compiling a simple C++ reduction with Apple Clang vs. upstream Clang, where Apple's does a 4x unroll but normal Clang does 2x. I suspect the issue is with https://github.com/llvm/llvm-project/blob/1534177f8f7edd83083ceda7c14d6d40cc872c6e/llvm/lib/Target/AArch64/AArch64.td#L1200-L1203, where the scheduling model used is the one for the Apple A7 (Cyclone), https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64SchedCyclone.td, which leads it to make suboptimal decisions with respect to unrolling.
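A reduction of roughly the following shape (a hypothetical placeholder, since the source used for that comparison is not attached here) can be used to compare the unroll factor each compiler picks at -O3:

```cpp
// Hypothetical reduction loop (placeholder; not the exact source mentioned above).
// Compiling it at -O3 for arm64 with Apple Clang and with upstream Clang, then
// inspecting the assembly, shows which interleave/unroll factor each one chooses.
#include <cstddef>
#include <cstdint>

std::uint32_t reduce(const std::uint8_t *data, std::size_t n) {
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += data[i];   // simple sum reduction; trip count unknown at compile time
    return sum;
}
```

Compiling this with clang++ -O3 -S under each toolchain and diffing the inner loop is enough to see the chosen unroll factor.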
Output on my M1 MacBook Air (compiled with clang++ -Wall -Wextra -Werror -std=c++20 -O3 -fno-lto test.cpp):

Issue 1: Compared to cond_unroll<3>, cond_loop<3> makes conditional_sum 3x faster! I don't see any difference in the source code except that cond_unroll<3> unrolls the trivial loop in cond_loop<3>.

Issue 2: Compared to cond_loop<2>, cond_loop<3> makes conditional_sum 3x faster! The difference is that cond_loop<3> has an additional positive interval. But for positive signed char data[i], sum += data[i] is equivalent to sum += data[i] & 0xff, so why does adding a useless positive interval make conditional_sum 3x faster?

Note that Clang on my desktop PC (Ubuntu 20.04 running on an Intel CPU) doesn't have this issue, so it might be platform-specific. Apple's Clang doesn't have this issue either.
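test.cpp is not reproduced in this excerpt; the sketch below is a reconstruction from the description above of how cond_loop<N> and cond_unroll<N> are presumably shaped (the interval bounds, element types, and helper names are assumptions):

```cpp
// Reconstruction sketch, not the attached test.cpp: conditional_sum adds data[i]
// whenever it lies in one of N value intervals. cond_loop<N> checks the intervals
// with a trivial inner loop; cond_unroll<N> writes the same checks out by hand.
// The interval bounds below are placeholders.
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>

template <std::size_t N>
constexpr std::array<std::pair<signed char, signed char>, N> kIntervals{};  // placeholder bounds

template <std::size_t N>
std::uint32_t cond_loop(const signed char *data, std::size_t n) {
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        for (auto [lo, hi] : kIntervals<N>)            // trivial loop over the intervals
            if (data[i] >= lo && data[i] <= hi)
                sum += data[i];
    return sum;
}

template <std::size_t N>
std::uint32_t cond_unroll(const signed char *data, std::size_t n) {
    static_assert(N == 2 || N == 3, "sketch only spells out N = 2 and N = 3");
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        // same checks as cond_loop<N>, unrolled by hand
        if (data[i] >= kIntervals<N>[0].first && data[i] <= kIntervals<N>[0].second) sum += data[i];
        if (data[i] >= kIntervals<N>[1].first && data[i] <= kIntervals<N>[1].second) sum += data[i];
        if constexpr (N > 2)
            if (data[i] >= kIntervals<N>[2].first && data[i] <= kIntervals<N>[2].second) sum += data[i];
    }
    return sum;
}
```

Between cond_loop<2> and cond_loop<3> only the number of intervals differs, which is what makes the 3x gap surprising given that the extra positive interval should be a no-op for positive data.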
Versions (-v output):