llvmbot opened 5 years ago
Looks like there are multiple issues here.
For LU, the loop unroller regresses the benchmark performance a lot. Maybe you can reduce LU a bit and find the hot loop that is unrolled too much? -> New bug report.
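One way to do that reduction, assuming the hot spot is the usual elimination loop nest: pull it into a standalone file and time it with and without -fno-unroll-loops. The sketch below is only an illustration in the spirit of SciMark2's LU kernel, not the verbatim source (the real LU_factor() also does partial pivoting), and the file name, sizes and fill pattern are made up.

```c
/* lu_reduce.c -- a reduced, self-contained stand-in for the hot
 * elimination loop in SciMark2's LU kernel (illustrative only).
 * Build it with and without -fno-unroll-loops and compare runtime
 * and generated code. */
#include <stdio.h>

#define N    100
#define REPS 500

static double A[N][N];

/* Strictly diagonally dominant matrix so elimination without
 * pivoting stays numerically well behaved. */
static void fill(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = (i == j) ? 1000.0 : 1.0 + ((i * 31 + j * 17) % 7);
}

/* Gaussian elimination without pivoting. */
static void factor(void)
{
    for (int j = 0; j < N - 1; j++) {
        double recp = 1.0 / A[j][j];
        for (int i = j + 1; i < N; i++) {
            double Aij = A[i][j] * recp;
            A[i][j] = Aij;
            /* Hot inner loop: the unroller's main target. */
            for (int k = j + 1; k < N; k++)
                A[i][k] -= Aij * A[j][k];
        }
    }
}

int main(void)
{
    double sum = 0.0;
    for (int r = 0; r < REPS; r++) {
        fill();
        factor();
        sum += A[N - 1][N - 1]; /* keep the result live */
    }
    printf("%f\n", sum);
    return 0;
}
```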
-Ox -march=native results again show regressions. Probably an SLP and cost-model issue? Can you try it with -Ox -march=native -fno-unroll-loops -fno-slp-vectorize?
Loop interchange seems to improve things a bit; that's nice.
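For context on the -fno-slp-vectorize suggestion: the SLP vectorizer packs adjacent, independent scalar operations into vector instructions, and with -march=native it gets wider vectors and a different cost model to reason about. A minimal illustration of the kind of straight-line pattern it targets (not taken from SciMark2; the function name is made up):

```c
/* The four independent scalar updates below are the kind of pattern
 * the SLP vectorizer tries to pack into a single vector multiply-add;
 * whether that pays off depends on the target's cost model, which is
 * why -fno-slp-vectorize is useful for isolating its effect. */
void axpy4(double *restrict y, const double *restrict x, double a)
{
    y[0] += a * x[0];
    y[1] += a * x[1];
    y[2] += a * x[2];
    y[3] += a * x[3];
}
```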
Thank you, David.
Here are the results:
CFLAGS:
╔═══╦═════════════════════════════════════╗
║ 1 ║ -Ox -march=native ║
╟───╫─────────────────────────────────────╢
║ 2 ║ -Ox -march=native -fno-unroll-loops ║
╟───╫─────────────────────────────────────╢
║ 3 ║ -Ox ║
╟───╫─────────────────────────────────────╢
║ 4 ║ -Ox -fno-unroll-loops ║
╟───╫─────────────────────────────────────╢
║ 5 ║ -Ox -mllvm -enable-loopinterchange ║
╚═══╩═════════════════════════════════════╝
Mflops:
╔═════╤══════╤═════════════╤═════════════╤═════════════╤═════════════╗
║ │ 1 │ 2 │ 3 │ 4 │ 5 ║
╠═════╪══════╪═════════════╪═════════════╪═════════════╪═════════════╣
║ -O1 │ 1494 │ 1494 │ 1498 │ 1499 │ 1491 ║
║ ├──────┼─────────────┼─────────────┼─────────────┼─────────────╢
║ │ 1430 │ 1428 │ 1414 │ 1410 │ 1389 ║
║ ├──────┼─────────────┼─────────────┼─────────────┼─────────────╢
║ │ 1117 │ 1118 │ 1117 │ 1117 │ 1117 ║
║ ├──────┼─────────────┼─────────────┼─────────────┼─────────────╢
║ │ 439 │ 439 │ 440 │ 440 │ 438 ║
║ ├──────┼─────────────┼─────────────┼─────────────┼─────────────╢
║ │ 2179 │ 2180 │ 2212 │ 2217 │ 2204 ║
║ ├──────┼─────────────┼─────────────┼─────────────┼─────────────╢
║ │ 2306 │ 2304 │ 2304 │ 2310 │ 2309 ║
╟─────┼──────┼──────┬──────┼──────┬──────┼──────┬──────┼──────┬──────╢
║ -O2 │ 1743 │ 1962 │ │ 1834 │ │ 1748 │ │ 1827 │ ║
║ ├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────╢
║ │ 1300 │ 1304 │ SLOW │ 1466 │ │ 1318 │ SLOW │ 1465 │ ║
║ ├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────╢
║ │ 1123 │ 1117 │ │ 1125 │ │ 1113 │ │ 1124 │ ║
║ ├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────╢
║ │ 440 │ 440 │ │ 440 │ │ 438 │ │ 440 │ ║
║ ├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────╢
║ │ 1771 │ 2041 │ │ 2210 │ │ 2185 │ │ 2196 │ ║
║ ├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────╢
║ │ 4083 │ 4909 │ │ 3928 │ │ 3683 │ │ 3912 │ ║
╟─────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────╢
║ -O3 │ 1786 │ 1964 │ │ 1820 │ │ 1731 │ │ 1818 │ ║
║ ├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────╢
║ │ 1304 │ 1298 │ SLOW │ 1455 │ │ 1417 │ │ 1451 │ ║
║ ├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────╢
║ │ 1128 │ 1117 │ │ 1126 │ │ 1114 │ │ 1127 │ ║
║ ├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────╢
║ │ 439 │ 439 │ │ 437 │ │ 439 │ │ 442 │ ║
║ ├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────╢
║ │ 1896 │ 2040 │ │ 2205 │ │ 2152 │ │ 2212 │ ║
║ ├──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────┼──────╢
║ │ 4163 │ 4928 │ │ 3875 │ SLOW │ 3533 │ SLOW │ 3858 │ SLOW ║
╚═════╧══════╧══════╧══════╧══════╧══════╧══════╧══════╧══════╧══════╝
Some trade-offs can be seen among these options, so maybe there is no single best selection of options?
Can you measure without -march=native? I found some cases (SPEC) where -march=native regresses performance (SLP vectorizer...). Or try measuring with -fno-unroll-loops; sometimes LLVM unrolls a lot :/
gcc with -O3 is very good (because they enable -floop-interchange with -O3?).
Try measuring clang again with clang -O3 -mllvm -enable-loopinterchange.
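To make the loop-interchange suggestion concrete: the pass swaps nested loops when that is legal and turns strided memory accesses into sequential ones. A textbook illustration (not SciMark2 code; the name and size are arbitrary):

```c
#define N 1024

/* The inner loop walks A down a column, i.e. with stride N in a
 * row-major array.  Loop interchange (gcc's -floop-interchange, or
 * clang with -mllvm -enable-loopinterchange) can swap the two loops
 * so the inner loop becomes unit-stride and cache-friendly. */
void scale(double A[N][N], double s)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            A[i][j] *= s;
}
```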
General question:
Should bots somehow track the perf of test-suite benchmarks compiled with -march=native and without -march=native?
@llvm/issue-subscribers-backend-x86
Author: None (llvmbot)
Extended Description
FFT and Sparse matmult in Scimark2 (https://math.nist.gov/scimark2/download_c.html) are slower with -O2 and -O3 than with -O1. In gcc 9.2.0, everything is OK.
All the experiments were carried out on CentOS 7.6.1810 (Core).
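For reference, the Sparse matmult test in SciMark2 is essentially a compressed-sparse-row matrix-vector product. The sketch below shows its rough shape (simplified, not the verbatim SparseCompRow.c; the real routine also repeats the product many times for timing). The indirect x[col[i]] load plus the per-row reduction make this kernel quite sensitive to unrolling and vectorization decisions.

```c
/* Rough shape of SciMark2's sparse matmult: y = A*x with A stored in
 * compressed-sparse-row form (val/row/col).  Simplified sketch. */
void sparse_matmult(int M, double *y, const double *val,
                    const int *row, const int *col, const double *x)
{
    for (int r = 0; r < M; r++) {
        double sum = 0.0;
        for (int i = row[r]; i < row[r + 1]; i++)
            sum += x[col[i]] * val[i];   /* indirect (gather) load */
        y[r] = sum;
    }
}
```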