llvmbot opened 7 years ago
In the correct_schedule dump, it looks like the pre-RA scheduler is scheduling 65 instructions in 66 cycles.
```
Scheduling SU(63) %vreg175 = EORWrs %vreg105, %vreg136, 206; GPR32:%vreg175,%vreg105,%vreg136
  Ready @2c
** ScheduleDAGMILive::schedule picking next node
SU(55) A53UnitALU=3c
SU(47) A53UnitALU=3c
SU(39) A53UnitALU=3c
```
checkHazard thinks that the ALU is reserved until the next cycle. Like there's only one ALU.
Try with the latest LLVM if you can. Then I think you could create a very small .ll test case to expose this problem. Sorry I'm not available to debug it.
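The symptom described above ("like there's only one ALU") can be pictured with a toy scoreboard model. This is a sketch for illustration only, not LLVM's actual HazardRecognizer code: if the machine model books a single `A53UnitALU` entry for multiple cycles, every later candidate that needs that unit is rejected, even though the hardware has two ALU pipes.

```python
class Scoreboard:
    """Toy per-unit reservation table (illustration, not LLVM's code)."""

    def __init__(self, units):
        # unit name -> first cycle at which the unit is free again
        self.free_at = {u: 0 for u in units}

    def check_hazard(self, unit, cycle):
        # True means "stall": the unit is still reserved at this cycle
        return self.free_at[unit] > cycle

    def reserve(self, unit, cycle, busy_cycles):
        self.free_at[unit] = cycle + busy_cycles

# One ALU entry in the model: an op holding the ALU for 2 cycles blocks
# the next ALU op at cycle 1 -- the "like there's only one ALU" symptom.
sb = Scoreboard(["A53UnitALU"])
sb.reserve("A53UnitALU", cycle=0, busy_cycles=2)
print(sb.check_hazard("A53UnitALU", 1))  # True: second op must wait
print(sb.check_hazard("A53UnitALU", 2))  # False: unit free again
```

Whether the real bug is in the scheduling model's resource counts or in the hazard check itself is exactly what the thread below is trying to pin down.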
Thanks Andrew.
This is the latest LLVM ('ish)
```
$ ./clang --version
clang version 5.0.0 (https://github.com/llvm-mirror/clang 31801a78220872a1dcee9b261328ce7239eaa7e9) (https://github.com/llvm-mirror/llvm 39984d5813cf79959c014093f5b9a3c71a6d7272)
Target: aarch64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/davide/compilers/clang-install/bin/.
```
I'll try to reduce it to something manageable and debug it. Sorry I can't be more helpful; the scheduler is not my cup of tea (I just happened to notice this problem while looking at something else).
> In the correct_schedule dump, it looks like the pre-RA scheduler is scheduling 65 instructions in 66 cycles.
>
> ```
> Scheduling SU(63) %vreg175
> ** ScheduleDAGMILive::schedule picking next node
> SU(55) A53UnitALU=3c
> SU(47) A53UnitALU=3c
> SU(39) A53UnitALU=3c
> ```
>
> checkHazard thinks that the ALU is reserved until the next cycle. Like there's only one ALU.
>
> Try with the latest LLVM if you can. Then I think you could create a very small .ll test case to expose this problem. Sorry I'm not available to debug it.
It's scheduling bottom up, and after scheduling two rotates, decides to avoid increasing register pressure by scheduling the add. Supposedly, that's ok because the machine is dual-issue and the instructions only have one cycle latency.
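Whether that trade-off pays off depends on the actual latencies. Below is a toy in-order dual-issue model (the numbers are assumptions purely for illustration, not the real Cortex-A53 latencies: 1-cycle add, 3-cycle eor-with-ror) comparing an interleave-by-two schedule with a group-by-four one over four independent add → eor → add chains:

```python
from collections import namedtuple

# (dest register, source registers, latency) -- the latencies below are
# assumptions for illustration only, not the real Cortex-A53 numbers.
Insn = namedtuple("Insn", "dest srcs lat")

def cycles(schedule, width=2):
    """Greedy in-order issue model: an insn waits for its sources
    (RAW dependencies only) and at most `width` insns issue per cycle."""
    ready = {}                     # reg -> cycle its value is available
    cycle, slots = 0, width
    for insn in schedule:
        start = max([cycle] + [ready.get(s, 0) for s in insn.srcs])
        if start > cycle or slots == 0:
            cycle, slots = max(start, cycle + 1), width
        ready[insn.dest] = cycle + insn.lat
        slots -= 1
    return cycle + 1

def chain(i):
    # One salsa20-ish column: add -> eor+ror (assumed 3c) -> add
    return [Insn(f"t{i}", (f"a{i}", f"d{i}"), 1),
            Insn(f"b{i}", (f"bin{i}", f"t{i}"), 3),
            Insn(f"s{i}", (f"b{i}", f"c{i}"), 1)]

ops = [chain(i) for i in range(4)]

def grouped(step):
    # `step` chains at a time: all their adds, then their eors, ...
    out = []
    for g in range(0, 4, step):
        for k in range(3):
            out += [ops[i][k] for i in range(g, g + step)]
    return out

llvm_like = grouped(2)   # add,add,eor,eor,add,add per pair of chains
gcc_like = grouped(4)    # 4 adds, then 4 eors, then 4 adds
print(cycles(llvm_like), cycles(gcc_like))  # prints: 10 7
```

With 1-cycle latency everywhere the two orders would be equivalent; the gap only opens if the eor-with-rotate really takes longer, which is the question about the A53 model raised in this thread.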
I also tried to disable register pressure as input to the scheduler and the result I got is very similar (snippet):

```
[...]
40: 0b120113  add w19, w8, w18
44: 0b0d0134  add w20, w9, w13
48: 4ad365ce  eor w14, w14, w19, ror #25
4c: 4ad464a5  eor w5, w5, w20, ror #25
50: 0b1201d3  add w19, w14, w18
54: 0b0d00b4  add w20, w5, w13
58: 4ad35d6b  eor w11, w11, w19, ror #23
5c: 0b020195  add w21, w12, w2
60: 4ad45e10  eor w16, w16, w20, ror #23
64: 0b0e0173  add w19, w11, w14
68: 0b050214  add w20, w16, w5
6c: 4ad56484  eor w4, w4, w21, ror #25
70: 4ad34d08  eor w8, w8, w19, ror #19
74: 0b0f0153  add w19, w10, w15
78: 0b020095  add w21, w4, w2
7c: 4ad44d29  eor w9, w9, w20, ror #19
80: 4ad364e7  eor w7, w7, w19, ror #25
84: 0b100133  add w19, w9, w16
88: 4ad55cc6  eor w6, w6, w21, ror #23
8c: 0b0400d5  add w21, w6, w4
90: 4ad339ad  eor w13, w13, w19, ror #14
94: 0b0f00f3  add w19, w7, w15
98: 0b0b0114  add w20, w8, w11
9c: 4ad54d8c  eor w12, w12, w21, ror #19
a0: 4ad35e31  eor w17, w17, w19, ror #23
a4: 4ad43a52  eor w18, w18, w20, ror #14
a8: 0b060194  add w20, w12, w6
ac: 0b070233  add w19, w17, w7
b0: 4ad43842  eor w2, w2, w20, ror #14
b4: 4ad34d4a  eor w10, w10, w19, ror #19
b8: 0b1200f4  add w20, w7, w18
bc: 0b110153  add w19, w10, w17
c0: 4ad46529  eor w9, w9, w20, ror #25
c4: 4ad339ef  eor w15, w15, w19, ror #14
c8: 0b120133  add w19, w9, w18
cc: 0b0d01d4  add w20, w14, w13
d0: 4ad35cc6  eor w6, w6, w19, ror #23
d4: 4ad4658c  eor w12, w12, w20, ror #25
[...]
```
> It's scheduling bottom up, and after scheduling two rotates, decides to avoid increasing register pressure by scheduling the add. Supposedly, that's ok because the machine is dual-issue and the instructions only have one cycle latency.
I'll take a look at the GCC scheduling model for the cortex and see why they schedule things this way.
> In the LLVM version, there are 9 adds + 9 eors instead of 8. So it's off to a bad start.
I was confused by this comment, as there should be 8 adds and eors in the LLVM version too, and then I realized I could've made a mistake. In fact, I uploaded a stale version of the -debug-only=misched output, sorry. The new one is attached to this reply.
> Also, in the gcc version, I don't see any cyclic dependencies between the adds and the eors. If you do have OOO, that's why you have stalls.
I don't have OOO; this is run on an in-order processor.
> It's scheduling bottom up, and after scheduling two rotates, decides to avoid increasing register pressure by scheduling the add. Supposedly, that's ok because the machine is dual-issue and the instructions only have one cycle latency.
>
> In the LLVM version, there are 9 adds + 9 eors instead of 8. So it's off to a bad start.
>
> Also, in the gcc version, I don't see any cyclic dependencies between the adds and the eors. If you do have OOO, that's why you have stalls.
Thanks. That explains it :).
Do you know why the LLVM version is slower?
I mean from a high level view (meaning I actually didn't look at the dependencies and such), I would have expected the OOO engine to perform well in both cases.
It does on an out-of-order engine (checked), but there are still in-order processors out there (the same would be true for ARM, e.g. Cortex-A7, which suffers from the same stalling problem). I measured with a CPU profiler on Linux and the two runs show a substantial difference in the number of stalls.
Cortex-A53 also doesn't have an out-of-order engine (I think). It's the little core.
Oh, I see that there are differences in the ror modifier. Does that change the dependency graph?
> Do you know why the LLVM version is slower?
>
> I mean from a high level view (meaning I actually didn't look at the dependencies and such), I would have expected the OOO engine to perform well in both cases.
For easier viewing, here it is on godbolt.
Please note that gcc generates the same sequence (4 adds followed by 4 eors, grouped by ror shift) even without tuning, so maybe there's a missing transform in llvm that groups these together. Thoughts?
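The regular 4-add/4-eor pattern falls out of the structure of the hot loop, which is salsa20's quarter-round: within one column every rotate-xor feeds the next add (a serial chain), while the four columns are mutually independent, which is what makes the grouping legal. Note also that AArch64 has no rotate-left immediate, so a rotate-left by n is emitted as eor with ror #(32 - n): the ror #25/#23/#19/#14 in the dumps are exactly salsa20's rotate amounts 7, 9, 13, 18. A Python sketch of one quarter-round (structure and amounts per the published salsa20 spec; the function and variable names are mine):

```python
MASK = 0xFFFFFFFF

def rotl32(x, n):
    # On AArch64 this lowers to eor ..., ror #(32 - n): ror #25 == rotl 7
    return ((x << n) | (x >> (32 - n))) & MASK

def quarterround(a, b, c, d):
    # Serial chain: every rotate-xor result feeds the next add, so one
    # column offers no instruction-level parallelism; only independent
    # columns do.
    b ^= rotl32((a + d) & MASK, 7)
    c ^= rotl32((b + a) & MASK, 9)
    d ^= rotl32((c + b) & MASK, 13)
    a ^= rotl32((d + c) & MASK, 18)
    return a, b, c, d
```

Scheduling this well on an in-order dual-issue core therefore amounts to interleaving enough independent columns to cover the add/eor latencies.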
Extended Description
This is the salsa20 benchmark from the testsuite (SingleSource). I'm not sure if the model can be improved or this is a general issue with the instruction scheduler heuristics.
Passing -O3 -mcpu=cortex-a53 -mtune=cortex-a53, LLVM generates the following code for the hot loop (subset of instructions): [...] while gcc 7: [...]
The latter results in many more stalls and a ~20% runtime regression. The SelectionDAG for the BB pre-scheduling and the initial IR are attached.