llvmbot opened 7 years ago
In the correct_schedule dump, it looks like the pre-RA scheduler is scheduling 65 instructions in 66 cycles.
```
Scheduling SU(63) %vreg175 = EORWrs %vreg105, %vreg136, 206; GPR32:%vreg175,%vreg105,%vreg136
  Ready @2c
** ScheduleDAGMILive::schedule picking next node
SU(55) A53UnitALU=3c
SU(47) A53UnitALU=3c
SU(39) A53UnitALU=3c
```
checkHazard thinks that the ALU is reserved until the next cycle. Like there's only one ALU.
Try with the latest LLVM if you can. Then I think you could create a very small .ll test case to expose this problem. Sorry I'm not available to debug it.
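The symptom described above ("like there's only one ALU") can be pictured with a toy scoreboard model. This is a sketch for illustration only, not LLVM's actual HazardRecognizer code: if the machine model books a single `A53UnitALU` entry for multiple cycles, every later candidate that needs that unit is rejected, even though the hardware has two ALU pipes.

```python
class Scoreboard:
    """Toy per-unit reservation table (illustration, not LLVM's code)."""

    def __init__(self, units):
        # unit name -> first cycle at which the unit is free again
        self.free_at = {u: 0 for u in units}

    def check_hazard(self, unit, cycle):
        # True means "stall": the unit is still reserved at this cycle
        return self.free_at[unit] > cycle

    def reserve(self, unit, cycle, busy_cycles):
        self.free_at[unit] = cycle + busy_cycles

# One ALU entry in the model: an op holding the ALU for 2 cycles blocks
# the next ALU op at cycle 1 -- the "like there's only one ALU" symptom.
sb = Scoreboard(["A53UnitALU"])
sb.reserve("A53UnitALU", cycle=0, busy_cycles=2)
print(sb.check_hazard("A53UnitALU", 1))  # True: second op must wait
print(sb.check_hazard("A53UnitALU", 2))  # False: unit free again
```

Whether the real bug is in the scheduling model's resource counts or in the hazard check itself is exactly what the thread below is trying to pin down.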
Thanks Andrew.
This is the latest LLVM ('ish)
```
$ ./clang --version
clang version 5.0.0 (https://github.com/llvm-mirror/clang 31801a78220872a1dcee9b261328ce7239eaa7e9) (https://github.com/llvm-mirror/llvm 39984d5813cf79959c014093f5b9a3c71a6d7272)
Target: aarch64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/davide/compilers/clang-install/bin/.
```
I'll try to reduce it to something manageable and debug it. Sorry I can't be more helpful; the scheduler is not my cup of tea (I just happened to notice this problem while looking at something else).
> In the correct_schedule dump, it looks like the pre-RA scheduler is scheduling 65 instructions in 66 cycles.
>
> ```
> Scheduling SU(63) %vreg175
> ** ScheduleDAGMILive::schedule picking next node
> SU(55) A53UnitALU=3c
> SU(47) A53UnitALU=3c
> SU(39) A53UnitALU=3c
> ```
>
> checkHazard thinks that the ALU is reserved until the next cycle. Like there's only one ALU.
>
> Try with the latest LLVM if you can. Then I think you could create a very small .ll test case to expose this problem. Sorry I'm not available to debug it.
It's scheduling bottom up, and after scheduling two rotates, decides to avoid increasing register pressure by scheduling the add. Supposedly, that's ok because the machine is dual-issue and the instructions only have one cycle latency.
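Whether that trade-off pays off depends on the actual latencies. Below is a toy in-order dual-issue model (the numbers are assumptions purely for illustration, not the real Cortex-A53 latencies: 1-cycle add, 3-cycle eor-with-ror) comparing an interleave-by-two schedule with a group-by-four one over four independent add → eor → add chains:

```python
from collections import namedtuple

# (dest register, source registers, latency) -- the latencies below are
# assumptions for illustration only, not the real Cortex-A53 numbers.
Insn = namedtuple("Insn", "dest srcs lat")

def cycles(schedule, width=2):
    """Greedy in-order issue model: an insn waits for its sources
    (RAW dependencies only) and at most `width` insns issue per cycle."""
    ready = {}                     # reg -> cycle its value is available
    cycle, slots = 0, width
    for insn in schedule:
        start = max([cycle] + [ready.get(s, 0) for s in insn.srcs])
        if start > cycle or slots == 0:
            cycle, slots = max(start, cycle + 1), width
        ready[insn.dest] = cycle + insn.lat
        slots -= 1
    return cycle + 1

def chain(i):
    # One salsa20-ish column: add -> eor+ror (assumed 3c) -> add
    return [Insn(f"t{i}", (f"a{i}", f"d{i}"), 1),
            Insn(f"b{i}", (f"bin{i}", f"t{i}"), 3),
            Insn(f"s{i}", (f"b{i}", f"c{i}"), 1)]

ops = [chain(i) for i in range(4)]

def grouped(step):
    # `step` chains at a time: all their adds, then their eors, ...
    out = []
    for g in range(0, 4, step):
        for k in range(3):
            out += [ops[i][k] for i in range(g, g + step)]
    return out

llvm_like = grouped(2)   # add,add,eor,eor,add,add per pair of chains
gcc_like = grouped(4)    # 4 adds, then 4 eors, then 4 adds
print(cycles(llvm_like), cycles(gcc_like))  # prints: 10 7
```

With 1-cycle latency everywhere the two orders would be equivalent; the gap only opens if the eor-with-rotate really takes longer, which is the question about the A53 model raised in this thread.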
I also tried to disable register pressure as input to the scheduler and the result I got is very similar (snippet):

```
[...]
40: 0b120113  add w19, w8, w18
44: 0b0d0134  add w20, w9, w13
48: 4ad365ce  eor w14, w14, w19, ror #25
4c: 4ad464a5  eor w5, w5, w20, ror #25
50: 0b1201d3  add w19, w14, w18
54: 0b0d00b4  add w20, w5, w13
58: 4ad35d6b  eor w11, w11, w19, ror #23
5c: 0b020195  add w21, w12, w2
60: 4ad45e10  eor w16, w16, w20, ror #23
64: 0b0e0173  add w19, w11, w14
68: 0b050214  add w20, w16, w5
6c: 4ad56484  eor w4, w4, w21, ror #25
70: 4ad34d08  eor w8, w8, w19, ror #19
74: 0b0f0153  add w19, w10, w15
78: 0b020095  add w21, w4, w2
7c: 4ad44d29  eor w9, w9, w20, ror #19
80: 4ad364e7  eor w7, w7, w19, ror #25
84: 0b100133  add w19, w9, w16
88: 4ad55cc6  eor w6, w6, w21, ror #23
8c: 0b0400d5  add w21, w6, w4
90: 4ad339ad  eor w13, w13, w19, ror #14
94: 0b0f00f3  add w19, w7, w15
98: 0b0b0114  add w20, w8, w11
9c: 4ad54d8c  eor w12, w12, w21, ror #19
a0: 4ad35e31  eor w17, w17, w19, ror #23
a4: 4ad43a52  eor w18, w18, w20, ror #14
a8: 0b060194  add w20, w12, w6
ac: 0b070233  add w19, w17, w7
b0: 4ad43842  eor w2, w2, w20, ror #14
b4: 4ad34d4a  eor w10, w10, w19, ror #19
b8: 0b1200f4  add w20, w7, w18
bc: 0b110153  add w19, w10, w17
c0: 4ad46529  eor w9, w9, w20, ror #25
c4: 4ad339ef  eor w15, w15, w19, ror #14
c8: 0b120133  add w19, w9, w18
cc: 0b0d01d4  add w20, w14, w13
d0: 4ad35cc6  eor w6, w6, w19, ror #23
d4: 4ad4658c  eor w12, w12, w20, ror #25
[...]
```
> It's scheduling bottom up, and after scheduling two rotates, decides to avoid increasing register pressure by scheduling the add. Supposedly, that's ok because the machine is dual-issue and the instructions only have one cycle latency.
I'll take a look at the GCC scheduling model for the cortex and see why they schedule things this way.
> In the LLVM version, there are 9 adds + 9 eors instead of 8. So it's off to a bad start.
I was confused by this comment, as there should be 8 adds and eors in the LLVM version too, and then I realized I could've made a mistake. In fact, I uploaded a stale version of the -debug-only=misched output, sorry. The new one is attached to this reply.
> Also, in the gcc version, I don't see any cyclic dependencies between the adds and the eors. If you do have OOO, that's why you have stalls.
I don't have OOO; this is run on an in-order processor.
> It's scheduling bottom up, and after scheduling two rotates, decides to avoid increasing register pressure by scheduling the add. Supposedly, that's ok because the machine is dual-issue and the instructions only have one cycle latency.
>
> In the LLVM version, there are 9 adds + 9 eors instead of 8. So it's off to a bad start.
>
> Also, in the gcc version, I don't see any cyclic dependencies between the adds and the eors. If you do have OOO, that's why you have stalls.
Thanks. That explains it :).
Do you know why the LLVM version is slower?
I mean from a high level view (meaning I actually didn't look at the dependencies and such), I would have expected the OOO engine to perform well in both cases.
It does on an out-of-order engine (checked), but there are still in-order processors out there (the same would be true for ARM, e.g. Cortex-A7, which suffers from the same stalling problem). I measured with a CPU profiler on Linux and the two runs show a substantial difference in the number of stalls.
Cortex-A53 also doesn't have an out-of-order engine (I think). It's the little core.
Oh, I see that there are differences in the ror modifier. Does that change the dependency graph?
> Do you know why the LLVM version is slower?
>
> I mean from a high level view (meaning I actually didn't look at the dependencies and such), I would have expected the OOO engine to perform well in both cases.
For easier viewing, here it is on godbolt.
Please note that gcc generates the same sequence (4 adds followed by 4 eors, grouped by ror shift) even without tuning, so maybe there's a missing transform in llvm that groups these together. Thoughts?
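The regular 4-add/4-eor pattern falls out of the structure of the hot loop, which is salsa20's quarter-round: within one column every rotate-xor feeds the next add (a serial chain), while the four columns are mutually independent, which is what makes the grouping legal. Note also that AArch64 has no rotate-left immediate, so a rotate-left by n is emitted as eor with ror #(32 - n): the ror #25/#23/#19/#14 in the dumps are exactly salsa20's rotate amounts 7, 9, 13, 18. A Python sketch of one quarter-round (structure and amounts per the published salsa20 spec; the function and variable names are mine):

```python
MASK = 0xFFFFFFFF

def rotl32(x, n):
    # On AArch64 this lowers to eor ..., ror #(32 - n): ror #25 == rotl 7
    return ((x << n) | (x >> (32 - n))) & MASK

def quarterround(a, b, c, d):
    # Serial chain: every rotate-xor result feeds the next add, so one
    # column offers no instruction-level parallelism; only independent
    # columns do.
    b ^= rotl32((a + d) & MASK, 7)
    c ^= rotl32((b + a) & MASK, 9)
    d ^= rotl32((c + b) & MASK, 13)
    a ^= rotl32((d + c) & MASK, 18)
    return a, b, c, d
```

Scheduling this well on an in-order dual-issue core therefore amounts to interleaving enough independent columns to cover the add/eor latencies.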
Extended Description
This is the salsa20 benchmark from the testsuite (SingleSource). I'm not sure if the model can be improved or this is a general issue with the instruction scheduler heuristics.
Passing -O3 -mcpu=cortex-a53 -mtune=cortex-a53, LLVM generates the following code for the hot loop (subset of instructions): [...] while gcc 7: [...]
The latter results in many more stalls and a ~20% runtime regression. The SelectionDAG for the BB pre-scheduling and the initial IR are attached.