rsc opened 3 years ago
Here is another, separate opportunity for GOAMD64=v3 compilation. The SHRXQ instruction takes an explicit shift register, has separate source and destination operands, and can read its source from memory. That allows reducing the loop to
body:
MOVBLZX (AX)(DI*1), SI // b = p[i]
LEAQ 1(DI), DI // i++
SHRXQ DX, (R8)(SI*8), DX // state = dfa[b] >> (state&63)
That change runs at 3400 MB/s (!).
(The DFA tables were carefully constructed exactly to enable this implementation.)
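The packed-shift transition trick can be seen in pure Go. The sketch below is a toy two-state automaton, not the actual utf8.Valid table: the state variable holds the current state's bit offset, each table entry packs every state's successor into 6-bit fields, and one shift performs the transition, exactly the `state = dfa[b] >> (state&63)` shape above.

```go
package main

import "fmt"

// seenB reports whether s contains the byte 'b', using a packed-shift
// DFA: states are bit offsets (0 = "no b yet", 6 = "b seen", sticky),
// and dfa[c] packs each state's successor offset into 6-bit fields,
// so the transition is a single shift. This toy automaton only
// illustrates the encoding; it is not the utf8.Valid table.
func seenB(s string) bool {
	var dfa [256]uint64
	for c := 0; c < 256; c++ {
		if c == 'b' {
			dfa[c] = 6<<0 | 6<<6 // both states go to "b seen"
		} else {
			dfa[c] = 0<<0 | 6<<6 // each state stays put
		}
	}
	state := uint64(0)
	for i := 0; i < len(s); i++ {
		state = dfa[s[i]] >> (state & 63) // the SHRXQ step
	}
	return state&63 == 6
}

func main() {
	fmt.Println(seenB("aab"), seenB("aaa")) // true false
}
```

Only the low 6 bits of `state` are meaningful between iterations; the shift count is taken mod 64, which is why the assembly needs no masking instruction.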
@rsc sorry for hijacking the thread, but what does GOAMD64=v3 mean?
I see this hasn't had attention for a while, but this is a problem I've noticed in ppc64 code too: invariant values are not moved out of loops. I thought at one time there was work to do this, but it must have been abandoned.
Here is one example:
ff0a0: 00 00 e3 8b lbz r31,0(r3) <--- nil check is not needed on each iteration
ff0a4: 24 1f c7 78 rldicr r7,r6,3,60 \
ff0a8: 14 3a 03 7d add r8,r3,r7 / These two could be strength-reduced?
ff0ac: 00 00 28 e9 ld r9,0(r8)
ff0b0: 2a 20 47 7d ldx r10,r7,r4
ff0b4: 2a 28 67 7d ldx r11,r7,r5
ff0b8: 78 52 6a 7d xor r10,r11,r10
ff0bc: b0 00 61 39 addi r11,r1,176 <---- this is invariant in the loop
ff0c0: 2a 58 e7 7c ldx r7,r7,r11
ff0c4: 78 52 e7 7c xor r7,r7,r10
ff0c8: 78 3a 27 7d xor r7,r9,r7
ff0cc: 00 00 e8 f8 std r7,0(r8)
ff0d0: 01 00 c6 38 addi r6,r6,1
ff0d4: 80 00 26 2c cmpdi r6,128
ff0d8: c8 ff 80 41 blt ff0a0 <golang.org/x/crypto/argon2.processBlockGeneric+0x3a0>
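For reference, the quoted loop has roughly this source shape (a sketch of the access pattern, not the actual argon2 code): the `addi r11,r1,176` is the base address of a stack temporary like `tmp` below, which is invariant across iterations, as are the nil checks on the array pointers.

```go
package main

import "fmt"

// xorBlock mirrors the shape of the processBlockGeneric loop quoted
// above: out[i] ^= a[i] ^ b[i] ^ tmp[i]. This is an illustrative
// sketch, not the real argon2 code. The base address of tmp and the
// nil checks on the pointers are loop-invariant and could be hoisted
// out of the loop body by the compiler.
func xorBlock(out, a, b, tmp *[128]uint64) {
	for i := 0; i < 128; i++ {
		out[i] ^= a[i] ^ b[i] ^ tmp[i]
	}
}

func main() {
	var out, a, b, tmp [128]uint64
	a[0], b[0], tmp[0] = 1, 2, 4
	xorBlock(&out, &a, &b, &tmp)
	fmt.Println(out[0]) // 1^2^4 = 7
}
```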
Change https://go.dev/cl/385174 mentions this issue: cmd/compile: use shlx&shrx instruction for GOAMD64>=v3
The generated x86 code can be improved in some fairly simple ways - hoisting computed constants out of loop bodies, and avoiding unnecessary register moves - that have a significant performance impact on tight loops. In the following example those improvements produce a 35% speedup.
Here is an alternate, DFA-based implementation of `utf8.Valid` that I have been playing with. There are no big benchmarks of Valid in the package, but here are some that could be added:
The old Valid implementation runs at around 1450 MB/s. The implementation above runs at around 1600 MB/s. Better but not what I had hoped. It compiles as follows:
Translating this to proper non-regabi assembly I get:
This also runs at about 1600 MB/s.
First optimization: the `LEAQ ·dfa(SB), R8` should be hoisted out of the loop body. (I tried to do this in the Go version with `dfa := &dfa`, but it got constant-propagated away!) That change brings it up to 1750 MB/s.
Second optimization: use DI for `i` instead of CX, to avoid the pressure on CX. This lets the `LEAQ 1(CX), DI` and the later `MOVQ DI, CX` collapse to just `LEAQ 1(DI), DI`. That change brings it up to 1900 MB/s.
The body is now:
Third optimization: since `DX` is moving into `CX`, do that one instruction earlier, allowing the use of `SI` to be optimized into `DX` to eliminate the final `MOVQ`. I think this ends up being just "compute the shift amount before the shifted value". That change brings it up to 2150 MB/s.
This is still a direct translation of the Go code: there are no tricks the compiler couldn't do. For this particular loop, the optimizations make the code run 35% faster.
Final assembly: