port amd64 assembly to arm64

AWSjswinney commented 4 years ago

This change was produced by taking the amd64 assembly and reproducing it as closely as possible for the arm64 arch.

The main differences:

arm64 uses registers R1-R17 which are mapped directly onto an amd64 counterpart
arm64 requires 8 additional bytes of stack so callee args are displaced by 8 bytes from amd64
operands to CMP instructions are reversed except in a few cases where arm64 uses a BLS (branch less-same) instead of JAE (jump above-equal)
immediates in some cases have to be split to a separate MOVD instruction
shifts can be combined with another instruction, such as an ADD, in some cases
The amd64 BSFQ instruction is implemented with a bit reversal and leading zero count instruction
memclear on arm64 makes use of the SIMD instructions to clear 64 bytes at a time and uses a pointer comparison instead of a counter to reduce the number of instructions in the loop

Tested on an AWS m6g.large (ARMv8.2):

name              old time/op    new time/op     delta
WordsDecode1e1-2    29.2ns ± 0%     26.2ns ± 1%   -10.51%  (p=0.000 n=9+10)
WordsDecode1e2-2     187ns ± 0%      107ns ± 0%   -42.78%  (p=0.000 n=7+10)
WordsDecode1e3-2    2.16µs ± 1%     0.95µs ± 0%   -55.85%  (p=0.000 n=10+10)
WordsDecode1e4-2    30.1µs ± 0%     10.4µs ± 2%   -65.40%  (p=0.000 n=10+10)
WordsDecode1e5-2     348µs ± 0%      168µs ± 0%   -51.86%  (p=0.000 n=10+9)
WordsDecode1e6-2    3.47ms ± 0%     1.71ms ± 0%   -50.66%  (p=0.000 n=10+10)
WordsEncode1e1-2    19.4ns ± 0%     21.7ns ± 1%   +12.06%  (p=0.000 n=8+10)
WordsEncode1e2-2    2.09µs ± 0%     0.25µs ± 0%   -88.14%  (p=0.000 n=9+10)
WordsEncode1e3-2    6.67µs ± 1%     2.49µs ± 0%   -62.63%  (p=0.000 n=10+10)
WordsEncode1e4-2    63.5µs ± 1%     29.4µs ± 1%   -53.63%  (p=0.000 n=10+9)
WordsEncode1e5-2     722µs ± 0%      345µs ± 0%   -52.21%  (p=0.000 n=10+10)
WordsEncode1e6-2    7.17ms ± 0%     3.41ms ± 0%   -52.46%  (p=0.000 n=10+8)
RandomEncode-2       106µs ± 2%       78µs ± 0%   -26.02%  (p=0.000 n=10+10)
_UFlat0-2            152µs ± 0%       69µs ± 1%   -54.90%  (p=0.000 n=10+9)
_UFlat1-2           1.57ms ± 0%     0.77ms ± 0%   -51.10%  (p=0.000 n=9+10)
_UFlat2-2           6.84µs ± 0%     6.55µs ± 0%    -4.25%  (p=0.000 n=10+8)
_UFlat3-2            312ns ± 0%      183ns ± 0%   -41.35%  (p=0.000 n=10+9)
_UFlat4-2           15.4µs ± 1%      9.7µs ± 1%   -36.79%  (p=0.000 n=10+10)
_UFlat5-2            625µs ± 0%      301µs ± 1%   -51.88%  (p=0.000 n=9+10)
_UFlat6-2            570µs ± 0%      278µs ± 0%   -51.18%  (p=0.000 n=10+9)
_UFlat7-2            490µs ± 0%      240µs ± 1%   -50.95%  (p=0.000 n=10+10)
_UFlat8-2           1.52ms ± 0%     0.74ms ± 0%   -51.01%  (p=0.000 n=8+7)
_UFlat9-2           2.00ms ± 0%     1.01ms ± 0%   -49.49%  (p=0.000 n=10+10)
_UFlat10-2           132µs ± 0%       62µs ± 2%   -53.19%  (p=0.000 n=10+10)
_UFlat11-2           497µs ± 0%      258µs ± 0%   -48.11%  (p=0.000 n=10+9)
_ZFlat0-2            346µs ± 1%      136µs ± 5%   -60.70%  (p=0.000 n=10+9)
_ZFlat1-2           3.63ms ± 0%     1.76ms ± 0%   -51.60%  (p=0.000 n=10+8)
_ZFlat2-2           13.2µs ± 0%      9.5µs ± 0%   -27.62%  (p=0.000 n=8+9)
_ZFlat3-2           2.49µs ± 0%     0.45µs ± 0%   -81.96%  (p=0.002 n=8+10)
_ZFlat4-2           50.5µs ± 0%     15.7µs ± 1%   -68.96%  (p=0.000 n=10+9)
_ZFlat5-2           1.40ms ± 0%     0.56ms ± 0%   -60.20%  (p=0.000 n=9+9)
_ZFlat6-2           1.13ms ± 0%     0.54ms ± 0%   -52.39%  (p=0.000 n=10+9)
_ZFlat7-2            961µs ± 0%      472µs ± 0%   -50.83%  (p=0.000 n=10+10)
_ZFlat8-2           3.03ms ± 0%     1.43ms ± 0%   -52.90%  (p=0.000 n=9+10)
_ZFlat9-2           3.88ms ± 0%     1.95ms ± 0%   -49.72%  (p=0.000 n=10+10)
_ZFlat10-2           339µs ± 0%      123µs ± 3%   -63.82%  (p=0.000 n=10+10)
_ZFlat11-2           973µs ± 0%      433µs ± 0%   -55.49%  (p=0.000 n=10+10)
ExtendMatch-2       22.1µs ± 1%      9.8µs ± 0%   -55.63%  (p=0.000 n=10+10)

name              old speed      new speed       delta
WordsDecode1e1-2   342MB/s ± 0%    382MB/s ± 1%   +11.77%  (p=0.000 n=9+10)
WordsDecode1e2-2   535MB/s ± 0%    934MB/s ± 0%   +74.43%  (p=0.000 n=10+10)
WordsDecode1e3-2   463MB/s ± 1%   1049MB/s ± 0%  +126.52%  (p=0.000 n=10+10)
WordsDecode1e4-2   333MB/s ± 0%    961MB/s ± 2%  +189.04%  (p=0.000 n=10+10)
WordsDecode1e5-2   287MB/s ± 0%    597MB/s ± 0%  +107.72%  (p=0.000 n=10+9)
WordsDecode1e6-2   288MB/s ± 0%    584MB/s ± 0%  +102.67%  (p=0.000 n=10+10)
WordsEncode1e1-2   515MB/s ± 0%    460MB/s ± 0%   -10.70%  (p=0.000 n=10+10)
WordsEncode1e2-2  47.8MB/s ± 0%  403.3MB/s ± 0%  +743.40%  (p=0.000 n=10+10)
WordsEncode1e3-2   150MB/s ± 1%    401MB/s ± 0%  +167.66%  (p=0.000 n=10+9)
WordsEncode1e4-2   157MB/s ± 1%    340MB/s ± 1%  +115.66%  (p=0.000 n=10+9)
WordsEncode1e5-2   138MB/s ± 0%    290MB/s ± 0%  +109.24%  (p=0.000 n=10+10)
WordsEncode1e6-2   139MB/s ± 0%    293MB/s ± 0%  +110.35%  (p=0.000 n=10+8)
RandomEncode-2    9.93GB/s ± 2%  13.42GB/s ± 0%   +35.15%  (p=0.000 n=10+10)
_UFlat0-2          672MB/s ± 0%   1489MB/s ± 1%  +121.75%  (p=0.000 n=10+9)
_UFlat1-2          446MB/s ± 0%    913MB/s ± 0%  +104.48%  (p=0.000 n=9+10)
_UFlat2-2         18.0GB/s ± 0%   18.8GB/s ± 0%    +4.44%  (p=0.000 n=8+8)
_UFlat3-2          641MB/s ± 0%   1091MB/s ± 0%   +70.19%  (p=0.000 n=10+10)
_UFlat4-2         6.66GB/s ± 1%  10.53GB/s ± 1%   +58.19%  (p=0.000 n=10+10)
_UFlat5-2          655MB/s ± 0%   1362MB/s ± 1%  +107.80%  (p=0.000 n=9+10)
_UFlat6-2          267MB/s ± 0%    547MB/s ± 0%  +104.82%  (p=0.000 n=10+9)
_UFlat7-2          255MB/s ± 0%    521MB/s ± 1%  +103.89%  (p=0.000 n=10+10)
_UFlat8-2          281MB/s ± 0%    574MB/s ± 0%  +104.14%  (p=0.000 n=8+7)
_UFlat9-2          241MB/s ± 0%    478MB/s ± 0%   +97.97%  (p=0.000 n=10+10)
_UFlat10-2         896MB/s ± 0%   1914MB/s ± 2%  +113.64%  (p=0.000 n=10+10)
_UFlat11-2         371MB/s ± 0%    715MB/s ± 0%   +92.72%  (p=0.000 n=10+9)
_ZFlat0-2          296MB/s ± 1%    754MB/s ± 5%  +154.57%  (p=0.000 n=10+9)
_ZFlat1-2          194MB/s ± 0%    400MB/s ± 0%  +106.63%  (p=0.000 n=10+8)
_ZFlat2-2         9.35GB/s ± 0%  12.92GB/s ± 0%   +38.17%  (p=0.000 n=8+10)
_ZFlat3-2         80.3MB/s ± 0%  445.6MB/s ± 0%  +454.64%  (p=0.000 n=10+10)
_ZFlat4-2         2.03GB/s ± 0%   6.54GB/s ± 1%  +222.19%  (p=0.000 n=10+9)
_ZFlat5-2          292MB/s ± 0%    733MB/s ± 0%  +151.25%  (p=0.000 n=9+9)
_ZFlat6-2          135MB/s ± 0%    284MB/s ± 0%  +110.05%  (p=0.000 n=10+9)
_ZFlat7-2          130MB/s ± 0%    265MB/s ± 0%  +103.38%  (p=0.000 n=10+10)
_ZFlat8-2          141MB/s ± 0%    299MB/s ± 0%  +112.30%  (p=0.000 n=9+10)
_ZFlat9-2          124MB/s ± 0%    247MB/s ± 0%   +98.90%  (p=0.000 n=10+10)
_ZFlat10-2         350MB/s ± 0%    967MB/s ± 3%  +176.44%  (p=0.000 n=10+10)
_ZFlat11-2         189MB/s ± 0%    426MB/s ± 0%  +124.65%  (p=0.000 n=10+10)

AWSjswinney commented 4 years ago

To make code review easier, you can also see individual commits which show changes from the amd64 implementation. Changes for encode, decode, and a branch with both.

dgryski commented 4 years ago

Has this implementation been tested with go-fuzz?

AWSjswinney commented 4 years ago

I tested with go-fuzz and found an issue, which I have now fixed. Let me know if you want the commits squashed or left separate.

Thanks for the tip on go-fuzz!

AWSjswinney commented 4 years ago

I have run -func Fuzz for about 140 million execs and -func FuzzFraming for about 25 million execs so far.

nigeltao commented 4 years ago

Squashed commit please.

nigeltao commented 4 years ago

arm64 requires 8 additional bytes of stack

I'm sure you're right, but out of curiosity, can you remind me why the ARM64 calling convention requires an extra 8 bytes compared to AMD64?

AWSjswinney commented 4 years ago

arm64 requires 8 additional bytes of stack

I'm sure you're right, but out of curiosity, can you remind me why the ARM64 calling convention requires an extra 8 bytes compared to AMD64?

Thanks for asking. I did some research to confirm that assertion:

According to the Procedure Call Standard for the Arm, the stack pointer must be quad-word aligned, i.e. sp % 16 == 0.

Additionally, from the next section on the frame pointer:

The lowest addressed double-word shall point to the previous frame record

That means that sp+0 will contain the previous frame record pointer and callee args will begin at sp+8.

nigeltao commented 4 years ago

Thanks!

If you're curious and have more spare time than I do... one low-priority thing I was meaning to experiment with at some point was replacing the runtime·memmove call (and the state spillage) with a REP MOVSB (on x86_64, obviously) or an arm64 equivalent. The actual copy might not be as fast but you'd avoid the function call inside the loop. It may or may not be a net win. Feel free to play with that idea.

golang / snappy

port amd64 assembly to arm64 #53