golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.01k stars 17.54k forks source link

crypto/cipher: easy optimisations to xorBytes #53023

Open jech opened 2 years ago

jech commented 2 years ago

The function crypto/cipher.xorBytes implements the XOR operations between slices of bytes, and is heavily used by encryption in CTR mode. The version in the stdlib has seen a significant amount of optimisation, but some easy optimisations are missing:

  1. The variable supportUnaligned in xor_generic.go should be set for GOARCH equal to arm64.
  2. Said variable sohould be set for GOARCH equal to ARM and GOARM equal to 6 or 7 (but not 5).
  3. When supportsUnaligned is not set, the fast version should still be called when both src and dst are multiples of wordSize (this actually happens fairly often, e.g. when encrypting a freshly allocated slice);
  4. When supportsUnaligned is not set, and src and dst are equal modulo wordSize, then the initial bytes should be xored and then the fast version called on the rest of the array;
  5. Ideally, there should be a NEON implementation for ARMv7.

Points 1 through 3 are fairly trivial, and would bring a significant part of the benefit. This is related to #53021.

adriancable commented 2 years ago

@jech - point 1 is unnecessary because arm64 is excluded via the go:build line in xor_generic.go (due to the presence of the arm64 neon assembly implementation).

But an emphatic yes to the other points. Especially point 2 where we are currently leaving significant gains on the table on a performance critical function on ARMv6/7 where unaligned access is OK.

There are frequent posts regarding golang’s slow crypto performance on arm32 and this would be a low hanging fruit change with big gains.

josharian commented 2 years ago

cc @bradfitz

bcmills commented 2 years ago

See also #35381.

adriancable commented 2 years ago

@bcmills / @bradfitz / @jech - I have written both NEON and non-NEON versions of xorBytes for 32-bit ARM. The performance increase over the 'slow' version in xor_generic.go is significant, between 4X and 20X.

Given that 32-bit ARM remains one of the most popular IoT device platforms, it's a shame that 32-bit ARM is the sole unoptimised platform for 32-bit ARM in the crypto library for xorBytes, which can be a bottleneck function for crypto.

So, I would be very happy to create a PR here to contribute optimised 32-bit NEON and non-NEON ARM implementations of xorBytes. This could be a significant deal for crypto performance in general on 32-bit ARM. But I would appreciate a signal first that there is the appetite to accept such a PR, so it does not end up in /dev/null.

gopherbot commented 2 years ago

Change https://go.dev/cl/409394 mentions this issue: crypto/cipher: add optimized assembly xorBytes for ARM (NEON + non-NEON)