Open Jorropo opened 11 months ago
cc @golang/compiler @golang/security
We've discussed a more general SIMD approach in #53171 and declined it as "needs more thought".
This seems to qualify as "more thought", but at the same time I don't think it's polished enough either. For instance, we want an arch-independent mechanism, but `//go:vex` seems very arch-dependent.
I am slowly coming around to the conclusion that arch-independent is a lost cause. They’re too different. And once you’re in “I need SIMD” world, you want almost every drop of performance you can get, which will often mean adapting your algorithm to the available instructions in different ways on different arches.
I wrote recently on the Gopher Slack:
TBH, at this point, all I want out of SIMD support is a glorified avo. Use build tags. Write platform-specific code using a math/bits-like interface. That’s it. That way you get maximum power out of each platform, and you don’t have to engage in useless API shenanigans to attempt to unify things that are fundamentally different.
(@klauspost made a suggestion like that at #53171.)
It seems to me that, yeah, this is at least a bit in the same direction as avo, in that the goal is not to create an abstraction layer that lets you write "portable" code. Instead, when you're writing unportable code, the compiler takes care of the things humans are bad at, the things that make us so concerned about large assembly functions. The only thing we do is jump in and specify a handful of specific instructions, leaving it to the compiler to sort out registers and the like. That seems like a very good division of labor, I think?
FWIW, I've also been coming around to a more hybrid SIMD approach that's at least partly arch-dependent. @mknyszek and I have started to sketch a proposal, but the basic idea is to have arch-dependent operations, but a shared arch-independent vector type system (in our proposal, regular Go arrays, but probably with a special element type marking them as vectors). Common, basic operations would be named the same across architectures so that, with the shared vector types, basic vector code could be written portably, but without limiting access to more advanced operations.
I just want to say that I quite like the Vector approach that Zig has: https://ziglang.org/documentation/0.11.0/#Vectors. Something similar might make for a good addition into Go in my opinion.
I would like a general approach, but this is not what I'm proposing here. Something arch-independent either adds lots of complexity or limits us to some lowest common denominator.
Something arch-dependent could be complexity neutral, moving some from `crypto` to `cmd/compile`. No one is complaining that our `.s` files are arch dependent.
I don't like `go:vex` either; we could solve this by making the compiler responsible for coloring the functions. It would be slightly trickier since we might need to transition in and out of vex mode a couple of times within the same function.
One thing I overlooked, which multiple people brought up on Slack, is that assembly isn't preempted; problems like #64417 wouldn't exist anymore.
There are 2X AES throughput improvements available here #42726. However the assembly implementations are too big: https://go-review.googlesource.com/c/go/+/286852/comments/1b1a6e65_4e27a3f8
From the outside, it seems the people able to review crypto CLs are stretched thin when problems like #53142 are still open years later.
On the flip side, I haven't had any problem getting things merged in the compiler; even complex-ish tricky code like bac4e2f241ca8df3d5be6ddf83214b9a681f4086 was reviewed and merged in time. We even had someone show up one day with a whole new pass, and it was merged within weeks in the same release cycle (which is hard to say about crypto things). (It was reverted, but it was fixed and added back, and it's on track for release in go1.22.) https://github.com/golang/go/commits/master/src/cmd/compile/internal/ssa/sccp.go
So the crypto people seem stretched; the compiler people don't look like they are. The crypto people don't like writing assembly; the compiler people write code which writes assembly all day. What if we took the assembly part of crypto and gave it to the compiler people?
Proposal
We could create a package in `internal/intrinsics/$GOARCH/`; the point of it being in `internal` is that we don't need to get the ergonomics perfect and can evolve them. It is surprisingly easy, and adds little complexity, to wire an intrinsic into the compiler (example in https://go-review.googlesource.com/c/go/+/548318 — don't look at the whole CL, it's quite big because it does other things, just the `memclrPointers` definition and rules). While keeping compiler complexity minimal, we could implement AESENC this way. First, create body-less functions:
The compiler would rely on the consumer properly checking CPUID bits. So if I call `amd64.Aesenc` there is no attempt to fall back if it's unsupported; the compiler always emits an `AESENC` instruction.

`ISimd128` would be a magic type the type checker would need to know about. We could also use types like `[16]byte` or `[2]uint64` directly as the simd register; however, this makes the compiler more complex because it becomes responsible for promoting arrays to vector registers.

Then in the `.rules` file we can lower it. Because this is an internal package, I think it is acceptable to attribute the intrinsics' line numbers to the caller.
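A lowering rule could look roughly like this (a hypothetical sketch in the style of the amd64 `.rules` files; the generic op name `AESENCIntrinsic` is invented):

```
// Lower the intrinsic's generic SSA op to the machine instruction.
(AESENCIntrinsic x y) => (AESENC x y)
```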
We already have a solid framework to merge these instructions which operate on registers with ones that operate on memory, by combining a few more `.rules` and `addressingmodes.go`. `regalloc` is already able to handle tricky register situations like `AESENC`, which uses some register as both source and destination.

Memory-wise, the `ISimd128` type would have 16-byte alignment and would be usable in struct fields, so you can store it on the heap or whatever. The compiler would be able to emit a simple `MOVDQA`. The complex part for the compiler is if you are allowed to do indexing inside the simd register type, e.g. writing to a single lane. In that case the compiler would need to know whether it's better to go through memory or be smart (here staying in register land and doing `PINSRQ` would be best). This is solved by not being allowed to do it at all: `type ISimd128 ISimd128` does not allow indexing of inner elements. We would instead expose a `Shuffle64` intrinsic.

For this limited use case I don't think we need to apply generic optimizations; the goal is to make assembly nicer to write, not to write compiler-optimizer code.
For amd64 particularly, all CPUs (except Zen 4) pay a cost when mixing older MMX and SSE with newer VEX and EVEX encodings. VEX and EVEX allow the use of 256-bit and 512-bit registers and the newer AVX, AVX2, and AVX512 families of instructions. To opt in we would add a new directive (probably also limited to the `std` module).

`vex`-marked functions would only be allowed to call other `vex`-marked functions, because SSE is used in almost all functions for zeroing or copying fixed-size elements. This means a `vex` function couldn't call the runtime; for crypto use cases this is fine, since it is customary to make the actual crypto routines use out parameters instead of allocating and returning a result.

When calling a `vex` function from a non-`vex` function, the compiler would inject `vzeroupper` after the call. This is so `vex` functions can call other `vex` functions and pass `ISimd*` arguments and return values through `ymm` and `zmm` registers.

This is different from previous solutions because it does not aim to provide a generic solution. We would still have an implementation for each architecture we want assembly for. This aims at replacing assembly; we wouldn't spend much time adding optimizations in the compiler, instead we would add access to more instructions and let people write these optimizations themselves.
The researched gains over assembly are: