Open Jorropo opened 11 months ago
cc @golang/compiler @golang/security
We've discussed a more general SIMD approach in #53171 and declined it as "needs more thought".
This seems to qualify as "more thought", but at the same time I don't think it's polished enough either. For instance, we want an arch-independent mechanism, but `//go:vex` seems very arch-dependent.
I am slowly coming around to the conclusion that arch-independent is a lost cause. They’re too different. And once you’re in “I need SIMD” world, you want almost every drop of performance you can get, which will often mean adapting your algorithm to the available instructions in different ways on different arches.
I wrote recently on the Gopher Slack:
TBH, at this point, all I want out of SIMD support is a glorified avo. Use build tags. Write platform-specific code using a math/bits-like interface. That’s it. That way you get maximum power out of each platform, and you don’t have to engage in useless API shenanigans to attempt to unify things that are fundamentally different.
(@klauspost made a suggestion like that at #53171.)
It seems to me that, yeah, this is at least a bit in the same direction as avo, in that the goal is not to create an abstraction layer that lets you write "portable" code. Instead, when you're writing unportable code, the compiler takes care of the things humans are bad at, the things that make us so concerned about large assembly functions. The only thing we do is jump in and specify a handful of specific instructions, leaving it to the compiler to sort out registers and the like. That seems like a very good division of labor, I think?
FWIW, I've also been coming around to a more hybrid SIMD approach that's at least partly arch-dependent. @mknyszek and I have started to sketch a proposal, but the basic idea is to have arch-dependent operations, but a shared arch-independent vector type system (in our proposal, regular Go arrays, but probably with a special element type marking them as vectors). Common, basic operations would be named the same across architectures so that, with the shared vector types, basic vector code could be written portably, but without limiting access to more advanced operations.
I just want to say that I quite like the Vector approach that Zig has: https://ziglang.org/documentation/0.11.0/#Vectors. Something similar might make for a good addition into Go in my opinion.
I would like a general approach, but this is not what I'm proposing here. Something arch-independent either adds lots of complexity or limits us to some lowest common denominator.
Something arch-dependent could be complexity neutral, moving some from `crypto` to `cmd/compile`. No one is complaining that our `.s` files are arch dependent.
I don't like `go:vex` either; we could solve this by making the compiler responsible for coloring the functions. It would be slightly trickier since we might need to transition in and out of vex mode a couple of times within the same function.
One thing I overlooked, which multiple people brought up on Slack, is that assembly isn't preempted; problems like #64417 wouldn't exist anymore.
There are 2X AES throughput improvements available here #42726. However the assembly implementations are too big: https://go-review.googlesource.com/c/go/+/286852/comments/1b1a6e65_4e27a3f8
From the outside, it seems the people able to review crypto CLs are stretched thin when problems like #53142 are still open years later.
On the flip side, I haven't had any problem getting things merged in the compiler; even complex-ish tricky code like bac4e2f241ca8df3d5be6ddf83214b9a681f4086 was reviewed and merged in time. We even had someone show up one day with a whole new pass, and it was merged within weeks in the same release cycle (which is hard to say about crypto things). (It was reverted, but it was fixed and added back, and it's on track for release in go1.22.) https://github.com/golang/go/commits/master/src/cmd/compile/internal/ssa/sccp.go
So the crypto people seem stretched; the compiler people don't look like they are. The crypto people don't like writing assembly; the compiler people write code which writes assembly all day. What if we took the assembly part of crypto and gave it to the compiler people?
Proposal
We could create a package in `internal/intrinsics/$GOARCH/`; the point of it being in `internal` is that we don't need to get the ergonomics perfect and can evolve them. It is surprisingly easy, and adds little complexity, to wire an intrinsic into the compiler (example in https://go-review.googlesource.com/c/go/+/548318 — don't look at the whole CL, it's quite big because it does other things, just the `memclrPointers` definition and rules). While keeping compiler complexity minimal, we could implement AESENC this way. First, create body-less functions:
The compiler would rely on the consumer properly checking CPUID bits. So if I call `amd64.Aesenc` there is no attempt to fall back if it's unsupported; the compiler always emits an `AESENC` instruction.

`ISimd128` would be a magic type the type checker would need to know about. We could also use types like `[16]byte` or `[2]uint64` directly as the simd register; however, this makes the compiler more complex because it becomes responsible for promoting arrays to vector registers.

Then in the `.rules` file we can lower it. Because this is an internal package, I think it is acceptable to attribute the intrinsics' line numbers to the caller.
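A lowering rule could look roughly like this (a hypothetical sketch in the style of the amd64 `.rules` files; the generic op name `AESENCIntrinsic` is invented):

```
// Lower the intrinsic's generic SSA op to the machine instruction.
(AESENCIntrinsic x y) => (AESENC x y)
```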
We already have a solid framework to merge these instructions which operate on registers with ones that operate on memory, by combining a few more `.rules` and `addressingmodes.go`. `regalloc` is already able to handle tricky register situations like `AESENC`, which uses some register as both source and destination.

Memory-wise, the `ISimd128` type would have 16-byte alignment and would be usable in struct fields, so you can store it on the heap or whatever. The compiler would be able to emit a simple `MOVDQA`. The complex part for the compiler is if you are allowed to do indexing inside the simd register type, e.g. writing to a single lane. In that case the compiler would need to know whether it's better to go through memory or be smart (here staying in register land and doing `PINSRQ` would be best). This is solved by not being allowed to do it at all: `type ISimd128 ISimd128` does not allow indexing of inner elements. We would instead expose a `Shuffle64` intrinsic.

For this limited use case I don't think we need to apply generic optimizations; the goal is to make assembly nicer to write, not to write compiler-optimizer code.
For amd64 particularly, all CPUs (except Zen 4) pay a cost when mixing older MMX and SSE with newer VEX and EVEX encodings. VEX and EVEX allow the use of 256-bit and 512-bit registers and the newer AVX, AVX2, and AVX512 families of instructions. To opt in we would add a new directive (probably also limited to the `std` module).

`vex`-marked functions would only be allowed to call other `vex`-marked functions, because SSE is used in almost all functions for zeroing or copying fixed-size elements. This means a `vex` function couldn't call the runtime; for crypto use cases this is fine, since it is customary to make the actual crypto routines use out parameters instead of allocating and returning a result.

When calling a `vex` function from a non-`vex` function, the compiler would inject `vzeroupper` after the call. This is so `vex` functions can call other `vex` functions and pass `ISimd*` arguments and return values through `ymm` and `zmm` registers.

This is different from previous solutions because it does not aim to provide a generic solution. We would still have an implementation for each architecture we want assembly for. This aims at replacing assembly; we wouldn't spend much time adding optimizations in the compiler, instead we would add access to more instructions and let people write these optimizations themselves.
The researched gains over assembly are: