bmakam-qdt closed this issue 5 years ago
amd64 also has these instructions, and access to them is one of the major blockers to writing performant crypto in Go instead of assembly.
I think the best place for them would be math/bits. Can you give us a preview of the exported APIs?
/cc @vkrasnov @gtank (they probably have opinions)
/cc @mundaym @TocarIP too, perhaps
MulWW - the same as math/big.mulWW.
AddQQ - adds two uint128 values. Each uint128 is represented as a pair of uint64 (hi, lo); returns uint128 (hi, lo).
AddQW - adds a uint128 and a uint64; returns uint128 (hi, lo).
ShrQ44 - shifts a uint128 right by the constant 44; returns uint64.
ShrQ42 - shifts a uint128 right by the constant 42; returns uint64.
The arm64 assembly for these APIs is in https://go-review.googlesource.com/c/go/+/106376/3/src/vendor/golang_org/x/crypto/poly1305/asm_arm64.s. ShrQ44 and ShrQ42 can be combined into one API, I think.
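For reference, the semantics of these operations can be modeled in pure Go (a hedged sketch; the names follow the comment above, and the 32-bit-limb multiply mirrors math/big's portable mulWW):

```go
package main

import "fmt"

// mulWW returns the 128-bit product of x and y as (hi, lo),
// built from 32-bit halves, since a pure-Go uint64 multiply
// discards the high 64 bits.
func mulWW(x, y uint64) (hi, lo uint64) {
	const mask = 1<<32 - 1
	x0, x1 := x&mask, x>>32
	y0, y1 := y&mask, y>>32
	w0 := x0 * y0
	t := x1*y0 + w0>>32
	w1 := t&mask + x0*y1
	hi = x1*y1 + t>>32 + w1>>32
	lo = x * y
	return
}

// addQQ adds two uint128 values, each given as a (hi, lo) pair.
func addQQ(xh, xl, yh, yl uint64) (hi, lo uint64) {
	lo = xl + yl
	var carry uint64
	if lo < xl { // the low word wrapped around
		carry = 1
	}
	hi = xh + yh + carry
	return
}

// shrQ44 shifts a uint128 (hi, lo) right by 44, returning the low 64 bits.
func shrQ44(hi, lo uint64) uint64 {
	return hi<<20 | lo>>44
}

func main() {
	fmt.Println(mulWW(1<<63, 4)) // 2 0
}
```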
This is useful, no doubt. Intel can definitely benefit too. I'd put them all in math/big. The ShrQXX should probably be generic?
arm64 has an EXTRconst op which can be used to write a rule like (ShrQconst [c] x y) -> (EXTRconst [c] x y). If other targets also have EXTRconst this can be generic.
I don't like the idea of polluting math/big. It already has a million godoc entries, but at least they all deal in Int/Float/Rat, while these would split a fixed size uint128 in hi/lo uint64.
arm64 has an EXTRconst op which can be used to write a rule like (ShrQconst [c] x y) -> (EXTRconst [c] x y). If other targets also have EXTRconst this can be generic.
Intel should have SHRD.
I don't like the idea of polluting math/big
Maybe create something new, like math/uint128?
Is there any chance of writing this in regular Go code without any extra APIs, and having the compiler recognise the pattern to optimise?
Are you suggesting a builtin [u]int128 type like GCC has: https://godbolt.org/g/o6ka5e ?
@mvdan Not really. The high bits of uint64 * uint64 are simply discarded. We would need uint128 in the language. (#9455)
That is probably too big a change for a very niche use case. I think that less than 0.1% of Go projects would need these.
If anywhere these would probably belong in math/bits, not math/big. I think some of them were even mooted during the original math/bits API discussion. (On phone so don’t have issue link handy.)
math/bits seems like the right place.
I think you only need the following building blocks:
func Mul(x, y uint64) (uint64, uint64) // 64x64 -> 128 multiply (low, then high)
func Add(x, y uint64) (uint64, uint64) // 64+64 -> 65 add (low, then high)
(and maybe signed versions?) Everything else can be written in Go.
Even Add is not really necessary, but it might make it easier for the compiler to detect the carry. (In Go, the carry is low_result < low_x || low_result < low_y, or something like that.)
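Either comparison alone is in fact enough, since an overflowed sum is smaller than both inputs. A minimal sketch of carry detection in pure Go (add64 is an illustrative name, not an existing API):

```go
package main

import "fmt"

// add64 adds x and y, returning the 64-bit sum and a carry bit.
// One comparison suffices: the truncated sum wrapped around
// if and only if it is smaller than either input.
func add64(x, y uint64) (sum, carry uint64) {
	sum = x + y
	if sum < x {
		carry = 1
	}
	return
}

func main() {
	fmt.Println(add64(^uint64(0), 1)) // 0 1
}
```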
The expression you would use for a multiword shift should be directly detectable by the compiler. (It doesn't currently, but it should be straightforward.)
I think AddQQ and AddQW have uint128 inputs, not uint64.
Agreed on not being useful enough to change the language. I don't hate math/uint128, but if we can boil them down to a couple API in math/bits and detecting some patterns, that's better.
If we have 64x64 -> 128, we should probably also have 32x32 -> 64 on 32-bit architectures. On 64-bit architectures the latter is trivial.
Can't you already get a 32x32->64 multiply on 32-bit archs? 386 can, at least:
func f(x, y uint32) (uint32, uint32) {
z := uint64(x) * uint64(y)
return uint32(z), uint32(z >> 32)
}
compiles to
0x0012 00018 (/home/khr/go/tmp1.go:3) MOVL "".y+8(SP), AX
0x0016 00022 (/home/khr/go/tmp1.go:3) MOVL "".x+4(SP), CX
0x001a 00026 (/home/khr/go/tmp1.go:4) MULL CX
0x001c 00028 (/home/khr/go/tmp1.go:4) MOVL AX, "".~r2+12(SP)
0x0020 00032 (/home/khr/go/tmp1.go:4) MOVL DX, "".~r3+16(SP)
32x32 -> 64 can be handled by pattern matching thanks to uint64. This is about filling the uint128 void.
You are right. 32x32->64 can already be handled.
previous conversation about uint128: https://github.com/golang/go/issues/9455
I like the idea of exposing mulWW and divWW, which we already intrinsify, for general use, but I'm not sure about the right package. IIRC there was a GopherCon talk about optimizing 64x64->128 multiplication, so this looks like something useful for others. This should also help hash/fnv128, which currently does ugly stuff like this:
s1l := (s[1] & 0xffffffff) * prime128Lower
s1h := (s[1] >> 32) * prime128Lower
s0l := (s[0]&0xffffffff)*prime128Lower + (s[1]&0xffffffff)<<prime128Shift
s0h := (s[0]>>32)*prime128Lower + (s[1]>>32)<<prime128Shift
// Carries
s1h += s1l >> 32
s0l += s1h >> 32
s0h += s0l >> 32
// Update the values
s[1] = (s1l & 0xffffffff) + (s1h << 32)
s[0] = (s0l & 0xffffffff) + (s0h << 32)
The expression you would use for a multiword shift should be directly detectable by the compiler. (It doesn't currently, but it should be straightforward.)
I think we might not need the intrinsic for multiword shiftright because we can create a pattern that can be detected by the compiler, something like this:
ShrQconst(42, z[5], z[4]) -> (z[5]<<22 | z[4]>>42)
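That pattern can be written today as ordinary Go that a compiler could recognize and lower to a single EXTR/SHRD-style instruction (the function name is illustrative):

```go
package main

import "fmt"

// shr128by42 returns the low 64 bits of the 128-bit value (hi, lo)
// shifted right by 42: the (hi<<22 | lo>>42) pattern a compiler
// could match to one double-width shift instruction.
func shr128by42(hi, lo uint64) uint64 {
	return hi<<22 | lo>>42
}

func main() {
	fmt.Println(shr128by42(1, 0)) // 4194304
}
```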
I think AddQQ and AddQW have uint128 inputs, not uint64.
Yes, AddQQ takes two uint128 values, i.e. two pairs of uint64, and returns the result as a uint128 (hi, lo); AddQW takes a uint128 and a uint64 and likewise returns a uint128 (hi, lo).
I am strongly in favor of doing this in some form.
If it's going to be explicit functions, then math/bits or a dedicated math/{intrinsic, uint128, multiword} is a better place for it than math/big, since we care a lot about the details of the underlying representation here.
For crypto, my dream API would give me direct mappings to the basic uint64 x uint64 -> uint64 x uint64 functions (mulq, addq, etc. for amd64), then have smart pattern matching do the carry/flag/pipeline-aware instruction optimizations when possible (think add/adc, or the newer mulx/adcx/adox). There's an example of this being a huge win in math/big already, and Intel wrote a guide for how to apply those instructions to patterns that are common in crypto code.
Change https://golang.org/cl/106376 mentions this issue: cmd/compile: add intrinsics for multiword arithmetic
Sorry, I jumped the gun and assumed we have a consensus on math/bits as the right package for the APIs and took a stab at it in CL106376. My sincere apologies.
I strongly disagree about putting these in math/bits: these are not operations on bits. They may involve operations on bits, but ultimately so does everything.
I really do not want to see math/bits turn into a junk drawer.
The proposed operations are a new class of operations and they deserve their own package for ease of discovery, to make reading code that uses these operations clear, and so that they can have focused documentation.
I don't see a need for both AddQW and AddQQ. AddQW(x1, x0, y) = AddQQ(x1, x0, 0, y). The compiler should have no problem constant folding the 0. Just have a single Add.

In any case, I don't like Q and W. What do they stand for? (quadword and word? That makes no sense here.)
The proposed operations are a new class of operations and they deserve their own package
math/medium? :)
I agree with @jimmyfrasche.
I think math/big makes some sense. They are "big" integers, just not arbitrarily big, but still bigger than uint64.
Agreed on just having Add (and Mul) and detecting constant 0. Let's not leak the instruction names.
They definitely don't fit in math/big. I see the concern for math/bits becoming math/intrinsicutils, but I am not convinced that any other package name would grow past two functions, or have the same issue. And this is about multiplication of high and low bits, even if it's a stretch.
BTW, @bmakam-qdt thanks for doing the prototyping work. It's very valuable even if the proposal is not finalized yet!
@josharian I didn't say coming up with a good name would be easy—just that it's the right thing to do :þ
Can methods be compiler intrinsics? A lot of the naming and documentation could be simplified if it were something like
type Uint struct {
Hi, Lo uint64
}
func (a Uint) Mul(b Uint) Uint
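As a hedged illustration of the method form, here is a pure-Go Add on such a type; the body is only a stand-in for what an intrinsified method would compute, and Mul would similarly build on a 64x64 -> 128 primitive:

```go
package main

import "fmt"

// Uint is the two-word type sketched above.
type Uint struct {
	Hi, Lo uint64
}

// Add returns a+b modulo 2^128, propagating the carry out of
// the low word into the high word.
func (a Uint) Add(b Uint) Uint {
	lo := a.Lo + b.Lo
	hi := a.Hi + b.Hi
	if lo < a.Lo { // low word overflowed
		hi++
	}
	return Uint{hi, lo}
}

func main() {
	fmt.Println(Uint{0, ^uint64(0)}.Add(Uint{0, 1})) // {1 0}
}
```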
Maybe then math/math128? The import stutters a bit and the name is slightly unwieldy, but it would only be needed for the types and related factories. It could contain math128.Uint, math128.Int, and perhaps even math128.Float.
That's not a great name but hopefully it spurs a better one.
These can't be in math/big. Personally I think math/bits is OK, but I respect the objections. math/math128 is not bad, but it sounds to me like a place for the 128-bit integer types, which isn't quite what this is. It seems to me that what we are going for is basically double-word operations. Unfortunately C suggests that double is a floating-point value. Perhaps math/wide? wide.Mul?
FWIW, I also think they're not a good fit for math/bits, since they're about integer semantics rather than the literal bits.
What does everyone think of math/intrinsic, with the chance to handle future needs of this type there too?
package intrinsic
type Uint128 struct {
Hi, Lo uint64
}
func Mul(a, b uint64) Uint128
Perhaps math/wide? wide.Mul?
I like math/wide; it does double duty, because (a) the types are wide and (b) it evokes "widening arithmetic".

Is there any danger of wanting 256+ bit arithmetic anytime soon? If so, we might want Mul128. Or maybe Mul64, since the inputs are 64-bit?
cc @kortschak @btracey in case they want to weigh in from a scientific computing perspective. (This has all been crypto-oriented so far.)
I don't like math/intrinsic: it exposes implementation details of gc, and by that logic we would need to move most of math/bits into math/intrinsic, because we already intrinsify most of it.
On an unrelated note what about 128-bit subtraction? Just express it as an add with -x?
I'm not sure the Uint128 type would let the kind of code this is for flow well. What is now x.Hi can later be something else, and then you have to assign it a new name. And it might not play well with SSA (arrays don't). Let's not overcomplicate it; this is for so-low-level-that-it-was-going-to-be-assembly code.

No on math/intrinsic too: these won't be intrinsified everywhere, and it's a "utils"-like name.
(Another advantage to math/wide is that it makes it obvious where saturating arithmetic operations would live, should we decide to add them later: math/saturate, or maybe just math/sat.)
Looping in @kunde21 and @vladimir-ch because they have spent a bunch of time with our assembly implementations. I believe the short comment is that we would really love a way to easily use SSE/AVX (for float64). This is partly due to the direct speedups from using the types, but also because it would be great for the compiler to know these instructions so inlining can happen within the context of bigger blocks.
The biggest issue with math/wide is someone looking at the package index for the first time and thinking "wait, what's wide and what's big?", but the doc synopsis would cover that, and that's good enough for me.
My main point with introducing the types is that they would allow the operations to be namespaced, so there wouldn't need to be MulU128, MulI128, MulF128, etc.; they could just be T.Mul(T) for T in {Int, Uint, Float}.
For naming, maybe math/split (as in, “a uint128 split into two uint64s”)?
But it seems like if this is worthwhile, we should just do #9455 instead. Intrinsics would require compiler support either way, and a built-in type seems much more ergonomic (less risk of accidental transposition, and more similarity to integer operations at other sizes).
Mul128, type Uint128 struct{Hi, Lo uint64}, and such seem like a poor substitute for a proper uint128 type. I mean, if it's being intrinsified the compiler is already recognizing the routine, so I fail to see the difference between that and adding uint128 as a builtin. That could also be my ignorance showing, though.
this is for so-low-level-that-it-was-going-to-be-assembly code.
How much faster is the assembly code? Usually in these packages writing assembly (for larger sequences) tends to expose other optimization opportunities as well.
@rsc This improved the performance of crypto/poly1305 by ~30% on arm64 (Amberwing). There is more optimization opportunity if we use neon/simd instructions.
I could see Add64 (64,64 -> 64,1) and Mul64 (64,64 -> 64,64), but beyond that I don't see much that's generally important. Hard-coding higher-level things like 64,64,64,64 -> 64,64 for 128-bit add seems like going too far.
My question was not how much faster it is to have intrinsics. It was how much additional speed you get from assembly compared to limited intrinsics.
Agreed on just having Add64 and Mul64.
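A quick way to sanity-check candidate implementations of those two shapes, before any intrinsification, is to compare a pure-Go version against math/big. A hedged sketch, with mul64 as a hypothetical stand-in for the Mul64 shape (64,64 -> 64,64):

```go
package main

import (
	"fmt"
	"math/big"
)

// mul64 is a pure-Go candidate for the Mul64 shape, built from
// 32-bit halves.
func mul64(x, y uint64) (hi, lo uint64) {
	const mask = 1<<32 - 1
	x0, x1 := x&mask, x>>32
	y0, y1 := y&mask, y>>32
	w0 := x0 * y0
	t := x1*y0 + w0>>32
	w1 := t&mask + x0*y1
	return x1*y1 + t>>32 + w1>>32, x * y
}

// checkMul64 cross-checks mul64 against a math/big reference.
func checkMul64(x, y uint64) bool {
	want := new(big.Int).Mul(new(big.Int).SetUint64(x), new(big.Int).SetUint64(y))
	hi, lo := mul64(x, y)
	got := new(big.Int).Lsh(new(big.Int).SetUint64(hi), 64)
	got.Add(got, new(big.Int).SetUint64(lo))
	return want.Cmp(got) == 0
}

func main() {
	fmt.Println(checkMul64(^uint64(0), ^uint64(0))) // true
}
```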
how much additional speed you get from assembly compared to limited intrinsics.
Not sure if I understood this correctly, but if these calls are not intrinsified, the assembly version is 50% faster.
Based on discussion with @agl and @FiloSottile:
The argument for adding these is not so much for the mainline architectures, where we have people to write assembly for larger operations, but instead for the less popular architectures, where intrinsics let us use portable code that executes faster. It sounds like the basic operations are:

func Mul64(x, y uint64) (hi, lo uint64)

and

func Add64(x, y uint64) (sum, carry uint64)

For shift it seems like pattern-matching is fine: L/R shift 128/n -> 64, for example, is (x<<n) | (y>>(64-n)).
It would be nice to have, say, three specific algorithms to use as test cases and make sure that this set is useful. We don't want to add every last extra thing, but we do want to make sure it covers enough to help.
If someone can identify those algorithms, I'm OK with accepting this.
These would go in math/bits.
One good example is x/crypto/ed25519. It's currently a 32-bit implementation on all platforms. I have a 64-bit version with both pure Go and assembly math, and the mul64 emulation overhead makes the Go slower than the existing code. That should not be true, and intrinsic calls should fix it.
Also based on discussion with @rsc, these are going in math/bits.
x/crypto/poly1305 is another example that covers Mul64, Add64 and L/R Shift 128/n -> 64
@rsc What is the rationale for not including the uint-sized versions of Mul and Div as well? Those are both internal primitives in math/big.
// A Word represents a single digit of a multi-precision unsigned integer.
type Word uint
func mulWW(x, y Word) (z1, z0 Word)
func divWW(x1, x0, y Word) (q, r Word)
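For completeness, divWW divides the double-word value x1<<64|x0 by y. A reference model via math/big (a sketch only; it assumes x1 < y so the quotient fits in a single word, as math/big requires):

```go
package main

import (
	"fmt"
	"math/big"
)

// divWW is a reference model of math/big's internal divWW: it
// divides the 128-bit value x1<<64|x0 by y, returning quotient
// and remainder. Precondition: x1 < y, so q fits in one uint64.
func divWW(x1, x0, y uint64) (q, r uint64) {
	x := new(big.Int).Lsh(new(big.Int).SetUint64(x1), 64)
	x.Add(x, new(big.Int).SetUint64(x0))
	quo, rem := new(big.Int).QuoRem(x, new(big.Int).SetUint64(y), new(big.Int))
	return quo.Uint64(), rem.Uint64()
}

func main() {
	fmt.Println(divWW(1, 0, 3)) // 6148914691236517205 1
}
```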
This is a proposal to speed up crypto/poly1305 using [u]int128 and multiword arithmetic. Arm64 has instructions that multiply two 64-bit registers into a 128-bit result held in two 64-bit registers. It also has instructions for multiword addition. I have implemented some of these intrinsics and intrinsified them in https://go-review.googlesource.com/c/go/+/106376, which improved the performance of crypto/poly1305 by ~30% on arm64 (Amberwing). I have added these intrinsics for arm64 in the poly1305 package, but they might benefit other platforms as well. I am seeking advice on the design of this implementation. Is the poly1305 package the right place to have these intrinsics, or should they go in math/big or math/bits? This might also be a use case for #9455.