golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.35k stars 17.58k forks source link

math/bits: add extended precision Add, Sub, Mul, Div #24813

Closed bmakam-qdt closed 5 years ago

bmakam-qdt commented 6 years ago

This is a proposal to speed up crypto/poly1305 using [u]int128 and multiword arithmetic. Arm64 has instructions that let you multiply two 64-bit registers to two 64-bit registers. It also has instructions for multiword addition. I have implemented some of these intrinsics and intrinsified them in https://go-review.googlesource.com/c/go/+/106376 which improved the performance of crypto/poly1305 by ~30% on arm64 (Amberwing). I have added these intrinsics for arm64 in poly1305 package but they might benefit other platforms as well. I am seeking advice on the design of this implementation. Is poly1305 package the right place to have these intrinsics or should they go in math/big or math/bit? This might also be a use case for #9455

FiloSottile commented 6 years ago

amd64 also has these instructions, and access to them is one of the major blockers to writing performant crypto in Go instead of assembly.

I think the best place for them would be math/bits. Can you give us a preview of the exported APIs?

/cc @vkrasnov @gtank (they probably have opinions)

mvdan commented 6 years ago

/cc @mundaym @TocarIP too, perhaps

bmakam-qdt commented 6 years ago

MulWW - this is same as math/big.mulWW AddQQ - adds two uint128. Each uint128 is represented as a pair of uint64 (hi, lo) returns uint128(hi,lo) AddQW - adds a uint128 and uint64 returns uint128(hi, lo) ShrQ44 - shiftright uint128 with constant 44 returns uint64 ShrQ42 - shiftright uint128 with constant 42 returns uint64 The arm64 assembly for these APIs are in https://go-review.googlesource.com/c/go/+/106376/3/src/vendor/golang_org/x/crypto/poly1305/asm_arm64.s ShrQ44 and ShrQ42 can be combined into one API I think.

vkrasnov commented 6 years ago

This is useful, no doubt. Intel can definitely benefit too. I'd put them all in math/big. The ShrQXX should probably be generic?

bmakam-qdt commented 6 years ago

arm64 has an API EXTRconst which can be used to write a rule like (ShrQconst [c] x y) -> EXTRconst [c] x y). If other targets also have EXTRconst this can be generic.

FiloSottile commented 6 years ago

I don't like the idea of polluting math/big. It already has a million godoc entries, but at least they all deal in Int/Float/Rat, while these would split a fixed size uint128 in hi/lo uint64.

vkrasnov commented 6 years ago

arm64 has an API EXTRconst which can be used to write a rule like (ShrQconst [c] x y) -> EXTRconst [c] x y). If other targets also have EXTRconst this can be generic.

Intel should have SHRD.

I don't like the idea of polluting math/big

Maybe create something new? like math/uint128

mvdan commented 6 years ago

Is there any chance of writing this in regular Go code without any extra APIs, and having the compiler recognise the pattern to optimise?

bmakam-qdt commented 6 years ago

are you suggesting a builtin [u]int128 type like how GCC does: https://godbolt.org/g/o6ka5e ?

FiloSottile commented 6 years ago

@mvdan Not really. The high bits of uint64 * uint64 are simply discarded. We would need uint128 in the language. (#9455)

vkrasnov commented 6 years ago

That is probably too big a change for a very niche use case. I think that less than 0.1% of Go projects would need these.

josharian commented 6 years ago

If anywhere these would probably belong in math/bits, not math/big. I think some of them were even mooted during the original math/bits API discussion. (On phone so don’t have issue link handy.)

randall77 commented 6 years ago

math/bits seems like the right place. I think you only need the following building blocks:

func Mul(x, y uint64) (uint64, uint64) // 64x64 -> 128 multiply (low, then high)
func Add(x, y uint64) (uint64, uint64) // 64x64 -> 65 add (low, then high)

(and maybe signed versions?) Everything else can be written in Go.

Even Add is not really necessary, but might make it easier for the compiler to detect the carry. (In Go, the carry is low_result < low_x || low_result < low_y or something.)

The expression you would use for a multiword shift should be directly detectable by the compiler. (It doesn't currently, but it should be straightforward.)

FiloSottile commented 6 years ago

I think AddQQ and AddQW have uint128 inputs, not uint64.

Agreed on not being useful enough to change the language. I don't hate math/uint128, but if we can boil them down to a couple API in math/bits and detecting some patterns, that's better.

cherrymui commented 6 years ago

If we have 64x64 -> 128, we might probably also have 32x32 -> 64 on 32-bit architectures. On 64-bit architectures this would be trivial.

randall77 commented 6 years ago

Can't you already get a 32x32->64 multiply on 32-bit archs? 386 can, at least:

func f(x, y uint32) (uint32, uint32) {
    z := uint64(x) * uint64(y)
    return uint32(z), uint32(z >> 32)
}

compiles to

    0x0012 00018 (/home/khr/go/tmp1.go:3)   MOVL    "".y+8(SP), AX
    0x0016 00022 (/home/khr/go/tmp1.go:3)   MOVL    "".x+4(SP), CX
    0x001a 00026 (/home/khr/go/tmp1.go:4)   MULL    CX
    0x001c 00028 (/home/khr/go/tmp1.go:4)   MOVL    AX, "".~r2+12(SP)
    0x0020 00032 (/home/khr/go/tmp1.go:4)   MOVL    DX, "".~r3+16(SP)
FiloSottile commented 6 years ago

32x32 -> 64 can be handled by pattern matching thanks to uint64. This is about filling the uint128 void.

cherrymui commented 6 years ago

You are right. 32x32->64 can already be handled.

ericlagergren commented 6 years ago

previous conversation about uint128: https://github.com/golang/go/issues/9455

TocarIP commented 6 years ago

I like an idea about exposing mulWW and divWW, which we already intrinsify, for general use, but I'm not sure about right package. IIRC there was Gophercon talk about optimizing 64x64->128 multiplication, so this looks like something useful for others. This should also help hash/fnv128, which currently does ugly stuff like this:

                 s1l := (s[1] & 0xffffffff) * prime128Lower
                 s1h := (s[1] >> 32) * prime128Lower
                 s0l := (s[0]&0xffffffff)*prime128Lower + (s[1]&0xffffffff)<<prime128Shift
                 s0h := (s[0]>>32)*prime128Lower + (s[1]>>32)<<prime128Shift
                 // Carries
                 s1h += s1l >> 32
                 s0l += s1h >> 32
                 s0h += s0l >> 32
                 // Update the values
                 s[1] = (s1l & 0xffffffff) + (s1h << 32)
                 s[0] = (s0l & 0xffffffff) + (s0h << 32)
bmakam-qdt commented 6 years ago

The expression you would use for a multiword shift should be directly detectable by the compiler. (It doesn't currently, but it should be straightforward.)

I think we might not need the intrinsic for multiword shiftright because we can create a pattern that can be detected by the compiler, something like this: ShrQconst(42, z[5], z[4]) -> (z5<<22 | z4>>42)

I think AddQQ and AddQW have uint128 inputs, not uint64.

Yes, AddQQ takes two uint128 i.e two pairs of uint64 and outputs the result to uint128(hi, lo) and AddQW takes a uint128 and uint64 and outputs the result to uint128(hi, lo).

gtank commented 6 years ago

I am strongly in favor of doing this in some form.

If it's going to be explicit functions, then math/bits or a dedicated math/{intrinsic, uint128, multiword} is a better place for it than math/big, since we care a lot about the details of the underlying representation here.

For crypto, my dream API would give me direct mappings to the basic func: uint64 x uint64 -> uint64 x uint64 functions (mulq, addq, etc for amd64), then have smart pattern matching do the carry/flag/pipeline-aware instruction optimizations when possible (think add/adc, or the newer mulx/adcx/adox). There's an example of this being a huge win in math/big already and Intel wrote a guide for how to apply those instructions to patterns that are common in crypto code.

gopherbot commented 6 years ago

Change https://golang.org/cl/106376 mentions this issue: cmd/compile: add intrinsics for multiword arithmetic

bmakam-qdt commented 6 years ago

Sorry, I jumped the gun and assumed we have a consensus on math/bits as the right package for the APIs and took a stab at it in CL106376. My sincere apologies.

jimmyfrasche commented 6 years ago

I strongly disagree about putting these in math/bits: these are not operations on bits. They may involve operations on bits, but ultimately so does everything.

I really do not want to see math/bits turn into a junk drawer.

The proposed operations are a new class of operations and they deserve their own package for ease of discovery, to make reading code that uses these operations clear, and so that they can have focused documentation.

randall77 commented 6 years ago

I don't see a need for both AddQW and AddQQ. AddQW(x1, x0, y) = AddQQ(x1, x0, 0, y). The compiler should have no problem constant folding the 0. Just have a single Add. In any case, I don't like Q and W. What do they stand for? (quadword and word? That makes no sense here.)

josharian commented 6 years ago

The proposed operations are a new class of operations and they deserve their own package

math/medium? :)

cherrymui commented 6 years ago

I agree with @jimmyfrasche.

I think math/big makes some sense. They are "big" integers, just not arbitrarily big, but still bigger than uint64.

FiloSottile commented 6 years ago

Agreed on just having Add (and Mul) and detecting constant 0. Let's not leak the instruction names.

They definitely don't fit in math/big. I see the concern for math/bits becoming math/intrinsicutils, but I am not convinced that any other package name would grow past two functions, or have the same issue. And this is about multiplication of high and low bits, even if it's a stretch.

FiloSottile commented 6 years ago

BTW, @bmakam-qdt thanks for doing the prototyping work. It's very valuable even if the proposal is not finalized yet!

jimmyfrasche commented 6 years ago

@josharian I didn't say coming up with a good name would be easy—just that it's the right thing to do :þ

Can methods be compiler intrinsics? A lot of the naming and documentation could be simplified if it were something like

type Uint struct {
  Hi, Lo uint64
}
func (a Uint) Mul(b Uint) Uint 

Maybe then math/math128? The import stutters a bit and the name is slightly unwieldy but it would only be needed for the types and related factories. It could contain math128.Uint, math128.Int, and perhaps even math128.Float.

That's not a great name but hopefully it spurs a better one.

ianlancetaylor commented 6 years ago

These can't be in math/big. Personally I think math/bits is OK but I respect the objections. math/math128 is not bad but sounds to me like a place for the 128-bit integer types, which isn't quite what this is. It seems to me that we are going for is basically double word operations. Unfortunately C suggests that double is a floating point value. Perhaps math/wide? wide.Mul?

gtank commented 6 years ago

FWIW, I also think they're not a good fit for math/bits since they're about integer semantics rather than the literal bits.

What does everyone think of math/intrinsic with the chance to handle future needs of this type there too?

package intrinsic

type Uint128 struct {
 Hi, Lo uint64
}
func Mul(a, b uint64) Uint128
josharian commented 6 years ago

Perhaps math/wide? wide.Mul?

I like math/wide; it does double duty, because (a) the types are wide and (b) it evokes "widening arithmetic".

Is there any danger of wanting 256+ bit arithmetic anytime soon? If so, we might want Mul128. Or maybe Mul64, since the inputs are 64 bit?

cc @kortschak @btracey in case they want to weigh in from a scientific computing perspective. (This has all been crypto-oriented so far.)

TocarIP commented 6 years ago

I don't like math/intrinsic it exposes implementation details of gc and by that logic we will need to move most of math/bits to math/intrinsic because we already intrinsify most of it.

On an unrelated note what about 128-bit subtraction? Just express it as an add with -x?

FiloSottile commented 6 years ago

I'm not sure the Uint128 type would let the kind of code this is for flow well. What is now x.Hi can later be something else and now you have to assign it a new name. And it might not play well with SSA (arrays don't). Let's not overcomplicate it, this is for so-low-level-that-it-was-going-to-be-assembly code.

No on math/intrinsic too, they won't be intrinsified everywhere, and it's a "utils"-like name.

josharian commented 6 years ago

(Another advantage to math/wide is that it makes it obvious where saturating arithmetic operations live, should we decide to add them later: math/saturate or maybe just math/sat.)

btracey commented 6 years ago

Looping in @kunde21 and @vladimir-ch because they have spent a bunch of time with our assembly implementations. I believe the short comment is that we would really love a way to easily use SSE / AVX (for float64). This is partly due to the direct speedups from using the types, but also that it would be great for the compiler to know these instructions so inlining can happen within the context of bigger blocks.

jimmyfrasche commented 6 years ago

The biggest issues with math/wide is someone looking at the package index for the first time and thinking "wait, what's wide and what's big?" but the doc synopsis would cover that and that's good enough for me.

My main point with introducing the types is that it would allow the operations on those types to be namespaced so there wouldn't need to be MulU128, MullI128, MulF128, etc.—they could just be T.Mul(T) for T in {Int, Uint, Float}.

bcmills commented 6 years ago

For naming, maybe math/split (as in, “a uint128 split into two uint64s”)?

But it seems like if this is worthwhile, we should just do #9455 instead. Intrinsics would require compiler support either way, and a built-in type seems much more ergonomic (less risk of accidental transposition, and more similarity to integer operations at other sizes).

ericlagergren commented 6 years ago

Mul128, type Uint128 struct{Hi, Lo uint64}, and such seem like a poor substitute for a proper uint128 type. I mean, if it's being intrinsified the compiler is already recognizing the routine, so I fail to see the difference between that and adding uint128 as a builtin. That could also be my ignorance showing, though.

rsc commented 6 years ago

this is for so-low-level-that-it-was-going-to-be-assembly code.

How much faster is the assembly code? Usually in these packages writing assembly (for larger sequences) tends to expose other optimization opportunities as well.

bmakam-qdt commented 6 years ago

@rsc This improved the performance of crypto/poly1305 by ~30% on arm64 (Amberwing). There is more optimization opportunity if we use neon/simd instructions.

rsc commented 6 years ago

I could see Add64 (64,64 -> 64,1) and Mul64 (64,64 -> 64,64), but beyond that I don't see much that's generally important. Hard-coding higher-level things like 64,64,64,64 -> 64,64 for 128-bit add seems like going too far.

My question was not how much faster it is to have intrinsics. It was how much additional speed you get from assembly compared to limited intrinsics.

bmakam-qdt commented 6 years ago

Agreed on just having Add64 and Mul64.

how much additional speed you get from assembly compared to limited intrinsics.

Not sure if I understood this correctly, but if we don't intrinsify these intrinsics, the assembly version is 50% faster.

rsc commented 6 years ago

Based on discussion with @agl and @FiloSottile:

The argument for adding these is not so much for the mainline architectures where we have people to write assembly for larger operations, but instead for the less popular architectures, where intrinsics let us use portable code that executes faster. It sounds like the basic operations are:

and

For shift it seems like pattern-matching is fine: L/R Shift 128/n -> 64 for example is(x<<n) | (y>>(64-n))

It would be nice to have, say, three specific algorithms to use as test cases and make sure that this set is useful. We don't want to add every last extra thing, but we do want to make sure it covers enough to help.

If someone can identify those algorithms, I'm OK with accepting this.

These would go in math/bits.

gtank commented 6 years ago

One good example is x/crypto/ed25519. It's currently a 32-bit implementation on all platforms. I have a 64-bit version with both pure Go and assembly math, and the mul64 emulation overhead makes the Go slower than the existing code. That should not be true, and intrinsic calls should fix it.

FiloSottile commented 6 years ago

Also based on discussion with @rsc, these are going in math/bits.

bmakam-qdt commented 6 years ago

x/crypto/poly1305 is another example that covers Mul64, Add64 and L/R Shift 128/n -> 64

bmkessler commented 6 years ago

@rsc What is the rationale for not including the uint-sized versions of Mul and Div as well? Those are both internal primitives in math/big.

// A Word represents a single digit of a multi-precision unsigned integer.
type Word uint
func mulWW(x, y Word) (z1, z0 Word)
func divWW(x1, x0, y Word) (q, r Word)