philipturner / metal-float64

Emulating double-precision arithmetic on Apple GPUs
MIT License
44 stars, 2 forks

Question: Possible additions/subtractions only 2x slower? #1

Closed oscarbg closed 2 years ago

oscarbg commented 2 years ago

Hi, great project. Wouldn't it be possible for +/- to be only 2x slower, at least if there is an "add with carry" Metal instruction? Thanks.

philipturner commented 2 years ago

The Apple GPU has hardware instructions for 64-bit integers, which I presume are 2x as slow* as 32-bit counterparts (at least with addition). For the 53/54-bit FP64 mantissa, you could zero-extend into 64-bit integers and add there, no add-with-carry needed. However, expected throughput will be lower because you also process exponents, function-call into the library, and check for normals (depending on whether you choose fast or precise).

*Why only 2x slower? You could emulate 64-bit addition in software with ~4 32-bit instructions, so hardware 64-bit math needs to be substantially faster. Thus, it should take 2 cycles.
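To illustrate the zero-extension idea above, here is a minimal C++ sketch (host-side, not Metal, and not taken from the library). The helper names `significand53` and `add_significands` are hypothetical. It assumes both inputs are positive normal doubles whose exponents have already been aligned, and it ignores rounding and renormalization entirely. The point is only that two 53-bit significands sum to at most 54 bits, so a plain 64-bit integer add cannot overflow.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: extract the 53-bit significand of a normal double,
// including the implicit leading 1 bit (bit 52).
inline uint64_t significand53(double x) {
    uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);          // bit-cast without UB
    uint64_t fraction = bits & ((1ULL << 52) - 1); // low 52 stored bits
    return fraction | (1ULL << 52);                // assumes x is normal
}

// Hypothetical helper: add two significands as zero-extended 64-bit integers.
// Assumes the exponents of a and b are equal (already aligned).
// The result fits in 54 bits, so no add-with-carry is required.
inline uint64_t add_significands(double a, double b) {
    return significand53(a) + significand53(b);
}
```

A real FP64 add would still need exponent alignment, sign handling, rounding, and the special-case checks mentioned above, which is where the extra cost comes from.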

For multiplication, you could use the 64-bit mulhi/madhi, which return the upper 64 bits of a 64-bit multiply or multiply-add. Combined with the ordinary low-half multiply, that would give a 106-bit temporary mantissa product, which could feed an FP64 FMA with an exact (unrounded) intermediate product, or simply be right-shifted back to 53 bits. This would theoretically be 4-6x slower than FP32, because you perform two multiplies instead of one, Int64 might be 2x slower than Int32, and integer multiply might take 3 cycles while FP multiply takes 2. I'm not 100% sure about any of these throughput statistics, though.
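As a rough sketch of the 106-bit product described above, the following C++ (host-side, not Metal) computes the high and low halves of a 64x64-bit multiply. `mulhi_u64`, `mul_wide`, and `Product128` are hypothetical names that stand in for what a hardware or library mulhi would provide. Here the high half is built from 32-bit partial products purely for portability.

```cpp
#include <cstdint>

// Hypothetical stand-in for a 64-bit mulhi: upper 64 bits of a * b.
inline uint64_t mulhi_u64(uint64_t a, uint64_t b) {
    uint64_t a_lo = a & 0xFFFFFFFFULL, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFULL, b_hi = b >> 32;

    uint64_t p0 = a_lo * b_lo;  // contributes to bits 0..63
    uint64_t p1 = a_lo * b_hi;  // contributes to bits 32..95
    uint64_t p2 = a_hi * b_lo;  // contributes to bits 32..95
    uint64_t p3 = a_hi * b_hi;  // contributes to bits 64..127

    // Sum the pieces that land in bits 32..63 to find the carry into bit 64.
    uint64_t mid = (p0 >> 32) + (p1 & 0xFFFFFFFFULL) + (p2 & 0xFFFFFFFFULL);
    return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}

// Full 128-bit product as two 64-bit halves.
struct Product128 {
    uint64_t lo;
    uint64_t hi;
};

// For two 53-bit significands the product needs at most 106 bits,
// so hi holds at most 42 significant bits.
inline Product128 mul_wide(uint64_t a, uint64_t b) {
    return Product128{a * b, mulhi_u64(a, b)};
}
```

The two multiplies (low half and high half) are the "two multiplies instead of one" in the throughput estimate above.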

oscarbg commented 2 years ago

Ah yes, I forgot we are talking about FP64 and not INT64 here. Anyway, thanks for the detailed explanation and extra information.

philipturner commented 1 year ago

@oscarbg as I just learned in metal-benchmarks, Metal performs I64 addition through emulation (4 cycles). The repo also shows other I64 math metrics and how to harness them from Metal.

LO_HALF = IADD32: (lower 32 bits)
CARRY = ICMPSEL32: LO_HALF < (either input) ? 1 : 0
HI_HALF = IADD32: (upper 32 bits)
HI_HALF = IADD32: HI_HALF + CARRY
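The four-instruction sequence above can be written out as a small C++ sketch (host-side, not Metal). The `U32Pair` type and `add64_emulated` name are hypothetical. Each statement is annotated with the instruction it is meant to correspond to, which is an assumption about the mapping rather than verified compiler output.

```cpp
#include <cstdint>

// Hypothetical representation of a 64-bit integer as two 32-bit halves.
struct U32Pair {
    uint32_t lo;
    uint32_t hi;
};

// Emulated 64-bit addition using only 32-bit operations.
inline U32Pair add64_emulated(U32Pair a, U32Pair b) {
    uint32_t lo = a.lo + b.lo;                 // IADD32: lower 32 bits (wraps)
    uint32_t carry = (lo < a.lo) ? 1u : 0u;    // ICMPSEL32: detect wraparound
    uint32_t hi = a.hi + b.hi;                 // IADD32: upper 32 bits
    hi += carry;                               // IADD32: propagate the carry
    return U32Pair{lo, hi};
}
```

The carry test works because an unsigned 32-bit sum that wrapped around is always smaller than either input, so comparing against one operand is sufficient.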