Closed by oscarbg 2 years ago
The Apple GPU has hardware instructions for 64-bit integers, which I presume are 2x as slow* as 32-bit counterparts (at least with addition). For the 53/54-bit FP64 mantissa, you could zero-extend into 64-bit integers and add there, no add-with-carry needed. However, expected throughput will be lower because you also process exponents, function-call into the library, and check for normals (depending on whether you choose fast or precise).
*Why only 2x slower? You could emulate 64-bit addition in software with ~4 32-bit instructions, so hardware 64-bit math needs to be substantially faster. Thus, it should take 2 cycles.
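To make the zero-extension idea concrete, here is a minimal C sketch that runs on a CPU, not Metal code. The helper names `fp64_significand` and `add_significands` are hypothetical, and the sketch assumes both operands already share the same exponent and sign; alignment shifts, rounding, and special values are deliberately left out.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the 53-bit significand of a double,
   zero-extended into a 64-bit integer. Normal numbers get the
   implicit leading 1 restored; subnormals and zero do not. */
static uint64_t fp64_significand(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);               /* reinterpret without UB */
    uint64_t frac = bits & ((1ULL << 52) - 1);    /* low 52 fraction bits */
    uint64_t exp  = (bits >> 52) & 0x7FF;         /* 11-bit biased exponent */
    return exp ? (frac | (1ULL << 52)) : frac;
}

/* Two 53-bit significands sum to at most 54 bits, so one plain 64-bit
   add holds the result with no add-with-carry step. */
static uint64_t add_significands(uint64_t ma, uint64_t mb) {
    return ma + mb;
}
```

For example, 1.0 and 1.5 share an exponent, and their significands sum to 5 * 2^51, which is 2.5 at the same scale before renormalization.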
For multiplication, you could use the 64-bit madhi/mulhi that returns the overflow of a 64-bit add, mul, or fma. That would permit a 106-bit temporary mantissa product, allowing infinitely precise FP64 FMA or just being right-shifted back to 53 bits. This would be theoretically 4-6x slower than FP32, because you perform two multiplies instead of one, Int64 might be 2x slower than Int32, and integer multiply might take 3 cycles while FP multiply takes 2. I'm not 100% sure about any of these throughput statistics, though.
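The two-multiply scheme can be sketched in portable C as follows. This is an illustration under assumptions, not the Metal implementation: `mulhi64_sketch` stands in for a hardware mulhi by building the high half from 32-bit pieces, and `u128_pair` and `mantissa_product` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical container for a 128-bit value as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } u128_pair;

/* Stand-in for a 64-bit mulhi: the upper 64 bits of the full
   128-bit product a * b, assembled from four 32x32 partial products. */
static uint64_t mulhi64_sketch(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;
    uint64_t p0 = a_lo * b_lo;
    uint64_t p1 = a_lo * b_hi;
    uint64_t p2 = a_hi * b_lo;
    uint64_t p3 = a_hi * b_hi;
    /* Sum the terms that overlap bits 32..63 to recover their carry. */
    uint64_t mid = (p0 >> 32) + (uint32_t)p1 + (uint32_t)p2;
    return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}

/* Full product of two 53-bit significands: one ordinary multiply for
   the low half, one mulhi for the high half (up to 106 bits total). */
static u128_pair mantissa_product(uint64_t ma, uint64_t mb) {
    u128_pair p;
    p.lo = ma * mb;                  /* low 64 bits, wraps mod 2^64 */
    p.hi = mulhi64_sketch(ma, mb);   /* remaining high bits */
    return p;
}
```

Keeping both halves is what would let an FMA add its third operand before rounding; a plain multiply would instead shift the pair right and round back to 53 bits.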
ah yes.. forgot we are talking about FP64 and not INT64 here.. anyway thanks for detailed explanation and extra information.. thanks..
@oscarbg as I just learned in metal-benchmarks, Metal performs I64 addition through emulation (4 cycles). The repo also shows other I64 math metrics and how to harness them from Metal.
LO_HALF = IADD32: (lower 32 bits)
CARRY = ICMPSEL32: LO_HALF < (either input) ? 1 : 0
HI_HALF = IADD32: (upper 32 bits)
HI_HALF = IADD32: HI_HALF + CARRY
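The four-instruction sequence above maps directly onto 32-bit C arithmetic. The sketch below is only a CPU-side illustration; the `u64_halves` struct and `add64_emulated` function are hypothetical names, and each statement is annotated with the instruction it mirrors.

```c
#include <stdint.h>

/* Hypothetical container: a 64-bit integer split into 32-bit halves. */
typedef struct { uint32_t lo, hi; } u64_halves;

static u64_halves add64_emulated(u64_halves a, u64_halves b) {
    u64_halves r;
    r.lo = a.lo + b.lo;                        /* IADD32: lower 32 bits, wraps mod 2^32 */
    uint32_t carry = (r.lo < a.lo) ? 1u : 0u;  /* ICMPSEL32: wrapped sum is below either input */
    r.hi = a.hi + b.hi;                        /* IADD32: upper 32 bits */
    r.hi = r.hi + carry;                       /* IADD32: fold in the carry */
    return r;
}
```

The carry test works because an unsigned sum that wraps is always smaller than both inputs, so comparing against either one detects the overflow.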
Hi, great project.. wouldn't it be possible for +/- to be only 2x slower, at least if there is an "add with carry" Metal instruction? thanks..