Inefficient integer multiplications

SChernykh commented 5 years ago

An ASIC will be 4 times more efficient with these two operations because a, b are 32-bit integers:

    case 1: return a * b;
    case 2: return mul_hi(a, b);

32-bit integer multiplications are inefficient on GPUs because GPUs only have 24-bit wide data path for multiplication. 32-bit MUL is 4 times slower than 24-bit MUL. It's better to use mul24 here.

Side note: it's a shame that OpenCL still doesn't have mul24_hi, but CUDA has it.

pallas1 commented 5 years ago

AFAIK, this is only true for very old cuda compute versions, thus irrelevant for mining.

SChernykh commented 5 years ago

V_MUL_LO_U32, V_MUL_HI_U32 = 16 clock cycles on AMD Vega and RX GPUs V_MUL_U32_U24 = 4 clock cycles on AMD

ifdefelse commented 5 years ago

This is an intentional tradeoff for simplicity. Note that for mining latency doesn't matter, it's all about throughput.

AMD Vega and Polaris have V_MUL_LO_U32 and V_MUL_U32_U24. The 32-bit multiply is 4x longer latency than the 24-bit multiply. I haven't been able to find what the throughput difference is.

Nvidia Pascal has XMAD, which is 16-bit input, 32-bit output multiply-add. A 3 instruction sequence is used to do a full 32-bit multiply: https://groups.google.com/d/msg/maxas-discuss/4rovrjSRzKA/_UCuK9U2BAAJ

There's no full-speed, single-instruction common denominator between the two. Doing either 24-bit or 16-bit multiplies would be full speed on one architecture and require extra emulation code on the other architecture. Using normal 32-bit multiplies is a good compromise that's simple and works reasonably on both architectures.

This is one of the small optimizations that a theoretical ProgPoW ASIC could do to be 10-20% more efficient than existing GPUs. The inner ProgPoW loop expands to about 200 instructions on Pascal. This optimization on average would save <10 instructions.

ifdefelse commented 5 years ago

Digging into the profiling data behind our Understanding ProgPoW blog post I just noticed that Turing appears to have a full-speed IMAD instruction. This allows the inner loop to be about 150 instructions, mostly because the *33 from the merge() takes a single IMAD instead of 2 XMADs.

So a ProgPoW ASIC with this optimization would simply match a GPU you can buy today.

ifdefelse commented 5 years ago

Closing issue since this isn't going to be changed.

ifdefelse / ProgPOW

Inefficient integer multiplications #16