Closed SChernykh closed 5 years ago
AFAIK, this is only true for very old cuda compute versions, thus irrelevant for mining.
V_MUL_LO_U32, V_MUL_HI_U32 = 16 clock cycles on AMD Vega and RX GPUs V_MUL_U32_U24 = 4 clock cycles on AMD
This is an intentional tradeoff for simplicity. Note that for mining latency doesn't matter, it's all about throughput.
AMD Vega and Polaris have V_MUL_LO_U32 and V_MUL_U32_U24. The 32-bit multiply is 4x longer latency than the 24-bit multiply. I haven't been able to find what the throughput difference is.
Nvidia Pascal has XMAD, which is 16-bit input, 32-bit output multiply-add. A 3 instruction sequence is used to do a full 32-bit multiply: https://groups.google.com/d/msg/maxas-discuss/4rovrjSRzKA/_UCuK9U2BAAJ
There's no full-speed, single-instruction common denominator between the two. Doing either 24-bit or 16-bit multiplies would be full speed on one architecture and require extra emulation code on the other architecture. Using normal 32-bit multiplies is a good compromise that's simple and works reasonably on both architectures.
This is one of the small optimizations that a theoretical ProgPoW ASIC could do to be 10-20% more efficient than existing GPUs. The inner ProgPoW loop expands to about 200 instructions on Pascal. This optimization on average would save <10 instructions.
Digging into the profiling data behind our Understanding ProgPoW blog post I just noticed that Turing appears to have a full-speed IMAD instruction. This allows the inner loop to be about 150 instructions, mostly because the *33 from the merge() takes a single IMAD instead of 2 XMADs.
So a ProgPoW ASIC with this optimization would simply match a GPU you can buy today.
Closing issue since this isn't going to be changed.
An ASIC will be 4 times more efficient with these two operations because a, b are 32-bit integers:
32-bit integer multiplications are inefficient on GPUs because GPUs only have 24-bit wide data path for multiplication. 32-bit MUL is 4 times slower than 24-bit MUL. It's better to use mul24 here.
Side note: it's a shame that OpenCL still doesn't have mul24_hi, but CUDA has it.