Open arsenm opened 3 months ago
@llvm/issue-subscribers-backend-amdgpu
Author: Matt Arsenault (arsenm)
Given that v_mad_u64_u32 is as fast as v_mul_lo_u32 (at least on GFX10+), it's not clear which sequence is better. The gisel sequence uses fewer instructions (ignoring the v_movs to get the inputs in the right place) but probably has longer latency, because each mad depends on the previous one.
It's also using wider registers and a bigger encoding
> and a bigger encoding
Huh? v_mul_lo_u32 is VOP3, just like v_mad_u64_u32.
Weird, it could/should be VOP2
64-bit and wider multiply over-uses v_mad_u64_u32.
There should only be one v_mad_u64_u32. The first one, with the 0 add input, is simply a mul_lo.
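To make the decomposition concrete, here is a small software model of how a 64-bit multiply can be assembled from v_mad_u64_u32-style operations (d = a*b + c, with 32-bit multiplicands and a 64-bit addend). This is an illustrative sketch of the arithmetic only; the helper names are made up, and it is not the exact sequence either selector emits.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1

def mad_u64_u32(a, b, c):
    """Model of v_mad_u64_u32: 32x32->64 multiply plus a 64-bit addend."""
    return (a * b + c) & MASK64

def mul64(x, y):
    """64x64->64 multiply from 32-bit halves, mad-chain style."""
    x_lo, x_hi = x & MASK32, x >> 32
    y_lo, y_hi = y & MASK32, y >> 32
    # First op has a 0 addend: its low half is just mul_lo(x_lo, y_lo).
    t0 = mad_u64_u32(x_lo, y_lo, 0)
    # Fold the cross terms plus the carry-out of t0 into the high half.
    t1 = mad_u64_u32(x_lo, y_hi, t0 >> 32)
    t2 = mad_u64_u32(x_hi, y_lo, t1 & MASK32)
    return (t0 & MASK32) | ((t2 & MASK32) << 32)

# Cross-check against Python's arbitrary-precision multiply.
x, y = 0xDEADBEEFCAFEBABE, 0x123456789ABCDEF0
assert mul64(x, y) == (x * y) & MASK64
```

Note the serial dependence: t1 consumes t0's high half and t2 consumes t1's low half, which is the latency chain mentioned above.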
We have a custom lowering of multiply which tries to use v_mad_u64_u32, which is different from how the DAG handles this. The DAG relies on the default expansion and then reassembles the v_mad_u64_u32 later (and only custom lowers it to select the scalar version when applicable). I was working on improving the default expansion in #97194, but that needs more work.