Closed bjacob closed 3 weeks ago
@llvm/issue-subscribers-backend-amdgpu
Author: Benoit Jacob (bjacob)
cc @arsenm @jayfoad this appears to be due to LowerUDIVREM64, is this expected behavior?
I'd expect all of the divide-by-constant cases to get caught by the generic combiner handling of them and never reach the custom lowering
Agreed. E.g. NVPTX lowers those division/modulo ops to multiplication by reciprocal: https://godbolt.org/z/MWo9hsMaW
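For reference, here is a minimal C++ sketch of what "multiplication by reciprocal" means for a signed 64-bit divide by 3. The helper names (mulhs64, div3, rem3) are made up for illustration and are not LLVM or HIP APIs. The magic constant is the standard one for signed division by 3, (2^64 + 2) / 3. The sketch assumes a compiler with the __int128 extension (GCC or Clang) and an arithmetic right shift on signed values. It shows the general technique, not the exact sequence any backend emits.

```cpp
#include <cstdint>

// High 64 bits of the signed 128-bit product, i.e. the value an i64 MULHS
// would produce. Hypothetical helper; assumes the __int128 extension.
static inline int64_t mulhs64(int64_t a, int64_t b) {
  return static_cast<int64_t>((static_cast<__int128>(a) * b) >> 64);
}

// n / 3 with truncation toward zero, using a multiply instead of a divide.
static inline int64_t div3(int64_t n) {
  const int64_t magic = 0x5555555555555556LL;  // (2^64 + 2) / 3
  int64_t q = mulhs64(n, magic);               // floor of n * magic / 2^64
  // The floor rounds negative inputs down; adding the sign bit corrects that.
  q += static_cast<int64_t>(static_cast<uint64_t>(n) >> 63);
  return q;
}

// n % 3 with the sign of the dividend, derived from the quotient.
static inline int64_t rem3(int64_t n) {
  return n - div3(n) * 3;
}
```

The whole thing is one high multiply, one shift, and one add (plus a multiply and subtract for the remainder), which is why the legality of MULHS for the type matters so much here.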
Divide by constant might be blocked by MULHS and SMUL_LOHI not being legal.
That seems to be the case, and it explains why we get far less code with i32, for which MULHS is legal.
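To make the cost of a 64-bit high multiply concrete on a target where only 32-bit multiplies are native, here is a hedged C++ sketch that builds it from four 32x32->64 products. The names mulhu64 and mulhs64 are hypothetical, and this is the textbook decomposition, not necessarily what the AMDGPU backend would generate.

```cpp
#include <cstdint>

// High 64 bits of the unsigned 64x64 product, using only 32x32->64 multiplies.
// Hypothetical helper for illustration.
static inline uint64_t mulhu64(uint64_t a, uint64_t b) {
  const uint64_t a_lo = static_cast<uint32_t>(a), a_hi = a >> 32;
  const uint64_t b_lo = static_cast<uint32_t>(b), b_hi = b >> 32;

  const uint64_t lo_lo = a_lo * b_lo;
  const uint64_t hi_lo = a_hi * b_lo;
  const uint64_t lo_hi = a_lo * b_hi;
  const uint64_t hi_hi = a_hi * b_hi;

  // Sum everything that lands in bits [32, 64) of the full product; the
  // upper part of this sum is the carry into the high 64 bits.
  const uint64_t cross = (lo_lo >> 32) + static_cast<uint32_t>(hi_lo) +
                         static_cast<uint32_t>(lo_hi);
  return hi_hi + (hi_lo >> 32) + (lo_hi >> 32) + (cross >> 32);
}

// Signed high multiply derived from the unsigned one: reinterpreting a
// negative operand as unsigned adds 2^64 times the other operand to the
// product, so subtract the other operand from the high half to undo it.
static inline int64_t mulhs64(int64_t a, int64_t b) {
  uint64_t hi = mulhu64(static_cast<uint64_t>(a), static_cast<uint64_t>(b));
  if (a < 0) hi -= static_cast<uint64_t>(b);
  if (b < 0) hi -= static_cast<uint64_t>(a);
  return static_cast<int64_t>(hi);
}
```

That is a handful of multiplies and adds, which is in line with the expectation that the i64 case should cost a small multiple of the i32 case rather than 10x.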
I think there are two ways to enable this transform for i64:
The first one is nice because it's a target fix with less chance of causing side effects, but it forces a change to TLI. The second one is contained to the AMDGPU backend but might result in i64 MULHS (or MUL_LOHI) being generated in other places where it may be less desirable.
Thoughts @arsenm ?
Can we just change or remove the legality check?
The fix #100723 was temporarily reverted as it broke ARM (ISD::SRL for MVT::v4i64 is TypeSplitVector, triggering an assertion failure).
This is observed with -xhip targeting AMD MI300 (gfx942). Compiler Explorer link: https://godbolt.org/z/xrfhhaaeY. For completeness, the clang flags are -O3 --cuda-device-only -x hip -nogpuinc -nogpulib --offload-arch=gfx942.
Testcase:
This compiles to 80 instructions.
By contrast, the same testcase with int64_t replaced by int32_t compiles to just 8 instructions.
I was expecting the int64 variant to generate slightly over 2x more instructions than the int32 variant (since the target requires rewriting int64 ops into pairs of int32 ops). Not 10x.
The above Compiler Explorer link shows the same happening with i / 3 instead of i % 3.