llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
27.85k stars 11.47k forks source link

[AMDGPU] `int64` modulo-constant `x % 3` and divide-by-constant `x / 3` compile to 80 instructions. #100383

Closed bjacob closed 3 weeks ago

bjacob commented 1 month ago

This is observed with -xhip targeting AMD MI300 (gfx942).

Compiler Explorer link: https://godbolt.org/z/xrfhhaaeY. For completeness, the clang flags are -O3 --cuda-device-only -x hip -nogpuinc -nogpulib --offload-arch=gfx942.

Testcase:

__attribute__((device))
int64_t a(int64_t i) {
    return i % 3;
}

This compiles to 80 instructions.

By contrast, the same testcase with int64_t replaced by int32_t compiles to just 8 instructions.

I was expecting the int64 variant to generate slightly over 2x more instructions than the int32 variant (since the target requires rewriting int64 ops into pairs of int32 ops). Not 10x.

The above Compiler Explorer link shows the same happening with i / 3 instead of i % 3.

llvmbot commented 1 month ago

@llvm/issue-subscribers-backend-amdgpu

Author: Benoit Jacob (bjacob)

This is observed with `-xhip` targeting AMD MI300 (`gfx942`). Compiler Explorer link: https://godbolt.org/z/xrfhhaaeY. For completeness, the clang flags are `-O3 --cuda-device-only -x hip -nogpuinc -nogpulib --offload-arch=gfx942`. Testcase: ```c++ __attribute__((device)) int64_t a(int64_t i) { return i % 3; } ``` This compiles to 80 instructions. By contrast, the same testcase with `int64_t` replaced by `int32_t` compiles to just 8 instructions. I was expecting the `int64` variant to generate slightly over 2x more instructions than the `int32` variant (since the target requires rewriting `int64` ops into pairs of `int32` ops). Not 10x. The above Compiler Explorer link shows the same happening with `i / 3` instead of `i % 3`.
Pierre-vh commented 1 month ago

cc @arsenm @jayfoad this appears to be due to LowerUDIVREM64, is this expected behavior?

arsenm commented 1 month ago

cc @arsenm @jayfoad this appears to be due to LowerUDIVREM64, is this expected behavior?

I'd expect all of the divide-by-constant cases to get caught by the generic combiner handling of them and never reach the custom lowering

Artem-B commented 1 month ago

I'd expect all of the divide-by-constant cases to get caught by the generic combiner handling of them and never reach the custom lowering

Agreed. E.g. NVPTX lowers those division/modulo ops to multiplication by reciprocal: https://godbolt.org/z/MWo9hsMaW

topperc commented 1 month ago

cc @arsenm @jayfoad this appears to be due to LowerUDIVREM64, is this expected behavior?

I'd expect all of the divide-by-constant cases to get caught by the generic combiner handling of them and never reach the custom lowering

Divide by constant might be blocked by MULHS and SMUL_LOHI not being legal.

Pierre-vh commented 1 month ago

cc @arsenm @jayfoad this appears to be due to LowerUDIVREM64, is this expected behavior?

I'd expect all of the divide-by-constant cases to get caught by the generic combiner handling of them and never reach the custom lowering

Divide by constant might be blocked by MULHS and SMUL_LOHI not being legal.

That seems to be the case, and it explains why we don't get that much code when using i32 (which is legal for MULHS).

I think there are two ways to enable this transform for i64:

The first one is nice because it's a target fix with less chances of causing side effects, but it forces a change to TLI. The second one is contained to the AMDGPU backend but might result in i64 MULHS (or MUL_LOHI) being generated in other places where it may be less desirable.

Thoughts @arsenm ?

arsenm commented 1 month ago

Can we just change or remove the legality check?

MaskRay commented 1 month ago

The fix #100723 was temporarily reverted as it broke ARM (ISD::SRL for MVT::v4i64 is TypeSplitVector, triggering an assertion failure).