[AMDGPU] generates v_cndmask/lshlrev for uniform select between 0 and a power of 2

Engininja2 commented 5 months ago

Here is my test kernel:

static __global__ void test(const int32_t input, int32_t * result) {
    const int data = input ? 32 : 0;
    int output;

    asm volatile("s_mov_b32 %0, %1" : "=s"(output) : "s"(data));

    result[threadIdx.x] = output;
}

and here's the asm it compiles to for gfx900 on rocm 6.0.0, where it fails to assemble because of an invalid operand.

    s_load_dword s2, s[4:5], 0x0
    s_load_dwordx2 s[0:1], s[4:5], 0x8
    v_lshlrev_b32_e32 v0, 2, v0
    s_waitcnt lgkmcnt(0)
    s_cmp_lg_u32 s2, 0
    s_cselect_b64 s[2:3], -1, 0
    v_cndmask_b32_e64 v1, 0, 1, s[2:3]
    v_lshlrev_b32_e32 v1, 5, v1
    ;;#ASMSTART
    s_mov_b32 s2, v1
    ;;#ASMEND
    v_mov_b32_e32 v1, s2
    global_store_dword v0, v1, s[0:1]

If either of the values being selected is changed so that either one isn't 0, or the other isn't a power of 2, instead it compiles to this, which works.

    s_load_dword s2, s[4:5], 0x0
    s_load_dwordx2 s[0:1], s[4:5], 0x8
    v_lshlrev_b32_e32 v0, 2, v0
    s_waitcnt lgkmcnt(0)
    s_cmp_eq_u32 s2, 0
    s_cselect_b32 s2, 0, 33
    ;;#ASMSTART
    s_mov_b32 s2, s2
    ;;#ASMEND
    v_mov_b32_e32 v1, s2
    global_store_dword v0, v1, s[0:1]

llvmbot commented 5 months ago

@llvm/issue-subscribers-backend-amdgpu

Author: None (Engininja2)

Here is my test kernel: ```cpp static __global__ void test(const int32_t input, int32_t * result) { const int data = input ? 32 : 0; int output; asm volatile("s_mov_b32 %0, %1" : "=s"(output) : "s"(data)); result[threadIdx.x] = output; } ``` and here's the asm it compiles to for gfx900 on rocm 6.0.0, where it fails to assemble because of an invalid operand. ```asm s_load_dword s2, s[4:5], 0x0 s_load_dwordx2 s[0:1], s[4:5], 0x8 v_lshlrev_b32_e32 v0, 2, v0 s_waitcnt lgkmcnt(0) s_cmp_lg_u32 s2, 0 s_cselect_b64 s[2:3], -1, 0 v_cndmask_b32_e64 v1, 0, 1, s[2:3] v_lshlrev_b32_e32 v1, 5, v1 ;;#ASMSTART s_mov_b32 s2, v1 ;;#ASMEND v_mov_b32_e32 v1, s2 global_store_dword v0, v1, s[0:1] ``` If either of the values being selected is changed so that either one isn't 0, or the other isn't a power of 2, instead it compiles to this, which works. ```asm s_load_dword s2, s[4:5], 0x0 s_load_dwordx2 s[0:1], s[4:5], 0x8 v_lshlrev_b32_e32 v0, 2, v0 s_waitcnt lgkmcnt(0) s_cmp_eq_u32 s2, 0 s_cselect_b32 s2, 0, 33 ;;#ASMSTART s_mov_b32 s2, s2 ;;#ASMEND v_mov_b32_e32 v1, s2 global_store_dword v0, v1, s[0:1] ```

llvmbot commented 4 months ago

Hi!

This issue may be a good introductory issue for people new to working on LLVM. If you would like to work on this issue, your first steps are:

Check that no other contributor has already been assigned to this issue. If you believe that no one is actually working on it despite an assignment, ping the person. After one week without a response, the assignee may be changed.
In the comments of this issue, request for it to be assigned to you, or just create a pull request after following the steps below. Mention this issue in the description of the pull request.
Fix the issue locally.
Run the test suite locally. Remember that the subdirectories under test/ create fine-grained testing targets, so you can e.g. use make check-clang-ast to only run Clang's AST tests.
Create a Git commit.
Run git clang-format HEAD~1 to format your changes.
Open a pull request to the upstream repository on GitHub. Detailed instructions can be found in GitHub's documentation. Mention this issue in the description of the pull request.

If you have any further questions about this issue, don't hesitate to ask via a comment in the thread below.

llvmbot commented 4 months ago

@llvm/issue-subscribers-good-first-issue

Author: None (Engininja2)

Here is my test kernel: ```cpp static __global__ void test(const int32_t input, int32_t * result) { const int data = input ? 32 : 0; int output; asm volatile("s_mov_b32 %0, %1" : "=s"(output) : "s"(data)); result[threadIdx.x] = output; } ``` and here's the asm it compiles to for gfx900 on rocm 6.0.0, where it fails to assemble because of an invalid operand. ```asm s_load_dword s2, s[4:5], 0x0 s_load_dwordx2 s[0:1], s[4:5], 0x8 v_lshlrev_b32_e32 v0, 2, v0 s_waitcnt lgkmcnt(0) s_cmp_lg_u32 s2, 0 s_cselect_b64 s[2:3], -1, 0 v_cndmask_b32_e64 v1, 0, 1, s[2:3] v_lshlrev_b32_e32 v1, 5, v1 ;;#ASMSTART s_mov_b32 s2, v1 ;;#ASMEND v_mov_b32_e32 v1, s2 global_store_dword v0, v1, s[0:1] ``` If either of the values being selected is changed so that either one isn't 0, or the other isn't a power of 2, instead it compiles to this, which works. ```asm s_load_dword s2, s[4:5], 0x0 s_load_dwordx2 s[0:1], s[4:5], 0x8 v_lshlrev_b32_e32 v0, 2, v0 s_waitcnt lgkmcnt(0) s_cmp_eq_u32 s2, 0 s_cselect_b32 s2, 0, 33 ;;#ASMSTART s_mov_b32 s2, s2 ;;#ASMEND v_mov_b32_e32 v1, s2 global_store_dword v0, v1, s[0:1] ```

llvm / llvm-project

[AMDGPU] generates v_cndmask/lshlrev for uniform select between 0 and a power of 2 #87938