[X86][Aarch64][RISCV] Dynamic vector shuffle idiom is not always recognized and optimally lowered

Hendiadyoin1 commented 3 weeks ago

Following pattern (c++):

i32x4{num[idx[0]],
      num[idx[1]],
      num[idx[2]],
      num[idx[3]]}

is not optimized to a native vector shuffle by all architectures:

X86 with SSSE3 does a good job generating a pshufb with some index adjustments,
Arm only seems to do the same with the equivalent tbl instruction only in the byte sized case, failing to apply the same index folding trick x86 does, instead going through the stack to achieve dynamic indexing
Risc-V fails to do a vrgather in all cases, instead constructing the result vector element by element Compare here: https://godbolt.org/z/osPhv1KT4

This especially applies to the range checked version of that code:

i32x4{idx[0] >= 0 && idx[0] < 4 ? num[idx[0]] : 0,
      idx[1] >= 0 && idx[1] < 4 ? num[idx[1]] : 0,
      idx[2] >= 0 && idx[2] < 4 ? num[idx[2]] : 0,
      idx[3] >= 0 && idx[3] < 4 ? num[idx[3]] : 0}

which all architectures support natively in some way, but no tested backend seems to generate the ideal code. x86 again seems to be quite good with its folding but fails to recognize and leverage the behavior of pshufb, even after giving it a hint on how a masked approach would look like:

    idx = idx | ~((u32x4)idx < 4);
    return i32x4{
        (idx[0] != ~0 ? num[idx[0]] : 0),
        (idx[1] != ~0 ? num[idx[1]] : 0),
        (idx[2] != ~0 ? num[idx[2]] : 0),
        (idx[3] != ~0 ? num[idx[3]] : 0),
    }

A more comprehensive list of functions in multiple versions and possible optimized versions can be found here: https://godbolt.org/z/Ej4hYsvPr Important notes from the code:

// Note: x86-SSE3s `pshufb` inserts a 0 when the most significant bit of the
//       index element is set, so masking it of with 0xFF for oob elements works
//       fine [...]
// Note: ARMs `tbl` inserts a 0 when the index element is out of range [...]
// Note: RiscV-Vs vrgather.vv, similar to arm, sets a 0 when the index element
//       is out of bounds,
//       but additionally it acts with the current element width, making it not
//       need any additional index manipulation or masking

This behavior is likely partially caused by those code snippets not being canonicalized to a single IR instruction; Note that llvm.vp.gather.* only seems to handle stores to memory and shufflevector seems to only allow shuffling by constants Also note that in comparison to gcc clang does not expose __builtin_shuffle (vec, mask) or __builtin_shuffle (vec0, vec1, mask), which represent the first case.

Just to compare, GCC on x86 fails to do any native shuffles and falls back to leveraging stack loads and even gets branch-y in the checked case, but has __builtin_shuffle which works quite good Sidenote: UBsan destroys the second case even in x86 although it does not contain any possible UB afaict

llvmbot commented 3 weeks ago

@llvm/issue-subscribers-backend-risc-v

Author: Leon Albrecht (Hendiadyoin1)

Following pattern (c++): ```c++ i32x4{num[idx[0]], num[idx[1]], num[idx[2]], num[idx[3]]} ``` is not optimized to a native vector shuffle by all architectures: - X86 with SSSE3 does a good job generating a `pshufb` with some index adjustments, - Arm only seems to do the same with the equivalent `tbl` instruction only in the byte sized case, failing to apply the same index folding trick x86 does, instead going through the stack to achieve dynamic indexing - Risc-V fails to do a `vrgather` in all cases, instead constructing the result vector element by element Compare here: https://godbolt.org/z/osPhv1KT4 This especially applies to the range checked version of that code: ```c++ i32x4{idx[0] >= 0 && idx[0] < 4 ? num[idx[0]] : 0, idx[1] >= 0 && idx[1] < 4 ? num[idx[1]] : 0, idx[2] >= 0 && idx[2] < 4 ? num[idx[2]] : 0, idx[3] >= 0 && idx[3] < 4 ? num[idx[3]] : 0} ``` which all architectures support natively in some way, but no tested backend seems to generate the ideal code. x86 again seems to be quite good with its folding but fails to recognize and leverage the behavior of `pshufb`, even after giving it a hint on how a masked approach would look like: ```c++ idx = idx | ~((u32x4)idx < 4); return i32x4{ (idx[0] != ~0 ? num[idx[0]] : 0), (idx[1] != ~0 ? num[idx[1]] : 0), (idx[2] != ~0 ? num[idx[2]] : 0), (idx[3] != ~0 ? num[idx[3]] : 0), } ``` A more comprehensive list of functions in multiple versions and possible optimized versions can be found here: https://godbolt.org/z/Ej4hYsvPr Important notes from the code: ``` // Note: x86-SSE3s `pshufb` inserts a 0 when the most significant bit of the // index element is set, so masking it of with 0xFF for oob elements works // fine [...] // Note: ARMs `tbl` inserts a 0 when the index element is out of range [...] // Note: RiscV-Vs vrgather.vv, similar to arm, sets a 0 when the index element // is out of bounds, // but additionally it acts with the current element width, making it not // need any additional index manipulation or masking ``` This behavior is likely partially caused by those code snippets not being canonicalized to a single IR instruction; Note that [`llvm.vp.gather.*`](https://llvm.org/docs/LangRef.html#llvm-vp-gather-intrinsic) only seems to handle stores to memory and [`shufflevector`](https://llvm.org/docs/LangRef.html#shufflevector-instruction) seems to only allow shuffling by constants Also note that in comparison to gcc clang does not expose `__builtin_shuffle (vec, mask)` or `__builtin_shuffle (vec0, vec1, mask)`, which represent the first case. ---- Just to compare, GCC on x86 fails to do any native shuffles and falls back to leveraging stack loads and even gets branch-y in the checked case, but has `__builtin_shuffle` which works quite good Sidenote: UBsan destroys the second case even in x86 although it does not contain any possible UB afaict

llvmbot commented 3 weeks ago

@llvm/issue-subscribers-backend-aarch64

Author: Leon Albrecht (Hendiadyoin1)

Following pattern (c++): ```c++ i32x4{num[idx[0]], num[idx[1]], num[idx[2]], num[idx[3]]} ``` is not optimized to a native vector shuffle by all architectures: - X86 with SSSE3 does a good job generating a `pshufb` with some index adjustments, - Arm only seems to do the same with the equivalent `tbl` instruction only in the byte sized case, failing to apply the same index folding trick x86 does, instead going through the stack to achieve dynamic indexing - Risc-V fails to do a `vrgather` in all cases, instead constructing the result vector element by element Compare here: https://godbolt.org/z/osPhv1KT4 This especially applies to the range checked version of that code: ```c++ i32x4{idx[0] >= 0 && idx[0] < 4 ? num[idx[0]] : 0, idx[1] >= 0 && idx[1] < 4 ? num[idx[1]] : 0, idx[2] >= 0 && idx[2] < 4 ? num[idx[2]] : 0, idx[3] >= 0 && idx[3] < 4 ? num[idx[3]] : 0} ``` which all architectures support natively in some way, but no tested backend seems to generate the ideal code. x86 again seems to be quite good with its folding but fails to recognize and leverage the behavior of `pshufb`, even after giving it a hint on how a masked approach would look like: ```c++ idx = idx | ~((u32x4)idx < 4); return i32x4{ (idx[0] != ~0 ? num[idx[0]] : 0), (idx[1] != ~0 ? num[idx[1]] : 0), (idx[2] != ~0 ? num[idx[2]] : 0), (idx[3] != ~0 ? num[idx[3]] : 0), } ``` A more comprehensive list of functions in multiple versions and possible optimized versions can be found here: https://godbolt.org/z/Ej4hYsvPr Important notes from the code: ``` // Note: x86-SSE3s `pshufb` inserts a 0 when the most significant bit of the // index element is set, so masking it of with 0xFF for oob elements works // fine [...] // Note: ARMs `tbl` inserts a 0 when the index element is out of range [...] // Note: RiscV-Vs vrgather.vv, similar to arm, sets a 0 when the index element // is out of bounds, // but additionally it acts with the current element width, making it not // need any additional index manipulation or masking ``` This behavior is likely partially caused by those code snippets not being canonicalized to a single IR instruction; Note that [`llvm.vp.gather.*`](https://llvm.org/docs/LangRef.html#llvm-vp-gather-intrinsic) only seems to handle stores to memory and [`shufflevector`](https://llvm.org/docs/LangRef.html#shufflevector-instruction) seems to only allow shuffling by constants Also note that in comparison to gcc clang does not expose `__builtin_shuffle (vec, mask)` or `__builtin_shuffle (vec0, vec1, mask)`, which represent the first case. ---- Just to compare, GCC on x86 fails to do any native shuffles and falls back to leveraging stack loads and even gets branch-y in the checked case, but has `__builtin_shuffle` which works quite good Sidenote: UBsan destroys the second case even in x86 although it does not contain any possible UB afaict

llvmbot commented 3 weeks ago

@llvm/issue-subscribers-backend-x86

Author: Leon Albrecht (Hendiadyoin1)

Following pattern (c++): ```c++ i32x4{num[idx[0]], num[idx[1]], num[idx[2]], num[idx[3]]} ``` is not optimized to a native vector shuffle by all architectures: - X86 with SSSE3 does a good job generating a `pshufb` with some index adjustments, - Arm only seems to do the same with the equivalent `tbl` instruction only in the byte sized case, failing to apply the same index folding trick x86 does, instead going through the stack to achieve dynamic indexing - Risc-V fails to do a `vrgather` in all cases, instead constructing the result vector element by element Compare here: https://godbolt.org/z/osPhv1KT4 This especially applies to the range checked version of that code: ```c++ i32x4{idx[0] >= 0 && idx[0] < 4 ? num[idx[0]] : 0, idx[1] >= 0 && idx[1] < 4 ? num[idx[1]] : 0, idx[2] >= 0 && idx[2] < 4 ? num[idx[2]] : 0, idx[3] >= 0 && idx[3] < 4 ? num[idx[3]] : 0} ``` which all architectures support natively in some way, but no tested backend seems to generate the ideal code. x86 again seems to be quite good with its folding but fails to recognize and leverage the behavior of `pshufb`, even after giving it a hint on how a masked approach would look like: ```c++ idx = idx | ~((u32x4)idx < 4); return i32x4{ (idx[0] != ~0 ? num[idx[0]] : 0), (idx[1] != ~0 ? num[idx[1]] : 0), (idx[2] != ~0 ? num[idx[2]] : 0), (idx[3] != ~0 ? num[idx[3]] : 0), } ``` A more comprehensive list of functions in multiple versions and possible optimized versions can be found here: https://godbolt.org/z/Ej4hYsvPr Important notes from the code: ``` // Note: x86-SSE3s `pshufb` inserts a 0 when the most significant bit of the // index element is set, so masking it of with 0xFF for oob elements works // fine [...] // Note: ARMs `tbl` inserts a 0 when the index element is out of range [...] // Note: RiscV-Vs vrgather.vv, similar to arm, sets a 0 when the index element // is out of bounds, // but additionally it acts with the current element width, making it not // need any additional index manipulation or masking ``` This behavior is likely partially caused by those code snippets not being canonicalized to a single IR instruction; Note that [`llvm.vp.gather.*`](https://llvm.org/docs/LangRef.html#llvm-vp-gather-intrinsic) only seems to handle stores to memory and [`shufflevector`](https://llvm.org/docs/LangRef.html#shufflevector-instruction) seems to only allow shuffling by constants Also note that in comparison to gcc clang does not expose `__builtin_shuffle (vec, mask)` or `__builtin_shuffle (vec0, vec1, mask)`, which represent the first case. ---- Just to compare, GCC on x86 fails to do any native shuffles and falls back to leveraging stack loads and even gets branch-y in the checked case, but has `__builtin_shuffle` which works quite good Sidenote: UBsan destroys the second case even in x86 although it does not contain any possible UB afaict

RKSimon commented 3 weeks ago

On x86 we use LowerBUILD_VECTORAsVariablePermute which gets most basic cases, we lower to other variable shuffles as well as pshufb so haven't made use of every feature we could - but zero-able elements should be fairly straightforward.

lukel97 commented 3 weeks ago

On RISC-V the LLVM IR is scalarized and returns a i128 type

define dso_local noundef i128 @shuf_unchecked(long vector[2], long vector[2])(i128 noundef %num.coerce, i128 noundef %idx.coerce) local_unnamed_addr {
entry:
  %0 = bitcast i128 %num.coerce to <2 x i64>
  %1 = bitcast i128 %idx.coerce to <2 x i64>
  %vecext = trunc i128 %idx.coerce to i64
  %vecext3 = extractelement <2 x i64> %0, i64 %vecext
  %vecinit = insertelement <2 x i64> poison, i64 %vecext3, i64 0
  %vecext4 = extractelement <2 x i64> %1, i64 1
  %vecext5 = extractelement <2 x i64> %0, i64 %vecext4
  %vecinit6 = insertelement <2 x i64> %vecinit, i64 %vecext5, i64 1
  %2 = bitcast <2 x i64> %vecinit6 to i128
  ret i128 %2
}

On x86-64-v4 we keep the vector type though.

define dso_local noundef <2 x i64> @shuf_unchecked(long vector[2], long vector[2])(<2 x i64> noundef %num, <2 x i64> noundef %idx) local_unnamed_addr {
entry:
  %vecext = extractelement <2 x i64> %idx, i64 0
  %vecext1 = extractelement <2 x i64> %num, i64 %vecext
  %vecinit = insertelement <2 x i64> poison, i64 %vecext1, i64 0
  %vecext2 = extractelement <2 x i64> %idx, i64 1
  %vecext3 = extractelement <2 x i64> %num, i64 %vecext2
  %vecinit4 = insertelement <2 x i64> %vecinit, i64 %vecext3, i64 1
  ret <2 x i64> %vecinit4
}

Hendiadyoin1 commented 3 weeks ago

The i128 behaviour may have to do something with the standard calling convention of riscv, as that passes vectors in regular registers unless you enforce the calling convention variant. (Note that [[riscv::vector:cc]] which is supposed to do that does not seem to work?)

Hendiadyoin1 commented 3 weeks ago

Ok looking into that a bit more, it seems to be a bug, as passing vectors should enable the vector ABI, but that cannot happen as it forgets that it should be passing vectors somewhere in the front end

llvm / llvm-project

[X86][Aarch64][RISCV] Dynamic vector shuffle idiom is not always recognized and optimally lowered #96754