Open Hendiadyoin1 opened 3 weeks ago
@llvm/issue-subscribers-backend-risc-v
@llvm/issue-subscribers-backend-aarch64
@llvm/issue-subscribers-backend-x86

Author: Leon Albrecht (Hendiadyoin1)
On x86 we use `LowerBUILD_VECTORAsVariablePermute`, which gets most basic cases. We lower to other variable shuffles as well as `pshufb`, so we haven't made use of every feature we could, but zeroable elements should be fairly straightforward.
On RISC-V the LLVM IR is scalarized and returns an i128 type:
```llvm
define dso_local noundef i128 @shuf_unchecked(long vector[2], long vector[2])(i128 noundef %num.coerce, i128 noundef %idx.coerce) local_unnamed_addr {
entry:
  %0 = bitcast i128 %num.coerce to <2 x i64>
  %1 = bitcast i128 %idx.coerce to <2 x i64>
  %vecext = trunc i128 %idx.coerce to i64
  %vecext3 = extractelement <2 x i64> %0, i64 %vecext
  %vecinit = insertelement <2 x i64> poison, i64 %vecext3, i64 0
  %vecext4 = extractelement <2 x i64> %1, i64 1
  %vecext5 = extractelement <2 x i64> %0, i64 %vecext4
  %vecinit6 = insertelement <2 x i64> %vecinit, i64 %vecext5, i64 1
  %2 = bitcast <2 x i64> %vecinit6 to i128
  ret i128 %2
}
```
On x86-64-v4 we keep the vector type, though:
```llvm
define dso_local noundef <2 x i64> @shuf_unchecked(long vector[2], long vector[2])(<2 x i64> noundef %num, <2 x i64> noundef %idx) local_unnamed_addr {
entry:
  %vecext = extractelement <2 x i64> %idx, i64 0
  %vecext1 = extractelement <2 x i64> %num, i64 %vecext
  %vecinit = insertelement <2 x i64> poison, i64 %vecext1, i64 0
  %vecext2 = extractelement <2 x i64> %idx, i64 1
  %vecext3 = extractelement <2 x i64> %num, i64 %vecext2
  %vecinit4 = insertelement <2 x i64> %vecinit, i64 %vecext3, i64 1
  ret <2 x i64> %vecinit4
}
```
The i128 behaviour may have something to do with the standard RISC-V calling convention, as that passes vectors in regular registers unless you enforce the vector calling-convention variant.
(Note that `[[riscv::vector_cc]]`, which is supposed to do that, does not seem to work?)
OK, looking into that a bit more, it seems to be a bug: passing vectors should enable the vector ABI, but that cannot happen because somewhere in the front end it forgets that it should be passing vectors.
The following pattern (C++):
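(The snippet itself was lost in formatting; the following is a plausible reconstruction, assuming GCC/Clang vector extensions and guessing the names and types from the `shuf_unchecked(long vector[2], long vector[2])` signature visible in the IR above:)

```cpp
// Hypothetical reconstruction of the lost snippet; assumes LP64
// (long == 64 bits) and GCC/Clang vector extensions.
typedef long v2i64 __attribute__((vector_size(16)));

v2i64 shuf_unchecked(v2i64 num, v2i64 idx) {
    // Dynamic shuffle: each result lane picks the lane of num selected
    // by the runtime value in idx. No range check on the indices.
    v2i64 res = {num[idx[0]], num[idx[1]]};
    return res;
}
```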
is not optimized to a native vector shuffle by all architectures:

- x86 uses `pshufb` with some index adjustments,
- AArch64 uses the `tbl` instruction only in the byte-sized case, failing to apply the same index-folding trick x86 does, and instead goes through the stack to achieve dynamic indexing,
- RISC-V does not use `vrgather` in all cases, and instead constructs the result vector element by element.

Compare here: https://godbolt.org/z/osPhv1KT4

This especially applies to the range-checked version of that code:
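(This snippet was also lost; one plausible shape, assuming out-of-range indices are meant to yield zero rather than invoke UB:)

```cpp
typedef long v2i64 __attribute__((vector_size(16)));

// Hypothetical reconstruction: the same dynamic shuffle, but each index
// is range-checked so out-of-range lanes produce 0 instead of UB.
v2i64 shuf_checked(v2i64 num, v2i64 idx) {
    v2i64 res = {0, 0};
    for (int i = 0; i < 2; ++i)
        if (idx[i] >= 0 && idx[i] < 2)
            res[i] = num[idx[i]];
    return res;
}
```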
which all architectures support natively in some way, but no tested backend seems to generate the ideal code. x86 again seems to be quite good with its folding, but fails to recognize and leverage the behavior of `pshufb`, even after giving it a hint of how a masked approach would look.

A more comprehensive list of functions in multiple versions, along with possible optimized versions, can be found here: https://godbolt.org/z/Ej4hYsvPr

Important notes from the code:
- This behavior is likely partially caused by those code snippets not being canonicalized to a single IR instruction; note that `llvm.vp.gather.*` only seems to handle loads from memory, and `shufflevector` seems to only allow shuffling by constants.
- In comparison to GCC, Clang does not expose `__builtin_shuffle(vec, mask)` or `__builtin_shuffle(vec0, vec1, mask)`, which represent the first case.

Just to compare, GCC on x86 fails to do any native shuffles and falls back to stack loads, and even gets branchy in the checked case, but it does have `__builtin_shuffle`, which works quite well.

Sidenote: UBSan destroys the second case even on x86, although it does not contain any possible UB AFAICT.