Noticed some odd SLP behavior around (non-constant) strided indexing for gathers on RISCV. In the case below, we're accessing every 'S'th element. While SLP does manage to vectorize this, it does so in a far from ideal way. First, it appears to not know that shl and mul are interchangeable in this case. Second, it appears to not realize that 0 and %5 can be represented by mul of %5 by 0 and 1 respectively.
I know that SLP has the alternative opcode mechanism, but this is one of the areas of SLP internals I've always found the most confusing. Not sure if this is a small bug somewhere, or a huge architectural lift. Guidance welcomed. :)
typedef signed char int8_t;
// Signed subtract which is not UB on overflow
inline int8_t sub(int8_t a, int8_t b) {
return (int8_t)((int)a - (int)b);
}
int sum_of_absolute_diff_8(int8_t* restrict a, int8_t* restrict b, int S) {
int sum = 0;
for (unsigned i = 0; i < 8; i++)
sum += sub(0, b[i*S]);
return sum;
}
*** IR Dump After SLPVectorizerPass on sum_of_absolute_diff_8 ***
; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(argmem: read) uwtable vscale_range(2,1024)
define dso_local signext i32 @sum_of_absolute_diff_8(ptr noalias nocapture noundef readnone %a, ptr noalias nocapture noundef readonly %b, i32 noundef signext %S) local_unnamed_addr #0 {
entry:
%idxprom.1 = zext i32 %S to i64
%arrayidx.1 = getelementptr inbounds i8, ptr %b, i64 %idxprom.1
%0 = insertelement <4 x i32> poison, i32 %S, i32 0
%1 = shufflevector <4 x i32> %0, <4 x i32> poison, <4 x i32> zeroinitializer
%2 = shl <4 x i32> %1, <i32 1, i32 3, i32 2, i32 5>
%3 = mul <4 x i32> %1, <i32 1, i32 3, i32 2, i32 5>
%4 = shufflevector <4 x i32> %2, <4 x i32> %3, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
%mul.6 = mul i32 %S, 6
%idxprom.6 = zext i32 %mul.6 to i64
%arrayidx.6 = getelementptr inbounds i8, ptr %b, i64 %idxprom.6
%mul.7 = mul i32 %S, 7
%idxprom.7 = zext i32 %mul.7 to i64
%arrayidx.7 = getelementptr inbounds i8, ptr %b, i64 %idxprom.7
%5 = load i8, ptr %b, align 1, !tbaa !7
%6 = load i8, ptr %arrayidx.1, align 1, !tbaa !7
%7 = zext <4 x i32> %4 to <4 x i64>
%8 = insertelement <4 x ptr> poison, ptr %b, i32 0
%9 = shufflevector <4 x ptr> %8, <4 x ptr> poison, <4 x i32> zeroinitializer
%10 = getelementptr i8, <4 x ptr> %9, <4 x i64> %7
%11 = call <4 x i8> @llvm.masked.gather.v4i8.v4p0(<4 x ptr> %10, i32 1, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i8> poison), !tbaa !7
%12 = load i8, ptr %arrayidx.6, align 1, !tbaa !7
%13 = load i8, ptr %arrayidx.7, align 1, !tbaa !7
%14 = insertelement <8 x i8> poison, i8 %6, i32 0
%15 = insertelement <8 x i8> %14, i8 %5, i32 1
%16 = shufflevector <4 x i8> %11, <4 x i8> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison>
%17 = shufflevector <8 x i8> %15, <8 x i8> %16, <8 x i32> <i32 0, i32 1, i32 8, i32 9, i32 10, i32 11, i32 poison, i32 poison>
%18 = insertelement <8 x i8> %17, i8 %12, i32 6
%19 = insertelement <8 x i8> %18, i8 %13, i32 7
%20 = sub <8 x i8> zeroinitializer, %19
%21 = sext <8 x i8> %20 to <8 x i32>
%22 = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %21)
ret i32 %22
}
There are 2 well known issues, which are still WIP.
Compatible operations with the different opcodes (shl/mul, shr/div).
Vectorization of the 'copyable' elements in integer binary ops.(https://reviews.llvm.org/D28907 tried to fix this but in the good way, depends on non-power-of-2 vectorization)
Noticed some odd SLP behavior around (non-constant) strided indexing for gathers on RISCV. In the case below, we're accessing every 'S'th element. While SLP does manage to vectorize this, it does so in a far from ideal way. First, it appears to not know that shl and mul are interchangeable in this case. Second, it appears to not realize that 0 and %5 can be represented by mul of %5 by 0 and 1 respectively.
I know that SLP has the alternative opcode mechanism, but this is one of the areas of SLP internals I've always found the most confusing. Not sure if this is a small bug somewhere, or a huge architectural lift. Guidance welcomed. :)
clang -march=rv64gcv -target riscv64-unknown-linux-gnu -S -O3 example.c
cat example.c