[SLP][RISCV] Sub-optimal IR for mul(S, step_vector())

preames commented 1 year ago

Noticed some odd SLP behavior around (non-constant) strided indexing for gathers on RISCV. In the case below, we're accessing every 'S'th element. While SLP does manage to vectorize this, it does so in a far from ideal way. First, it appears to not know that shl and mul are interchangeable in this case. Second, it appears to not realize that 0 and %5 can be represented by mul of %5 by 0 and 1 respectively.

I know that SLP has the alternative opcode mechanism, but this is one of the areas of SLP internals I've always found the most confusing. Not sure if this is a small bug somewhere, or a huge architectural lift. Guidance welcomed. :)

clang -march=rv64gcv -target riscv64-unknown-linux-gnu -S -O3 example.c cat example.c

typedef signed char int8_t;

// Signed subtract which is not UB on overflow
inline int8_t sub(int8_t a, int8_t b) {
  return (int8_t)((int)a - (int)b);
}

int sum_of_absolute_diff_8(int8_t* restrict a, int8_t* restrict b, int S) {
  int sum = 0; 
  for (unsigned i = 0; i < 8; i++)
    sum += sub(0, b[i*S]);
  return sum;
}

*** IR Dump After SLPVectorizerPass on sum_of_absolute_diff_8 ***
; Function Attrs: mustprogress nofree norecurse nosync nounwind willreturn memory(argmem: read) uwtable vscale_range(2,1024)
define dso_local signext i32 @sum_of_absolute_diff_8(ptr noalias nocapture noundef readnone %a, ptr noalias nocapture noundef readonly %b, i32 noundef signext %S) local_unnamed_addr #0 {
entry:
  %idxprom.1 = zext i32 %S to i64
  %arrayidx.1 = getelementptr inbounds i8, ptr %b, i64 %idxprom.1
  %0 = insertelement <4 x i32> poison, i32 %S, i32 0
  %1 = shufflevector <4 x i32> %0, <4 x i32> poison, <4 x i32> zeroinitializer
  %2 = shl <4 x i32> %1, <i32 1, i32 3, i32 2, i32 5>
  %3 = mul <4 x i32> %1, <i32 1, i32 3, i32 2, i32 5>
  %4 = shufflevector <4 x i32> %2, <4 x i32> %3, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
  %mul.6 = mul i32 %S, 6
  %idxprom.6 = zext i32 %mul.6 to i64
  %arrayidx.6 = getelementptr inbounds i8, ptr %b, i64 %idxprom.6
  %mul.7 = mul i32 %S, 7
  %idxprom.7 = zext i32 %mul.7 to i64
  %arrayidx.7 = getelementptr inbounds i8, ptr %b, i64 %idxprom.7
  %5 = load i8, ptr %b, align 1, !tbaa !7
  %6 = load i8, ptr %arrayidx.1, align 1, !tbaa !7
  %7 = zext <4 x i32> %4 to <4 x i64>
  %8 = insertelement <4 x ptr> poison, ptr %b, i32 0
  %9 = shufflevector <4 x ptr> %8, <4 x ptr> poison, <4 x i32> zeroinitializer
  %10 = getelementptr i8, <4 x ptr> %9, <4 x i64> %7
  %11 = call <4 x i8> @llvm.masked.gather.v4i8.v4p0(<4 x ptr> %10, i32 1, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i8> poison), !tbaa !7
  %12 = load i8, ptr %arrayidx.6, align 1, !tbaa !7
  %13 = load i8, ptr %arrayidx.7, align 1, !tbaa !7
  %14 = insertelement <8 x i8> poison, i8 %6, i32 0
  %15 = insertelement <8 x i8> %14, i8 %5, i32 1
  %16 = shufflevector <4 x i8> %11, <4 x i8> poison, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 poison, i32 poison, i32 poison, i32 poison>
  %17 = shufflevector <8 x i8> %15, <8 x i8> %16, <8 x i32> <i32 0, i32 1, i32 8, i32 9, i32 10, i32 11, i32 poison, i32 poison>
  %18 = insertelement <8 x i8> %17, i8 %12, i32 6
  %19 = insertelement <8 x i8> %18, i8 %13, i32 7
  %20 = sub <8 x i8> zeroinitializer, %19
  %21 = sext <8 x i8> %20 to <8 x i32>
  %22 = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> %21)
  ret i32 %22
}

llvmbot commented 1 year ago

@llvm/issue-subscribers-backend-risc-v

alexey-bataev commented 1 year ago

There are 2 well known issues, which are still WIP.

Compatible operations with the different opcodes (shl/mul, shr/div).
Vectorization of the 'copyable' elements in integer binary ops.(https://reviews.llvm.org/D28907 tried to fix this but in the good way, depends on non-power-of-2 vectorization)

llvm / llvm-project

[SLP][RISCV] Sub-optimal IR for mul(S, step_vector()) #63855