annamthomas opened this issue 1 year ago
Here is the reproducer: https://godbolt.org/z/s1PjYMs5a
@alexey-bataev @RKSimon any pointers here on what the issue might be?
So, I tried printing the cost-model for xor and fshl under different mattr:
opt chk.ll -passes="print<cost-model>" 2>&1 -disable-output -mtriple=x86_64-- -mattr=+avx512f,+avx512bw
Cost Model: Found an estimated cost of 1 for instruction: %V2I32 = xor <2 x i32> undef, undef
Cost Model: Found an estimated cost of 1 for instruction: %call303 = call <2 x i32> @llvm.fshl.v2i32(<2 x i32> %V2I32, <2 x i32> %V2I32, <2 x i32> <i32 1, i32 1>)
Cost Model: Found an estimated cost of 1 for instruction: %V4I32 = xor <4 x i32> undef, undef
Cost Model: Found an estimated cost of 1 for instruction: %call304 = call <4 x i32> @llvm.fshl.v4i32(<4 x i32> %V4I32, <4 x i32> %V4I32, <4 x i32> <i32 1, i32 1, i32 1, i32 1>)
opt chk.ll -passes="print
Cost Model: Found an estimated cost of 1 for instruction: %V2I32 = xor <2 x i32> undef, undef
Cost Model: Found an estimated cost of 4 for instruction: %call303 = call <2 x i32> @llvm.fshl.v2i32(<2 x i32> %V2I32, <2 x i32> %V2I32, <2 x i32> <i32 1, i32 1>)
Cost Model: Found an estimated cost of 1 for instruction: %V4I32 = xor <4 x i32> undef, undef
Cost Model: Found an estimated cost of 4 for instruction: %call304 = call <4 x i32> @llvm.fshl.v4i32(<4 x i32> %V4I32, <4 x i32> %V4I32, <4 x i32> <i32 1, i32 1, i32 1, i32 1>)
Changing to -mattr=+avx2 on the above testcase increases the cost of the vector fshl from 1 to 4. However, the debug output shows an SLP cost of -11! Shouldn't the benefit shrink from -8 to -5 instead?
SLP: Found cost = -11 for VF=2
SLP: Decided to vectorize cost = -11
The v2i32 types should be legalized to v4i32 inside the TTI callbacks. AVX512 does have a cheap v4i32 rotate instruction, which is what those fshl calls should lower to, but on AVX2 they will expand to shifts/or - hence the higher cost.
What I think is going on, though, is the high cost of the scalar rotate instructions - we don't cost constant rotate amounts as cheaper than variable rotates, despite them being notably faster on almost all hardware:
{ ISD::ROTL, MVT::i32, { 2, 3, 1, 3 } },
So we'd need to add lower costs for constant rotate amounts.
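For illustration only, here is a minimal C++ sketch of that idea - the enums, table names and cost values below are invented and are not the actual X86TargetTransformInfo code (which keys its tables off ISD opcodes and MVTs): constant rotate amounts get their own, cheaper table that is consulted before the variable-amount one.

#include <cstddef>
#include <iterator>
#include <optional>

// Toy stand-ins for LLVM's cost-table machinery (invented names and values).
enum OpKind { ROTL, ROTR };
enum TypeKind { I32, V4I32 };

struct CostEntry {
  OpKind Op;
  TypeKind Ty;
  unsigned RThroughput; // the cost kind the SLP vectorizer uses
};

// Hypothetical tables: constant rotate amounts are cheaper than variable ones.
static const CostEntry ConstRotateTbl[] = {{ROTL, I32, 1}, {ROTR, I32, 1}, {ROTL, V4I32, 1}};
static const CostEntry VarRotateTbl[]   = {{ROTL, I32, 2}, {ROTR, I32, 2}, {ROTL, V4I32, 2}};

static std::optional<unsigned> lookup(const CostEntry *Tbl, std::size_t N, OpKind Op, TypeKind Ty) {
  for (std::size_t I = 0; I != N; ++I)
    if (Tbl[I].Op == Op && Tbl[I].Ty == Ty)
      return Tbl[I].RThroughput;
  return std::nullopt;
}

// Try the constant-amount table first and only fall back to the
// variable-amount costs when no entry matches.
unsigned getRotateCost(OpKind Op, TypeKind Ty, bool AmountIsConstant) {
  if (AmountIsConstant)
    if (auto C = lookup(ConstRotateTbl, std::size(ConstRotateTbl), Op, Ty))
      return *C;
  if (auto C = lookup(VarRotateTbl, std::size(VarRotateTbl), Op, Ty))
    return *C;
  return 4; // pessimistic default for anything unmatched
}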
I don't think it is directly related to the SLP vectorizer; it's the X86 cost model. And Simon has probably identified the actual issue.
Yep, thank you. I was looking at the cost of the vector ones and trying to increase those. @RKSimon I tried reducing the cost of ROTL and ROTR for i32 (scalar) to all ones, just as a quick check to see what happens. We did reduce the benefit of SLP vectorization, so it is in the right direction (but we still vectorize):
SLP: Found cost = -6 for VF=2
SLP: Decided to vectorize cost = -6
Earlier it was -8.
For reference, the scalar assembly is:
movl 8(%rcx,%rax,4), %esi
xorl -12(%rcx,%rax,4), %esi
movl -44(%rcx,%rax,4), %edi
xorl -36(%rcx,%rax,4), %edi
xorl %esi, %edi
rorxl $31, %edi, %esi
movl %esi, 20(%rcx,%rax,4)
incq %rax
cmpq $78, %rax
jbe -38
AVX512 does have a cheap v4i32 rotate instruction which is what those fshl should lower to
I tested this with a couple of different mattr settings and I think the cost may not be reflecting the lowering.
cat chk.ll
declare <2 x i32> @llvm.fshl.v2i32(<2 x i32>, <2 x i32>, <2 x i32>)
define <2 x i32> @splatconstant_funnnel_v2i32(<2 x i32> %x) nounwind {
%res = call <2 x i32> @llvm.fshl.v2i32(<2 x i32> %x, <2 x i32> %x, <2 x i32> <i32 1, i32 1>)
ret <2 x i32> %res
}
llc -mtriple=x86_64-unknown-unknown -mattr=+avx512f,+avx512vbmi,+avx512vbmi2 -mcpu=cascadelake chk.ll
splatconstant_funnnel_v2i32: # @splatconstant_funnnel_v2i32
# %bb.0:
vprold $1, %xmm0, %xmm0
retq
True for avx512, and for avx2 with +xop (it lowers to a single instruction: vprotd $1, %xmm0, %xmm0).
However, for the mattr from this repro.ll, we lowered fshl into three vector instructions:
llc -mattr=+avx512cd,-sha,+xsaveopt,-avxvnni,-mwaitx,+sse4.2,-cldemote,+avx512f,+lzcnt,+fsgsbase,+aes,-sse4a,+rtm,+fma,-avx512vp2intersect,-avxneconvert,+popcnt,-prefetchi,+avx2,-uintr,-ptwrite,+fxsr,-pconfig,-avx512er,+rdseed,+pku,-rdpid,-avx512vbmi,-avx512vbmi2,+sse3,+xsaves,-amx-tile,-fma4,+invpcid,-cmpccxadd,-prefetchwt1,+ssse3,+pclmul,+clflushopt,-tsxldtrk,+crc32,+rdrnd,+sse2,-kl,-clzero,+bmi,-raoint,+xsavec,-serialize,-avxvnniint8,+sse,-rdpru,-tbm,-avx512bf16,-waitpkg,-amx-fp16,-avx512ifma,-vaes,+f16c,+sahf,+xsave,+avx512bw,-amx-int8,-vpclmulqdq,-sgx,-avx512fp16,-gfni,-amx-bf16,+bmi2,-movdir64b,+avx512vl,-xop,+prfchw,+cx16,-enqcmd,+64bit,-amx-complex,-avx512pf,-lwp,-avx512vpopcntdq,+avx512dq,+mmx,-avxifma,+avx512vnni,+avx,+cmov,-hreset,+sse4.1,+movbe,+adx,+clwb,-widekl,-movdiri,+cx8,-shstk,-avx512bitalg,-wbnoinvd,-avx512f -mcpu=cascadelake -mtriple=x86_64-unknown-linux-gnu chk.ll
splatconstant_funnnel_v2i32: # @splatconstant_funnnel_v2i32
# %bb.0:
vpsrlvd .LCPI0_0(%rip), %xmm0, %xmm1
vpsllvd .LCPI0_1(%rip), %xmm0, %xmm0
vpor %xmm1, %xmm0, %xmm0
retq
So there are a couple of problems to address:
Thanks @RKSimon for taking this up! I instrumented the SLPVectorizer cost model and confirmed that the fshl costs correctly reflect what we get in the lowered code.
Thank you for landing the above change @RKSimon. However, we don't see any difference in the performance of the internal benchmark. While we do generate better vector code (in some cases, as seen in the test changes for the patch), it is still not as good as the scalar code. It looks like what needs correction is the X86 cost model. However, even with the lower costs for constant rotate amounts, we still SLP vectorize this code.
Also, for the mattr where we see the performance degradation (we see it across Skylake and Cascade Lake), we haven't reduced the number of vector instructions (I haven't checked the throughput of the changed instructions).
Original code generated was:
llc -mattr=+avx512cd,-sha,+xsaveopt,-avxvnni,-mwaitx,+sse4.2,-cldemote,+avx512f,+lzcnt,+fsgsbase,+aes,-sse4a,+rtm,+fma,-avx512vp2intersect,-avxneconvert,+popcnt,-prefetchi,+avx2,-uintr,-ptwrite,+fxsr,-pconfig,-avx512er,+rdseed,+pku,-rdpid,-avx512vbmi,-avx512vbmi2,+sse3,+xsaves,-amx-tile,-fma4,+invpcid,-cmpccxadd,-prefetchwt1,+ssse3,+pclmul,+clflushopt,-tsxldtrk,+crc32,+rdrnd,+sse2,-kl,-clzero,+bmi,-raoint,+xsavec,-serialize,-avxvnniint8,+sse,-rdpru,-tbm,-avx512bf16,-waitpkg,-amx-fp16,-avx512ifma,-vaes,+f16c,+sahf,+xsave,+avx512bw,-amx-int8,-vpclmulqdq,-sgx,-avx512fp16,-gfni,-amx-bf16,+bmi2,-movdir64b,+avx512vl,-xop,+prfchw,+cx16,-enqcmd,+64bit,-amx-complex,-avx512pf,-lwp,-avx512vpopcntdq,+avx512dq,+mmx,-avxifma,+avx512vnni,+avx,+cmov,-hreset,+sse4.1,+movbe,+adx,+clwb,-widekl,-movdiri,+cx8,-shstk,-avx512bitalg,-wbnoinvd,-avx512f -mcpu=cascadelake -mtriple=x86_64-unknown-linux-gnu chk.ll
splatconstant_funnnel_v2i32: # @splatconstant_funnnel_v2i32
# %bb.0:
vpsrlvd .LCPI0_0(%rip), %xmm0, %xmm1
vpsllvd .LCPI0_1(%rip), %xmm0, %xmm0
vpor %xmm1, %xmm0, %xmm0
retq
With the above improvement, we have:
vpsrld $31, %xmm0, %xmm1
vpaddd %xmm0, %xmm0, %xmm0
vpor %xmm1, %xmm0, %xmm0
retq
For chk.ll:
declare <2 x i32> @llvm.fshl.v2i32(<2 x i32>, <2 x i32>, <2 x i32>)
define <2 x i32> @splatconstant_funnnel_v2i32(<2 x i32> %x) nounwind {
%res = call <2 x i32> @llvm.fshl.v2i32(<2 x i32> %x, <2 x i32> %x, <2 x i32> <i32 1, i32 1>)
ret <2 x i32> %res
}
Sorry, still working on this - as you said, the costs still need attention (they need some refactoring, while the backend fixes were pretty trivial).
No worries and thank you! I just wanted to clarify my understanding that to completely avoid the perf degradation here, we should not SLP vectorize this code (by fixing the X86 cost model).
I reduced the testcase with bugpoint to show the minimal code needed to SLP-vectorize that same loop:
cat input.ll:
; ModuleID = 'reduced.ll'
source_filename = "reduced.ll"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128-ni:1-p2:32:8:8:32-ni:2"
target triple = "x86_64-unknown-linux-gnu"
define void @blam(ptr addrspace(1) %arg, ptr addrspace(1) %arg1, i32 %arg2, ptr addrspace(1) %arg3) gc "statepoint-example" personality ptr @wibble !prof !2 {
bb:
%load = load i64, ptr addrspace(256) inttoptr (i64 64 to ptr addrspace(256)), align 64, !invariant.load !3
%getelementptr = getelementptr inbounds i8, ptr addrspace(1) %arg, i64 64
%load4 = load ptr addrspace(1), ptr addrspace(1) %getelementptr, align 8, !tbaa !4, !dereferenceable_or_null !11, !align !12, !noundef !3
%icmp = icmp eq ptr addrspace(1) %load4, null
%load5 = load ptr addrspace(1), ptr addrspace(1) %getelementptr, align 8, !dereferenceable_or_null !11, !align !12, !noundef !3
%icmp6 = icmp eq ptr addrspace(1) %load5, null
%getelementptr7 = getelementptr inbounds i8, ptr addrspace(1) %load5, i64 16
%getelementptr8 = getelementptr inbounds i8, ptr addrspace(1) %load5, i64 8
%load9 = load i32, ptr addrspace(1) %getelementptr8, align 8, !range !13, !invariant.load !3, !noundef !3
%icmp10 = icmp ugt i32 %load9, 79
br label %bb11
bb11: ; preds = %bb11, %bb
%phi = phi i64 [ %add44, %bb11 ], [ 16, %bb ]
%add = add nsw i64 %phi, -3
%getelementptr12 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add
%load13 = load i32, ptr addrspace(1) %getelementptr12, align 4, !tbaa !14, !noundef !3
%add14 = add nsw i64 %phi, -8
%getelementptr15 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add14
%load16 = load i32, ptr addrspace(1) %getelementptr15, align 4, !tbaa !14, !noundef !3
%xor = xor i32 %load16, %load13
%add17 = add nsw i64 %phi, -14
%getelementptr18 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add17
%load19 = load i32, ptr addrspace(1) %getelementptr18, align 4, !tbaa !14, !noundef !3
%xor20 = xor i32 %xor, %load19
%add21 = add nsw i64 %phi, -16
%getelementptr22 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add21
%load23 = load i32, ptr addrspace(1) %getelementptr22, align 4, !tbaa !14, !noundef !3
%xor24 = xor i32 %xor20, %load23
%call = call i32 @llvm.fshl.i32(i32 %xor24, i32 %xor24, i32 1) #2
%getelementptr25 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %phi
store i32 %call, ptr addrspace(1) %getelementptr25, align 4, !tbaa !14
%add26 = add nuw nsw i64 %phi, 1
%add27 = add nsw i64 %phi, -2
%getelementptr28 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add27
%load29 = load i32, ptr addrspace(1) %getelementptr28, align 4, !tbaa !14, !noundef !3
%add30 = add nsw i64 %phi, -7
%getelementptr31 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add30
%load32 = load i32, ptr addrspace(1) %getelementptr31, align 4, !tbaa !14, !noundef !3
%xor33 = xor i32 %load32, %load29
%add34 = add nsw i64 %phi, -13
%getelementptr35 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add34
%load36 = load i32, ptr addrspace(1) %getelementptr35, align 4, !tbaa !14, !noundef !3
%xor37 = xor i32 %xor33, %load36
%add38 = add nsw i64 %phi, -15
%getelementptr39 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add38
%load40 = load i32, ptr addrspace(1) %getelementptr39, align 4, !tbaa !14, !noundef !3
%xor41 = xor i32 %xor37, %load40
%call42 = call i32 @llvm.fshl.i32(i32 %xor41, i32 %xor41, i32 1) #2
%getelementptr43 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add26
store i32 %call42, ptr addrspace(1) %getelementptr43, align 4, !tbaa !14
%add44 = add nuw nsw i64 %phi, 2
%icmp45 = icmp ugt i64 %add26, 78
br label %bb11
}
declare ptr @wibble()
declare void @wobble()
; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.fshl.i32(i32, i32, i32) #0
; Function Attrs: nocallback nofree nosync nounwind willreturn memory(read)
declare i64 @llvm.read_register.i64(metadata) #1
attributes #0 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }
attributes #1 = { nocallback nofree nosync nounwind willreturn memory(read) }
attributes #2 = { "inline-remark"="cost=never, reason=unavailable definition" }
!llvm.module.flags = !{!0, !1}
!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}
!2 = !{!"function_entry_count", i64 32768}
!3 = !{}
!4 = !{!5, !9, i64 64}
!5 = !{!"sun/security/provider/SHA.instance", !6, i64 0, !9, i64 64, !9, i64 72}
!6 = !{!"sun/security/provider/DigestBase.instance", !7, i64 0, !9, i64 16, !9, i64 24, !9, i64 28, !9, i64 32, !9, i64 40, !9, i64 48, !9, i64 56}
!7 = !{!"java/security/MessageDigestSpi.instance", !8, i64 0, !9, i64 8}
!8 = !{!"java/lang/Object.instance", !9, i64 0}
!9 = !{!"tbaa_local_fields", !10, i64 0}
!10 = !{!"dolphin-tbaa-access-type"}
!11 = !{i64 16}
!12 = !{i64 8}
!13 = !{i32 0, i32 2147483646}
!14 = !{!15, !15, i64 0}
!15 = !{!"int.array", !10}
Run this through opt -passes=slp-vectorizer input.ll. The cost is -8.
With the accurate target-features supplied for this function from the machine where we saw the degradation, the cost is -8 as well. This may not mean anything, though, because when I tried reducing the target-features blindly (trying to simulate bugpoint), the SLP benefit increased. This means something in the target-features matters, and it's very hard to find out what exactly (so that we can update the X86 cost model for these target-features).
The target-features attribute for the function is:
attributes #0 = { "target-features"="+avx512cd,-sha,+xsaveopt,-avxvnni,-mwaitx,+sse4.2,-cldemote,+avx512f,+lzcnt,+fsgsbase,+aes,-sse4a,+rtm,+fma,-avx512vp2intersect,-avxneconvert,+popcnt,-prefetchi,+avx2,-uintr,-ptwrite,+fxsr,-pconfig,-avx512er,+rdseed,+pku,-rdpid,-avx512vbmi,-avx512vbmi2,+sse3,+xsaves,-amx-tile,-fma4,+invpcid,-cmpccxadd,-prefetchwt1,+ssse3,+pclmul,+clflushopt,-tsxldtrk,+crc32,+rdrnd,+sse2,-kl,-clzero,+bmi,-raoint,+xsavec,-serialize,-avxvnniint8,+sse,-rdpru,-tbm,-avx512bf16,-waitpkg,-amx-fp16,-avx512ifma,-vaes,+f16c,+sahf,+xsave,+avx512bw,-amx-int8,-vpclmulqdq,-sgx,-avx512fp16,-gfni,-amx-bf16,+bmi2,-movdir64b,+avx512vl,-xop,+prfchw,+cx16,-enqcmd,+64bit,-amx-complex,-avx512pf,-lwp,-avx512vpopcntdq,+avx512dq,+mmx,-avxifma,+avx512vnni,+avx,+cmov,-hreset,+sse4.1,+movbe,+adx,+clwb,-widekl,-movdiri,+cx8,-shstk,-avx512bitalg,-wbnoinvd,-avx512f,+avx512cd,-sha,+xsaveopt,-avxvnni,-mwaitx,+sse4.2,-cldemote,+avx512f,+lzcnt,+fsgsbase,+aes,-sse4a,+rtm,+fma,-avx512vp2intersect,-avxneconvert,+popcnt,-prefetchi,+avx2,-uintr,-ptwrite,+fxsr,-pconfig,-avx512er,+rdseed,+pku,-rdpid,-avx512vbmi,-avx512vbmi2,+sse3,+xsaves,-amx-tile,-fma4,+invpcid,-cmpccxadd,-prefetchwt1,+ssse3,+pclmul,+clflushopt,-tsxldtrk,+crc32,+rdrnd,+sse2,-kl,-clzero,+bmi,-raoint,+xsavec,-serialize,-avxvnniint8,+sse,-rdpru,-tbm,-avx512bf16,-waitpkg,-amx-fp16,-avx512ifma,-vaes,+f16c,+sahf,+xsave,+avx512bw,-amx-int8,-vpclmulqdq,-sgx,-avx512fp16,-gfni,-amx-bf16,+bmi2,-movdir64b,+avx512vl,-xop,+prfchw,+cx16,-enqcmd,+64bit,-amx-complex,-avx512pf,-lwp,-avx512vpopcntdq,+avx512dq,+mmx,-avxifma,+avx512vnni,+avx,+cmov,-hreset,+sse4.1,+movbe,+adx,+clwb,-widekl,-movdiri,+cx8,-shstk,-avx512bitalg,-wbnoinvd,-avx512f,+avx512cd,-sha,+xsaveopt,-avxvnni,-mwaitx,+sse4.2,-cldemote,+avx512f,+lzcnt,+fsgsbase,+aes,-sse4a,+rtm,+fma,-avx512vp2intersect,-avxneconvert,+popcnt,-prefetchi,+avx2,-uintr,-ptwrite,+fxsr,-pconfig,-avx512er,+rdseed,+pku,-rdpid,-avx512vbmi,-avx512vbmi2,+sse3,+xsaves,-amx-tile,-fma4,+invpcid,-cmpccxadd,-prefetchwt1,+ssse3,+pclmul,+clflushopt,-tsxldtrk,+crc32,+rdrnd,+sse2,-kl,-clzero,+bmi,-raoint,+xsavec,-serialize,-avxvnniint8,+sse,-rdpru,-tbm,-avx512bf16,-waitpkg,-amx-fp16,-avx512ifma,-vaes,+f16c,+sahf,+xsave,+avx512bw,-amx-int8,-vpclmulqdq,-sgx,-avx512fp16,-gfni,-amx-bf16,+bmi2,-movdir64b,+avx512vl,-xop,+prfchw,+cx16,-enqcmd,+64bit,-amx-complex,-avx512pf,-lwp,-avx512vpopcntdq,+avx512dq,+mmx,-avxifma,+avx512vnni,+avx,+cmov,-hreset,+sse4.1,+movbe,+adx,+clwb,-widekl,-movdiri,+cx8,-shstk,-avx512bitalg,-wbnoinvd,-avx512f,+avx512cd,-sha,+xsaveopt,-avxvnni,-mwaitx,+sse4.2,-cldemote,+avx512f,+lzcnt,+fsgsbase,+aes,-sse4a,+rtm,+fma,-avx512vp2intersect,-avxneconvert,+popcnt,-prefetchi,+avx2,-uintr,-ptwrite,+fxsr,-pconfig,-avx512er,+rdseed,+pku,-rdpid,-avx512vbmi,-avx512vbmi2,+sse3,+xsaves,-amx-tile,-fma4,+invpcid,-cmpccxadd,-prefetchwt1,+ssse3,+pclmul,+clflushopt,-tsxldtrk,+crc32,+rdrnd,+sse2,-kl,-clzero,+bmi,-raoint,+xsavec,-serialize,-avxvnniint8,+sse,-rdpru,-tbm,-avx512bf16,-waitpkg,-amx-fp16,-avx512ifma,-vaes,+f16c,+sahf,+xsave,+avx512bw,-amx-int8,-vpclmulqdq,-sgx,-avx512fp16,-gfni,-amx-bf16,+bmi2,-movdir64b,+avx512vl,-xop,+prfchw,+cx16,-enqcmd,+64bit,-amx-complex,-avx512pf,-lwp,-avx512vpopcntdq,+avx512dq,+mmx,-avxifma,+avx512vnni,+avx,+cmov,-hreset,+sse4.1,+movbe,+adx,+clwb,-widekl,-movdiri,+cx8,-shstk,-avx512bitalg,-wbnoinvd,-avx512f,+avx512cd,-sha,+xsaveopt,-avxvnni,-mwaitx,+sse4.2,-cldemote,+avx512f,+lzcnt,+fsgsbase,+aes,-sse4a,+rtm,+fma,-avx512vp2intersect,-avxneconvert,+popcnt,-prefetchi,+avx2,-uintr,-ptwrite,+fxsr,-pconfig,-avx
512er,+rdseed,+pku,-rdpid,-avx512vbmi,-avx512vbmi2,+sse3,+xsaves,-amx-tile,-fma4,+invpcid,-cmpccxadd,-prefetchwt1,+ssse3,+pclmul,+clflushopt,-tsxldtrk,+crc32,+rdrnd,+sse2,-kl,-clzero,+bmi,-raoint,+xsavec,-serialize,-avxvnniint8,+sse,-rdpru,-tbm,-avx512bf16,-waitpkg,-amx-fp16,-avx512ifma,-vaes,+f16c,+sahf,+xsave,+avx512bw,-amx-int8,-vpclmulqdq,-sgx,-avx512fp16,-gfni,-amx-bf16,+bmi2,-movdir64b,+avx512vl,-xop,+prfchw,+cx16,-enqcmd,+64bit,-amx-complex,-avx512pf,-lwp,-avx512vpopcntdq,+avx512dq,+mmx,-avxifma,+avx512vnni,+avx,+cmov,-hreset,+sse4.1,+movbe,+adx,+clwb,-widekl,-movdiri,+cx8,-shstk,-avx512bitalg,-wbnoinvd,-avx512f" }
@RKSimon @alexey-bataev I went through the SLPVectorizer debug output and I think I found the reason why we vectorize. It looks like we over-estimate the cost of the scalar loads as well, assuming we generate 8 loads whereas we in fact only generate 2 loads.
Consider same reduced example as above, here is the main snippet:
%phi = phi i64 [ %add44, %bb11 ], [ 16, %bb ]
%add = add nsw i64 %phi, -3
%getelementptr12 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add
%load13 = load i32, ptr addrspace(1) %getelementptr12, align 4, !tbaa !14, !noundef !3
%add14 = add nsw i64 %phi, -8
%getelementptr15 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add14
%load16 = load i32, ptr addrspace(1) %getelementptr15, align 4, !tbaa !14, !noundef !3
%xor = xor i32 %load16, %load13
%add17 = add nsw i64 %phi, -14
%getelementptr18 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add17
%load19 = load i32, ptr addrspace(1) %getelementptr18, align 4, !tbaa !14, !noundef !3
%xor20 = xor i32 %xor, %load19
%add21 = add nsw i64 %phi, -16
%getelementptr22 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add21
%load23 = load i32, ptr addrspace(1) %getelementptr22, align 4, !tbaa !14, !noundef !3
%xor24 = xor i32 %xor20, %load23
%call = call i32 @llvm.fshl.i32(i32 %xor24, i32 %xor24, i32 1) #2
%getelementptr25 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %phi
store i32 %call, ptr addrspace(1) %getelementptr25, align 4, !tbaa !14
%add26 = add nuw nsw i64 %phi, 1
%add27 = add nsw i64 %phi, -2
%getelementptr28 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add27
%load29 = load i32, ptr addrspace(1) %getelementptr28, align 4, !tbaa !14, !noundef !3
%add30 = add nsw i64 %phi, -7
%getelementptr31 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add30
%load32 = load i32, ptr addrspace(1) %getelementptr31, align 4, !tbaa !14, !noundef !3
%xor33 = xor i32 %load32, %load29
%add34 = add nsw i64 %phi, -13
%getelementptr35 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add34
%load36 = load i32, ptr addrspace(1) %getelementptr35, align 4, !tbaa !14, !noundef !3
%xor37 = xor i32 %xor33, %load36
%add38 = add nsw i64 %phi, -15
%getelementptr39 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add38
%load40 = load i32, ptr addrspace(1) %getelementptr39, align 4, !tbaa !14, !noundef !3
%xor41 = xor i32 %xor37, %load40
%call42 = call i32 @llvm.fshl.i32(i32 %xor41, i32 %xor41, i32 1) #2
%getelementptr43 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add26
store i32 %call42, ptr addrspace(1) %getelementptr43, align 4, !tbaa !14
%add44 = add nuw nsw i64 %phi, 2
%icmp45 = icmp ugt i64 %add26, 78
br label %bb11
This is the assembly for the above scalar code with the accurate mattr:
# %bb.0: # %bb
movq 64(%rdi), %rax
movl $17, %ecx
.p2align 4, 0x90
.LBB0_1: # %bb11
# =>This Inner Loop Header: Depth=1
movl -20(%rax,%rcx,4), %edx
movl -16(%rax,%rcx,4), %esi
xorl (%rax,%rcx,4), %edx
xorl -44(%rax,%rcx,4), %edx
xorl -52(%rax,%rcx,4), %edx
rorxl $31, %edx, %edx
movl %edx, 12(%rax,%rcx,4)
xorl 4(%rax,%rcx,4), %esi
xorl -40(%rax,%rcx,4), %esi
xorl -48(%rax,%rcx,4), %esi
rorxl $31, %esi, %edx
movl %edx, 16(%rax,%rcx,4)
addq $2, %rcx
jmp .LBB0_1
This is the vectorized assembly:
# %bb.0: # %bb
movq 64(%rdi), %rax
movl $17, %ecx
.p2align 4, 0x90
.LBB0_1: # %bb11
# =>This Inner Loop Header: Depth=1
movq -20(%rax,%rcx,4), %rdx
xorq (%rax,%rcx,4), %rdx
xorq -44(%rax,%rcx,4), %rdx
xorq -52(%rax,%rcx,4), %rdx
vmovq %rdx, %xmm0
vpsrld $31, %xmm0, %xmm1
vpaddd %xmm0, %xmm0, %xmm0
vpor %xmm1, %xmm0, %xmm0
vmovq %xmm0, 12(%rax,%rcx,4)
addq $2, %rcx
jmp .LBB0_1
According to the X86 cost model/SLPVectorizer cost model, we calculate the SLP benefit as:
Scalar: 8 (loads) + 6 (xor) + 2 (store) + 2 (each fshl with cost of 1) = 18 // This fshl cost was corrected by Simon's patch to have a lower cost of 1 per fshl.
Vector: 4 (4 vectorized VF=2 loads) + 3 (3 vectorized VF=2 xor) + 1 (1 VF=2 store) + 4 (1 vectorized fshl cost) = 12 // We will need to correct the vector fshl cost as well (it would be lower for a constant rotate).
SLP Benefit = Vector - Scalar = 12 - 18 = -6
However, in the scalar assembly, we have only 2 loads (not 8). So the correct cost for scalar should be 12. Is this a correct understanding?
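To make that arithmetic concrete, here is a small standalone C++ sketch (plain arithmetic, not SLP vectorizer code) that redoes the comparison with the numbers above, once as the model currently adds them up and once under the assumption that the six folded loads are free:

#include <cstdio>

int main() {
  // Per-instruction costs as the SLP/X86 model currently sums them (see above).
  int ScalarLoads = 8, ScalarXors = 6, ScalarStores = 2, ScalarFshl = 2;
  int VectorLoads = 4, VectorXors = 3, VectorStores = 1, VectorFshl = 4;

  int Scalar = ScalarLoads + ScalarXors + ScalarStores + ScalarFshl; // 18
  int Vector = VectorLoads + VectorXors + VectorStores + VectorFshl; // 12
  std::printf("current model:     vector - scalar = %d\n", Vector - Scalar); // -6

  // Hypothetical adjustment: 6 of the 8 scalar loads are folded into the xors
  // in the emitted assembly, so only the 2 standalone loads are charged.
  int ScalarFolded = 2 + ScalarXors + ScalarStores + ScalarFshl; // 12
  std::printf("folded loads free: vector - scalar = %d\n", Vector - ScalarFolded); // 0
  return 0;
}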
Possibly need to estimate that some of the loads are "free" in scalar code?
It depends on the cost kind - folded loads aren't "free" for rthroughput / latency / codesize (uop count on x86) - just instruction count. SLP uses rthroughput.
Plus we do an awful job of adding rthroughput costs together - we shouldn't just accumulate them together, we really need to account for resource usage.
Plus we do an awful job of adding rthroughput costs together - we shouldn't just accumulate them together, we really need to account for resource usage.
This is what I meant, actually. It's just that this work requires lots of time; accounting for the rthroughput of the "folded" loads is much easier to do.
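As a rough illustration of that direction (a sketch only, with invented types - not the Phabricator patch mentioned later in this thread): when walking the scalar side of the tree, a load feeding an x86 ALU op that can take a memory operand could be discounted, with at most one folded load per user.

#include <cstddef>
#include <vector>

// Toy IR stand-ins purely for illustration; a real implementation would work
// on llvm::Instruction inside TTI / the SLP cost model.
enum Opcode { Load, Xor, Store, Fshl };

struct Inst {
  Opcode Op;
  std::vector<std::size_t> Operands; // indices of the operand instructions
};

// Hypothetical scalar cost walk: an x86 integer ALU op such as xor can fold
// one memory operand, so each xor discounts at most one load feeding it
// instead of that load being charged its full cost. (A real check would also
// require the load to be single-use and legal to fold.)
int scalarCost(const std::vector<Inst> &Insts) {
  std::vector<bool> Folded(Insts.size(), false);
  for (const Inst &I : Insts) {
    if (I.Op != Xor)
      continue;
    for (std::size_t OpIdx : I.Operands)
      if (Insts[OpIdx].Op == Load && !Folded[OpIdx]) {
        Folded[OpIdx] = true; // this load becomes a memory operand of the xor
        break;
      }
  }
  int Cost = 0;
  for (std::size_t I = 0; I != Insts.size(); ++I)
    if (!Folded[I])
      Cost += 1; // flat unit cost per remaining instruction, for illustration
  return Cost;
}

On the 8-load / 6-xor / 2-fshl / 2-store tree above this yields 12 instead of 18, matching the two standalone movl loads in the scalar assembly.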
llvm-mca analysis of the scalar vs vector snippets puts the vector version clearly in the lead: https://llvm.godbolt.org/z/rWPqdG1ve - so we need to look further into why the benchmark isn't reflecting this.
It could be that the above code is an over-simplified version (I did reduce it using bugpoint). I will take a look at the final version and see what is going on. I tried with the exact mattr of the machine (apart from the mcpu); it didn't make a difference in the above mca output.
I've done the following experiment:
However, running it through regular perf on Cascade Lake shows the scalar output is clearly better (a higher throughput of 2.55 IPC for scalar versus 1.63 IPC for vector). This possibly points to the scheduler model used in MCA not being accurate enough?
We do see that the SLP model doesn't account for the folded loads, so that is possibly orthogonal to why MCA doesn't reflect the perf output?
With SLP:
26,785.87 msec task-clock # 1.179 CPUs utilized
32,405 context-switches # 0.001 M/sec
233 cpu-migrations # 0.009 K/sec
324,958 page-faults # 0.012 M/sec
101,476,320,179 cycles # 3.788 GHz
165,888,098,787 instructions # 1.63 insn per cycle
20,465,290,828 branches # 764.033 M/sec
245,105,776 branch-misses # 1.20% of all branches
22.717993889 seconds time elapsed
25.694972000 seconds user
1.077673000 seconds sys
Without SLP:
27,514.70 msec task-clock # 1.209 CPUs utilized
32,624 context-switches # 0.001 M/sec
270 cpu-migrations # 0.010 K/sec
327,263 page-faults # 0.012 M/sec
104,237,391,611 cycles # 3.788 GHz
265,523,567,495 instructions # 2.55 insn per cycle
29,650,594,977 branches # 1077.628 M/sec
345,374,111 branch-misses # 1.16% of all branches
22.766190302 seconds time elapsed
26.432187000 seconds user
1.067867000 seconds sys
@annamthomas Am I looking at these number correctly? The SLP (vector) seems faster than non-SLP (scalar)? Or are you just looking at the IPC?
@RKSimon Sorry, the above numbers should be ignored. Perf is running on a Java VM, which means we need to give it sufficient time to warm up before measuring. I analyzed this in detail along with the original benchmark to figure out why llvm-mca doesn't reflect the degradation. I think the reason is that the reciprocal throughput computed by mca is correct only when there are no loop-carried dependencies. However, this entire sequence (source code: https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/sun/security/provider/SHA2.java#L160) in fact has a loop-carried dependency: we write W[16] on the first iteration and read it on the third iteration. The original loop is unrolled by a factor of 2, which generates the scalar code we see in the description: https://github.com/llvm/llvm-project/issues/63980#issue-1814249108
SLP vectorizer decides to vectorize this (unrolled-by-2) loop and generates bad performance.
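For reference, the recurrence behind the reduced IR above (loads at t-3, t-8, t-14, t-16, a rotate-left by 1, and a store at t) looks roughly like the C++ sketch below; the trip count is approximate and only the dependence structure matters here.

#include <cstdint>

static inline uint32_t rotl1(uint32_t X) { return (X << 1) | (X >> 31); }

// Message-schedule style expansion matching the offsets in the reduced IR.
void expand(uint32_t W[80]) {
  for (int T = 16; T <= 79; ++T) // trip count approximate
    W[T] = rotl1(W[T - 3] ^ W[T - 8] ^ W[T - 14] ^ W[T - 16]);
  // W[T] written in one iteration is read again as the W[T' - 3] operand just
  // three iterations later (T' = T + 3), so iterations cannot all overlap -
  // exactly the independence that llvm-mca's reciprocal-throughput numbers
  // assume when analyzing a single iteration in isolation.
}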
I have used perf normalized to benchmark iterations (instead of per second, as was done above) to give an accurate performance comparison:
Without SLP:
SecureRandomBench.nextBytes thrpt 1457544.609 ± 31492.455 ops/s
SecureRandomBench.nextBytes:CPI thrpt 0.658 clks/insn
SecureRandomBench.nextBytes:IPC thrpt 1.520 insns/clk
SecureRandomBench.nextBytes:L1-dcache-load-misses thrpt 58.094 #/op
SecureRandomBench.nextBytes:L1-dcache-loads thrpt 1137.904 #/op
SecureRandomBench.nextBytes:L1-dcache-stores thrpt 639.830 #/op
SecureRandomBench.nextBytes:L1-icache-load-misses thrpt 135.188 #/op
SecureRandomBench.nextBytes:LLC-load-misses thrpt 0.198 #/op
SecureRandomBench.nextBytes:LLC-loads thrpt 0.596 #/op
SecureRandomBench.nextBytes:LLC-store-misses thrpt 0.067 #/op
SecureRandomBench.nextBytes:LLC-stores thrpt 0.287 #/op
SecureRandomBench.nextBytes:branch-misses thrpt 7.700 #/op
SecureRandomBench.nextBytes:branches thrpt 874.864 #/op
SecureRandomBench.nextBytes:cycles thrpt 2692.120 #/op
SecureRandomBench.nextBytes:dTLB-load-misses thrpt 0.199 #/op
SecureRandomBench.nextBytes:dTLB-loads thrpt 1142.688 #/op
SecureRandomBench.nextBytes:dTLB-store-misses thrpt 0.012 #/op
SecureRandomBench.nextBytes:dTLB-stores thrpt 639.797 #/op
SecureRandomBench.nextBytes:iTLB-load-misses thrpt 4.489 #/op
SecureRandomBench.nextBytes:iTLB-loads thrpt 8.200 #/op
SecureRandomBench.nextBytes:instructions thrpt 4090.985 #/op
With SLP:
SecureRandomBench.nextBytes thrpt 909508.096 ± 982.611 ops/s
SecureRandomBench.nextBytes:CPI thrpt 0.659 clks/insn
SecureRandomBench.nextBytes:IPC thrpt 1.517 insns/clk
SecureRandomBench.nextBytes:L1-dcache-load-misses thrpt 95.428 #/op
SecureRandomBench.nextBytes:L1-dcache-loads thrpt 1830.926 #/op
SecureRandomBench.nextBytes:L1-dcache-stores thrpt 1029.726 #/op
SecureRandomBench.nextBytes:L1-icache-load-misses thrpt 220.363 #/op
SecureRandomBench.nextBytes:LLC-load-misses thrpt 0.296 #/op
SecureRandomBench.nextBytes:LLC-loads thrpt 0.903 #/op
SecureRandomBench.nextBytes:LLC-store-misses thrpt 0.112 #/op
SecureRandomBench.nextBytes:LLC-stores thrpt 0.489 #/op
SecureRandomBench.nextBytes:branch-misses thrpt 12.103 #/op
SecureRandomBench.nextBytes:branches thrpt 1402.117 #/op
SecureRandomBench.nextBytes:cycles thrpt 4329.138 #/op
SecureRandomBench.nextBytes:dTLB-load-misses thrpt 0.422 #/op
SecureRandomBench.nextBytes:dTLB-loads thrpt 1819.267 #/op
SecureRandomBench.nextBytes:dTLB-store-misses thrpt 0.027 #/op
SecureRandomBench.nextBytes:dTLB-stores thrpt 1023.054 #/op
SecureRandomBench.nextBytes:iTLB-load-misses thrpt 7.023 #/op
SecureRandomBench.nextBytes:iTLB-loads thrpt 13.154 #/op
SecureRandomBench.nextBytes:instructions thrpt 6567.043 #/op
As we can see, SLP has higher cycles/op (4329.138 compared to 2692.120), while IPC is comparable.
This is what I meant, actually. It's just that this work requires lots of time; accounting for the rthroughput of the "folded" loads is much easier to do.
@alexey-bataev Just to bring this discussion back: we continue to see the huge degradation with SLP vectorizing this code. Any pointers on how we can fix this (without hacky code downstream) are greatly appreciated. Should we consider the folded loads in the scalar cost and just state them as "free"? llvm-mca doesn't show an accurate representation of the actual cost, since we have a loop-carried dependency (which MCA doesn't model).
That's not an issue with the SLP vectorizer; it's definitely an issue with the cost model. I think we need to teach TTI about folded loads for some scalar operations. I had a patch for this some time ago somewhere on Phabricator, but it did not land. Plus, it required some extra tuning, I assume.
@annamthomas Let me take another look at this
@annamthomas Are you still seeing the perf regression fixed when setting slp-threshold=2?
At that threshold the vectorised rotate is still there, but we stop replacing 4 i32 loads with a v4i32 load and 4 extracts which might have been another problem - we're probably going to be better off keeping them scalar: https://godbolt.org/z/qdzTYG9Gv
reverse-ping @annamthomas
I'll check and let you know on Monday @RKSimon (just returned from vacation).
@RKSimon Even with -slp-threshold=2, we see the same degradation. I've confirmed in the IR that we no longer replace the 4 i32 loads with a v4i32 load and 4 extracts, so it doesn't seem that's the problem here.
With the SLP vectorizer, a hot loop with 6 xors + 2 fshl gets reduced to 3 xors + 1 fshl. We vectorize with a VF of 2. The SLP cost model gives it a cost of -8.
This is the loop in question:
When we vectorize it, we get:
We see about a 40% degradation on a benchmark that exercises this hot loop. The assembly for this loop shows we use 3 xor instructions instead of vpxor, with the fshl lowered using xmm registers:
While looking at the cost model for X86 arithmetic instructions, I do not see an entry for v2i32 XOR. Should we actually vectorize this loop?
I will attach the IR reproducer; -slp-threshold=2 shows we only vectorize this tree and still see the 40% degradation.