annamthomas opened this issue 1 year ago
Here is the reproducer: https://godbolt.org/z/s1PjYMs5a
@alexey-bataev @RKSimon any pointers here on what the issue might be?
So, I tried printing the cost-model for xor and fshl under different mattr:
opt chk.ll -passes="print<cost-model>" 2>&1 -disable-output -mtriple=x86_64-- -mattr=+avx512f,+avx512bw
Cost Model: Found an estimated cost of 1 for instruction: %V2I32 = xor <2 x i32> undef, undef
Cost Model: Found an estimated cost of 1 for instruction: %call303 = call <2 x i32> @llvm.fshl.v2i32(<2 x i32> %V2I32, <2 x i32> %V2I32, <2 x i32> <i32 1, i32 1>)
Cost Model: Found an estimated cost of 1 for instruction: %V4I32 = xor <4 x i32> undef, undef
Cost Model: Found an estimated cost of 1 for instruction: %call304 = call <4 x i32> @llvm.fshl.v4i32(<4 x i32> %V4I32, <4 x i32> %V4I32, <4 x i32> <i32 1, i32 1, i32 1, i32 1>)
opt chk.ll -passes="print
Cost Model: Found an estimated cost of 1 for instruction: %V2I32 = xor <2 x i32> undef, undef
Cost Model: Found an estimated cost of 4 for instruction: %call303 = call <2 x i32> @llvm.fshl.v2i32(<2 x i32> %V2I32, <2 x i32> %V2I32, <2 x i32> <i32 1, i32 1>)
Cost Model: Found an estimated cost of 1 for instruction: %V4I32 = xor <4 x i32> undef, undef
Cost Model: Found an estimated cost of 4 for instruction: %call304 = call <4 x i32> @llvm.fshl.v4i32(<4 x i32> %V4I32, <4 x i32> %V4I32, <4 x i32> <i32 1, i32 1, i32 1, i32 1>)
Changing to -mattr=+avx2 on the above testcase increases the cost of the vector fshl from 1 to 4. However, the debug output shows an SLP cost of -11! Shouldn't the benefit shrink from -8 to -5 instead?
SLP: Found cost = -11 for VF=2
SLP: Decided to vectorize cost = -11
The v2i32 types should be legalized to v4i32 inside the TTI callbacks. AVX512 does have a cheap v4i32 rotate instruction, which is what those fshl calls should lower to, but on AVX2 they will expand to shifts/or - hence the higher cost.
What I think is going on, though, is the high cost of the scalar rotate instructions - we don't cost constant rotate amounts as cheaper than variable rotates, despite them being notably faster on almost all hardware:
{ ISD::ROTL, MVT::i32, { 2, 3, 1, 3 } },
So we'd need to add lower costs for constant rotate amounts.
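For illustration only, here is a minimal C++ sketch of that idea - the enums, table names and cost values below are invented and are not the actual X86TargetTransformInfo code (which keys its tables off ISD opcodes and MVTs): constant rotate amounts get their own, cheaper table that is consulted before the variable-amount one.

#include <cstddef>
#include <iterator>
#include <optional>

// Toy stand-ins for LLVM's cost-table machinery (invented names and values).
enum OpKind { ROTL, ROTR };
enum TypeKind { I32, V4I32 };

struct CostEntry {
  OpKind Op;
  TypeKind Ty;
  unsigned RThroughput; // the cost kind the SLP vectorizer uses
};

// Hypothetical tables: constant rotate amounts are cheaper than variable ones.
static const CostEntry ConstRotateTbl[] = {{ROTL, I32, 1}, {ROTR, I32, 1}, {ROTL, V4I32, 1}};
static const CostEntry VarRotateTbl[]   = {{ROTL, I32, 2}, {ROTR, I32, 2}, {ROTL, V4I32, 2}};

static std::optional<unsigned> lookup(const CostEntry *Tbl, std::size_t N, OpKind Op, TypeKind Ty) {
  for (std::size_t I = 0; I != N; ++I)
    if (Tbl[I].Op == Op && Tbl[I].Ty == Ty)
      return Tbl[I].RThroughput;
  return std::nullopt;
}

// Try the constant-amount table first and only fall back to the
// variable-amount costs when no entry matches.
unsigned getRotateCost(OpKind Op, TypeKind Ty, bool AmountIsConstant) {
  if (AmountIsConstant)
    if (auto C = lookup(ConstRotateTbl, std::size(ConstRotateTbl), Op, Ty))
      return *C;
  if (auto C = lookup(VarRotateTbl, std::size(VarRotateTbl), Op, Ty))
    return *C;
  return 4; // pessimistic default for anything unmatched
}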
I don't think it is directly related to the SLP vectorizer; it's the X86 cost model. And Simon has probably identified the actual issue.
Yep, thank you. I was looking at the cost of the vector ones and trying to increase those. @RKSimon I tried reducing the cost of ROTL and ROTR for i32 (scalar) to all ones, just as a quick check to see what happens. We did reduce the benefit of SLP vectorization, so it is in the right direction (but we still vectorize):
SLP: Found cost = -6 for VF=2
SLP: Decided to vectorize cost = -6
Earlier it was -8.
For reference, the scalar assembly is:
movl 8(%rcx,%rax,4), %esi
xorl -12(%rcx,%rax,4), %esi
movl -44(%rcx,%rax,4), %edi
xorl -36(%rcx,%rax,4), %edi
xorl %esi, %edi
rorxl $31, %edi, %esi
movl %esi, 20(%rcx,%rax,4)
incq %rax
cmpq $78, %rax
jbe -38
AVX512 does have a cheap v4i32 rotate instruction which is what those fshl should lower to
I tested this with a couple of different mattr settings and I think the cost may not be reflecting the lowering.
cat chk.ll
declare <2 x i32> @llvm.fshl.v2i32(<2 x i32>, <2 x i32>, <2 x i32>)
define <2 x i32> @splatconstant_funnnel_v2i32(<2 x i32> %x) nounwind {
%res = call <2 x i32> @llvm.fshl.v2i32(<2 x i32> %x, <2 x i32> %x, <2 x i32> <i32 1, i32 1>)
ret <2 x i32> %res
}
llc -mtriple=x86_64-unknown-unknown -mattr=+avx512f,+avx512vbmi,+avx512vbmi2 -mcpu=cascadelake chk.ll
splatconstant_funnnel_v2i32: # @splatconstant_funnnel_v2i32
# %bb.0:
vprold $1, %xmm0, %xmm0
retq
True for avx512, and for avx2 with +xop (it lowers to a single instruction: vprotd $1, %xmm0, %xmm0).
However, for the mattr from this repro.ll, we lowered fshl into three vector instructions:
llc -mattr=+avx512cd,-sha,+xsaveopt,-avxvnni,-mwaitx,+sse4.2,-cldemote,+avx512f,+lzcnt,+fsgsbase,+aes,-sse4a,+rtm,+fma,-avx512vp2intersect,-avxneconvert,+popcnt,-prefetchi,+avx2,-uintr,-ptwrite,+fxsr,-pconfig,-avx512er,+rdseed,+pku,-rdpid,-avx512vbmi,-avx512vbmi2,+sse3,+xsaves,-amx-tile,-fma4,+invpcid,-cmpccxadd,-prefetchwt1,+ssse3,+pclmul,+clflushopt,-tsxldtrk,+crc32,+rdrnd,+sse2,-kl,-clzero,+bmi,-raoint,+xsavec,-serialize,-avxvnniint8,+sse,-rdpru,-tbm,-avx512bf16,-waitpkg,-amx-fp16,-avx512ifma,-vaes,+f16c,+sahf,+xsave,+avx512bw,-amx-int8,-vpclmulqdq,-sgx,-avx512fp16,-gfni,-amx-bf16,+bmi2,-movdir64b,+avx512vl,-xop,+prfchw,+cx16,-enqcmd,+64bit,-amx-complex,-avx512pf,-lwp,-avx512vpopcntdq,+avx512dq,+mmx,-avxifma,+avx512vnni,+avx,+cmov,-hreset,+sse4.1,+movbe,+adx,+clwb,-widekl,-movdiri,+cx8,-shstk,-avx512bitalg,-wbnoinvd,-avx512f -mcpu=cascadelake -mtriple=x86_64-unknown-linux-gnu chk.ll
splatconstant_funnnel_v2i32: # @splatconstant_funnnel_v2i32
# %bb.0:
vpsrlvd .LCPI0_0(%rip), %xmm0, %xmm1
vpsllvd .LCPI0_1(%rip), %xmm0, %xmm0
vpor %xmm1, %xmm0, %xmm0
retq
So there are a couple of problems to address:
Thanks @RKSimon for taking this up! I instrumented the SLPVectorizer cost model and confirmed that the fshl costs correctly reflect what we get in the lowered code.
Thank you for landing the above change @RKSimon. However, we don't see any difference in the performance of the internal benchmark. While we do generate better vector code (in some cases, as seen in the test changes for the patch), it is still not as good as the scalar code. It looks like what needs correction is the X86 cost model. However, even with the lower costs for constant rotate amounts, we still SLP vectorize this code.
Also, for the mattr where we see the performance degradation (we see it across Skylake and Cascade Lake), we haven't reduced the number of vector instructions (I haven't checked the throughput of the changed instructions).
Original code generated was:
llc -mattr=+avx512cd,-sha,+xsaveopt,-avxvnni,-mwaitx,+sse4.2,-cldemote,+avx512f,+lzcnt,+fsgsbase,+aes,-sse4a,+rtm,+fma,-avx512vp2intersect,-avxneconvert,+popcnt,-prefetchi,+avx2,-uintr,-ptwrite,+fxsr,-pconfig,-avx512er,+rdseed,+pku,-rdpid,-avx512vbmi,-avx512vbmi2,+sse3,+xsaves,-amx-tile,-fma4,+invpcid,-cmpccxadd,-prefetchwt1,+ssse3,+pclmul,+clflushopt,-tsxldtrk,+crc32,+rdrnd,+sse2,-kl,-clzero,+bmi,-raoint,+xsavec,-serialize,-avxvnniint8,+sse,-rdpru,-tbm,-avx512bf16,-waitpkg,-amx-fp16,-avx512ifma,-vaes,+f16c,+sahf,+xsave,+avx512bw,-amx-int8,-vpclmulqdq,-sgx,-avx512fp16,-gfni,-amx-bf16,+bmi2,-movdir64b,+avx512vl,-xop,+prfchw,+cx16,-enqcmd,+64bit,-amx-complex,-avx512pf,-lwp,-avx512vpopcntdq,+avx512dq,+mmx,-avxifma,+avx512vnni,+avx,+cmov,-hreset,+sse4.1,+movbe,+adx,+clwb,-widekl,-movdiri,+cx8,-shstk,-avx512bitalg,-wbnoinvd,-avx512f -mcpu=cascadelake -mtriple=x86_64-unknown-linux-gnu chk.ll
splatconstant_funnnel_v2i32: # @splatconstant_funnnel_v2i32
# %bb.0:
vpsrlvd .LCPI0_0(%rip), %xmm0, %xmm1
vpsllvd .LCPI0_1(%rip), %xmm0, %xmm0
vpor %xmm1, %xmm0, %xmm0
retq
With the above improvement, we have:
vpsrld $31, %xmm0, %xmm1
vpaddd %xmm0, %xmm0, %xmm0
vpor %xmm1, %xmm0, %xmm0
retq
For chk.ll:
declare <2 x i32> @llvm.fshl.v2i32(<2 x i32>, <2 x i32>, <2 x i32>)
define <2 x i32> @splatconstant_funnnel_v2i32(<2 x i32> %x) nounwind {
%res = call <2 x i32> @llvm.fshl.v2i32(<2 x i32> %x, <2 x i32> %x, <2 x i32> <i32 1, i32 1>)
ret <2 x i32> %res
}
Sorry, still working on this - as you said, the costs still need attention (they need some refactoring, while the backend fixes were pretty trivial).
No worries and thank you! I just wanted to clarify my understanding that to completely avoid the perf degradation here, we should not SLP vectorize this code (by fixing the X86 cost model).
I reduced the testcase with bugpoint to show the minimal code needed to SLP-vectorize that same loop:
cat input.ll:
; ModuleID = 'reduced.ll'
source_filename = "reduced.ll"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128-ni:1-p2:32:8:8:32-ni:2"
target triple = "x86_64-unknown-linux-gnu"
define void @blam(ptr addrspace(1) %arg, ptr addrspace(1) %arg1, i32 %arg2, ptr addrspace(1) %arg3) gc "statepoint-example" personality ptr @wibble !prof !2 {
bb:
%load = load i64, ptr addrspace(256) inttoptr (i64 64 to ptr addrspace(256)), align 64, !invariant.load !3
%getelementptr = getelementptr inbounds i8, ptr addrspace(1) %arg, i64 64
%load4 = load ptr addrspace(1), ptr addrspace(1) %getelementptr, align 8, !tbaa !4, !dereferenceable_or_null !11, !align !12, !noundef !3
%icmp = icmp eq ptr addrspace(1) %load4, null
%load5 = load ptr addrspace(1), ptr addrspace(1) %getelementptr, align 8, !dereferenceable_or_null !11, !align !12, !noundef !3
%icmp6 = icmp eq ptr addrspace(1) %load5, null
%getelementptr7 = getelementptr inbounds i8, ptr addrspace(1) %load5, i64 16
%getelementptr8 = getelementptr inbounds i8, ptr addrspace(1) %load5, i64 8
%load9 = load i32, ptr addrspace(1) %getelementptr8, align 8, !range !13, !invariant.load !3, !noundef !3
%icmp10 = icmp ugt i32 %load9, 79
br label %bb11
bb11: ; preds = %bb11, %bb
%phi = phi i64 [ %add44, %bb11 ], [ 16, %bb ]
%add = add nsw i64 %phi, -3
%getelementptr12 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add
%load13 = load i32, ptr addrspace(1) %getelementptr12, align 4, !tbaa !14, !noundef !3
%add14 = add nsw i64 %phi, -8
%getelementptr15 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add14
%load16 = load i32, ptr addrspace(1) %getelementptr15, align 4, !tbaa !14, !noundef !3
%xor = xor i32 %load16, %load13
%add17 = add nsw i64 %phi, -14
%getelementptr18 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add17
%load19 = load i32, ptr addrspace(1) %getelementptr18, align 4, !tbaa !14, !noundef !3
%xor20 = xor i32 %xor, %load19
%add21 = add nsw i64 %phi, -16
%getelementptr22 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add21
%load23 = load i32, ptr addrspace(1) %getelementptr22, align 4, !tbaa !14, !noundef !3
%xor24 = xor i32 %xor20, %load23
%call = call i32 @llvm.fshl.i32(i32 %xor24, i32 %xor24, i32 1) #2
%getelementptr25 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %phi
store i32 %call, ptr addrspace(1) %getelementptr25, align 4, !tbaa !14
%add26 = add nuw nsw i64 %phi, 1
%add27 = add nsw i64 %phi, -2
%getelementptr28 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add27
%load29 = load i32, ptr addrspace(1) %getelementptr28, align 4, !tbaa !14, !noundef !3
%add30 = add nsw i64 %phi, -7
%getelementptr31 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add30
%load32 = load i32, ptr addrspace(1) %getelementptr31, align 4, !tbaa !14, !noundef !3
%xor33 = xor i32 %load32, %load29
%add34 = add nsw i64 %phi, -13
%getelementptr35 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add34
%load36 = load i32, ptr addrspace(1) %getelementptr35, align 4, !tbaa !14, !noundef !3
%xor37 = xor i32 %xor33, %load36
%add38 = add nsw i64 %phi, -15
%getelementptr39 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add38
%load40 = load i32, ptr addrspace(1) %getelementptr39, align 4, !tbaa !14, !noundef !3
%xor41 = xor i32 %xor37, %load40
%call42 = call i32 @llvm.fshl.i32(i32 %xor41, i32 %xor41, i32 1) #2
%getelementptr43 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add26
store i32 %call42, ptr addrspace(1) %getelementptr43, align 4, !tbaa !14
%add44 = add nuw nsw i64 %phi, 2
%icmp45 = icmp ugt i64 %add26, 78
br label %bb11
}
declare ptr @wibble()
declare void @wobble()
; Function Attrs: nocallback nofree nosync nounwind speculatable willreturn memory(none)
declare i32 @llvm.fshl.i32(i32, i32, i32) #0
; Function Attrs: nocallback nofree nosync nounwind willreturn memory(read)
declare i64 @llvm.read_register.i64(metadata) #1
attributes #0 = { nocallback nofree nosync nounwind speculatable willreturn memory(none) }
attributes #1 = { nocallback nofree nosync nounwind willreturn memory(read) }
attributes #2 = { "inline-remark"="cost=never, reason=unavailable definition" }
!llvm.module.flags = !{!0, !1}
!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}
!2 = !{!"function_entry_count", i64 32768}
!3 = !{}
!4 = !{!5, !9, i64 64}
!5 = !{!"sun/security/provider/SHA.instance", !6, i64 0, !9, i64 64, !9, i64 72}
!6 = !{!"sun/security/provider/DigestBase.instance", !7, i64 0, !9, i64 16, !9, i64 24, !9, i64 28, !9, i64 32, !9, i64 40, !9, i64 48, !9, i64 56}
!7 = !{!"java/security/MessageDigestSpi.instance", !8, i64 0, !9, i64 8}
!8 = !{!"java/lang/Object.instance", !9, i64 0}
!9 = !{!"tbaa_local_fields", !10, i64 0}
!10 = !{!"dolphin-tbaa-access-type"}
!11 = !{i64 16}
!12 = !{i64 8}
!13 = !{i32 0, i32 2147483646}
!14 = !{!15, !15, i64 0}
!15 = !{!"int.array", !10}
Run this through opt -passes=slp-vectorizer input.ll. The cost is -8.
With the accurate target-features supplied for this function from the machine where we saw the degradation, the cost is -8 as well. This may not mean anything, though, because when I tried reducing the target-features blindly (trying to simulate bugpoint), the SLP benefit increased. This means something in the target-features matters, and it's very hard to find out what exactly (so that we can update the X86 cost model for these target-features).
The target-features attribute for the function is:
attributes #0 = { "target-features"="+avx512cd,-sha,+xsaveopt,-avxvnni,-mwaitx,+sse4.2,-cldemote,+avx512f,+lzcnt,+fsgsbase,+aes,-sse4a,+rtm,+fma,-avx512vp2intersect,-avxneconvert,+popcnt,-prefetchi,+avx2,-uintr,-ptwrite,+fxsr,-pconfig,-avx512er,+rdseed,+pku,-rdpid,-avx512vbmi,-avx512vbmi2,+sse3,+xsaves,-amx-tile,-fma4,+invpcid,-cmpccxadd,-prefetchwt1,+ssse3,+pclmul,+clflushopt,-tsxldtrk,+crc32,+rdrnd,+sse2,-kl,-clzero,+bmi,-raoint,+xsavec,-serialize,-avxvnniint8,+sse,-rdpru,-tbm,-avx512bf16,-waitpkg,-amx-fp16,-avx512ifma,-vaes,+f16c,+sahf,+xsave,+avx512bw,-amx-int8,-vpclmulqdq,-sgx,-avx512fp16,-gfni,-amx-bf16,+bmi2,-movdir64b,+avx512vl,-xop,+prfchw,+cx16,-enqcmd,+64bit,-amx-complex,-avx512pf,-lwp,-avx512vpopcntdq,+avx512dq,+mmx,-avxifma,+avx512vnni,+avx,+cmov,-hreset,+sse4.1,+movbe,+adx,+clwb,-widekl,-movdiri,+cx8,-shstk,-avx512bitalg,-wbnoinvd,-avx512f,+avx512cd,-sha,+xsaveopt,-avxvnni,-mwaitx,+sse4.2,-cldemote,+avx512f,+lzcnt,+fsgsbase,+aes,-sse4a,+rtm,+fma,-avx512vp2intersect,-avxneconvert,+popcnt,-prefetchi,+avx2,-uintr,-ptwrite,+fxsr,-pconfig,-avx512er,+rdseed,+pku,-rdpid,-avx512vbmi,-avx512vbmi2,+sse3,+xsaves,-amx-tile,-fma4,+invpcid,-cmpccxadd,-prefetchwt1,+ssse3,+pclmul,+clflushopt,-tsxldtrk,+crc32,+rdrnd,+sse2,-kl,-clzero,+bmi,-raoint,+xsavec,-serialize,-avxvnniint8,+sse,-rdpru,-tbm,-avx512bf16,-waitpkg,-amx-fp16,-avx512ifma,-vaes,+f16c,+sahf,+xsave,+avx512bw,-amx-int8,-vpclmulqdq,-sgx,-avx512fp16,-gfni,-amx-bf16,+bmi2,-movdir64b,+avx512vl,-xop,+prfchw,+cx16,-enqcmd,+64bit,-amx-complex,-avx512pf,-lwp,-avx512vpopcntdq,+avx512dq,+mmx,-avxifma,+avx512vnni,+avx,+cmov,-hreset,+sse4.1,+movbe,+adx,+clwb,-widekl,-movdiri,+cx8,-shstk,-avx512bitalg,-wbnoinvd,-avx512f,+avx512cd,-sha,+xsaveopt,-avxvnni,-mwaitx,+sse4.2,-cldemote,+avx512f,+lzcnt,+fsgsbase,+aes,-sse4a,+rtm,+fma,-avx512vp2intersect,-avxneconvert,+popcnt,-prefetchi,+avx2,-uintr,-ptwrite,+fxsr,-pconfig,-avx512er,+rdseed,+pku,-rdpid,-avx512vbmi,-avx512vbmi2,+sse3,+xsaves,-amx-tile,-fma4,+invpcid,-cmpccxadd,-prefetchwt1,+ssse3,+pclmul,+clflushopt,-tsxldtrk,+crc32,+rdrnd,+sse2,-kl,-clzero,+bmi,-raoint,+xsavec,-serialize,-avxvnniint8,+sse,-rdpru,-tbm,-avx512bf16,-waitpkg,-amx-fp16,-avx512ifma,-vaes,+f16c,+sahf,+xsave,+avx512bw,-amx-int8,-vpclmulqdq,-sgx,-avx512fp16,-gfni,-amx-bf16,+bmi2,-movdir64b,+avx512vl,-xop,+prfchw,+cx16,-enqcmd,+64bit,-amx-complex,-avx512pf,-lwp,-avx512vpopcntdq,+avx512dq,+mmx,-avxifma,+avx512vnni,+avx,+cmov,-hreset,+sse4.1,+movbe,+adx,+clwb,-widekl,-movdiri,+cx8,-shstk,-avx512bitalg,-wbnoinvd,-avx512f,+avx512cd,-sha,+xsaveopt,-avxvnni,-mwaitx,+sse4.2,-cldemote,+avx512f,+lzcnt,+fsgsbase,+aes,-sse4a,+rtm,+fma,-avx512vp2intersect,-avxneconvert,+popcnt,-prefetchi,+avx2,-uintr,-ptwrite,+fxsr,-pconfig,-avx512er,+rdseed,+pku,-rdpid,-avx512vbmi,-avx512vbmi2,+sse3,+xsaves,-amx-tile,-fma4,+invpcid,-cmpccxadd,-prefetchwt1,+ssse3,+pclmul,+clflushopt,-tsxldtrk,+crc32,+rdrnd,+sse2,-kl,-clzero,+bmi,-raoint,+xsavec,-serialize,-avxvnniint8,+sse,-rdpru,-tbm,-avx512bf16,-waitpkg,-amx-fp16,-avx512ifma,-vaes,+f16c,+sahf,+xsave,+avx512bw,-amx-int8,-vpclmulqdq,-sgx,-avx512fp16,-gfni,-amx-bf16,+bmi2,-movdir64b,+avx512vl,-xop,+prfchw,+cx16,-enqcmd,+64bit,-amx-complex,-avx512pf,-lwp,-avx512vpopcntdq,+avx512dq,+mmx,-avxifma,+avx512vnni,+avx,+cmov,-hreset,+sse4.1,+movbe,+adx,+clwb,-widekl,-movdiri,+cx8,-shstk,-avx512bitalg,-wbnoinvd,-avx512f,+avx512cd,-sha,+xsaveopt,-avxvnni,-mwaitx,+sse4.2,-cldemote,+avx512f,+lzcnt,+fsgsbase,+aes,-sse4a,+rtm,+fma,-avx512vp2intersect,-avxneconvert,+popcnt,-prefetchi,+avx2,-uintr,-ptwrite,+fxsr,-pconfig,-avx
512er,+rdseed,+pku,-rdpid,-avx512vbmi,-avx512vbmi2,+sse3,+xsaves,-amx-tile,-fma4,+invpcid,-cmpccxadd,-prefetchwt1,+ssse3,+pclmul,+clflushopt,-tsxldtrk,+crc32,+rdrnd,+sse2,-kl,-clzero,+bmi,-raoint,+xsavec,-serialize,-avxvnniint8,+sse,-rdpru,-tbm,-avx512bf16,-waitpkg,-amx-fp16,-avx512ifma,-vaes,+f16c,+sahf,+xsave,+avx512bw,-amx-int8,-vpclmulqdq,-sgx,-avx512fp16,-gfni,-amx-bf16,+bmi2,-movdir64b,+avx512vl,-xop,+prfchw,+cx16,-enqcmd,+64bit,-amx-complex,-avx512pf,-lwp,-avx512vpopcntdq,+avx512dq,+mmx,-avxifma,+avx512vnni,+avx,+cmov,-hreset,+sse4.1,+movbe,+adx,+clwb,-widekl,-movdiri,+cx8,-shstk,-avx512bitalg,-wbnoinvd,-avx512f" }
@RKSimon @alexey-bataev I went through the SLPVectorizer debug output and I think I found the reason why we vectorize. It looks like we over-estimate the cost of the scalar loads as well, assuming we generate 8 loads whereas we in fact only generate 2 loads.
Consider same reduced example as above, here is the main snippet:
%phi = phi i64 [ %add44, %bb11 ], [ 16, %bb ]
%add = add nsw i64 %phi, -3
%getelementptr12 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add
%load13 = load i32, ptr addrspace(1) %getelementptr12, align 4, !tbaa !14, !noundef !3
%add14 = add nsw i64 %phi, -8
%getelementptr15 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add14
%load16 = load i32, ptr addrspace(1) %getelementptr15, align 4, !tbaa !14, !noundef !3
%xor = xor i32 %load16, %load13
%add17 = add nsw i64 %phi, -14
%getelementptr18 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add17
%load19 = load i32, ptr addrspace(1) %getelementptr18, align 4, !tbaa !14, !noundef !3
%xor20 = xor i32 %xor, %load19
%add21 = add nsw i64 %phi, -16
%getelementptr22 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add21
%load23 = load i32, ptr addrspace(1) %getelementptr22, align 4, !tbaa !14, !noundef !3
%xor24 = xor i32 %xor20, %load23
%call = call i32 @llvm.fshl.i32(i32 %xor24, i32 %xor24, i32 1) #2
%getelementptr25 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %phi
store i32 %call, ptr addrspace(1) %getelementptr25, align 4, !tbaa !14
%add26 = add nuw nsw i64 %phi, 1
%add27 = add nsw i64 %phi, -2
%getelementptr28 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add27
%load29 = load i32, ptr addrspace(1) %getelementptr28, align 4, !tbaa !14, !noundef !3
%add30 = add nsw i64 %phi, -7
%getelementptr31 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add30
%load32 = load i32, ptr addrspace(1) %getelementptr31, align 4, !tbaa !14, !noundef !3
%xor33 = xor i32 %load32, %load29
%add34 = add nsw i64 %phi, -13
%getelementptr35 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add34
%load36 = load i32, ptr addrspace(1) %getelementptr35, align 4, !tbaa !14, !noundef !3
%xor37 = xor i32 %xor33, %load36
%add38 = add nsw i64 %phi, -15
%getelementptr39 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add38
%load40 = load i32, ptr addrspace(1) %getelementptr39, align 4, !tbaa !14, !noundef !3
%xor41 = xor i32 %xor37, %load40
%call42 = call i32 @llvm.fshl.i32(i32 %xor41, i32 %xor41, i32 1) #2
%getelementptr43 = getelementptr inbounds i32, ptr addrspace(1) %getelementptr7, i64 %add26
store i32 %call42, ptr addrspace(1) %getelementptr43, align 4, !tbaa !14
%add44 = add nuw nsw i64 %phi, 2
%icmp45 = icmp ugt i64 %add26, 78
br label %bb11
This is the assembly for the above scalar code with the accurate mattr:
# %bb.0: # %bb
movq 64(%rdi), %rax
movl $17, %ecx
.p2align 4, 0x90
.LBB0_1: # %bb11
# =>This Inner Loop Header: Depth=1
movl -20(%rax,%rcx,4), %edx
movl -16(%rax,%rcx,4), %esi
xorl (%rax,%rcx,4), %edx
xorl -44(%rax,%rcx,4), %edx
xorl -52(%rax,%rcx,4), %edx
rorxl $31, %edx, %edx
movl %edx, 12(%rax,%rcx,4)
xorl 4(%rax,%rcx,4), %esi
xorl -40(%rax,%rcx,4), %esi
xorl -48(%rax,%rcx,4), %esi
rorxl $31, %esi, %edx
movl %edx, 16(%rax,%rcx,4)
addq $2, %rcx
jmp .LBB0_1
This is the vectorized assembly:
# %bb.0: # %bb
movq 64(%rdi), %rax
movl $17, %ecx
.p2align 4, 0x90
.LBB0_1: # %bb11
# =>This Inner Loop Header: Depth=1
movq -20(%rax,%rcx,4), %rdx
xorq (%rax,%rcx,4), %rdx
xorq -44(%rax,%rcx,4), %rdx
xorq -52(%rax,%rcx,4), %rdx
vmovq %rdx, %xmm0
vpsrld $31, %xmm0, %xmm1
vpaddd %xmm0, %xmm0, %xmm0
vpor %xmm1, %xmm0, %xmm0
vmovq %xmm0, 12(%rax,%rcx,4)
addq $2, %rcx
jmp .LBB0_1
According to the X86 cost model/SLPVectorizer cost model, we calculate the SLP benefit as:
Scalar: 8 (loads) + 6 (xor) + 2 (store) + 2 (each fshl with cost of 1) = 18 // This fshl cost was corrected by Simon's patch to have a lower cost of 1 per fshl.
Vector: 4 (4 vectorized VF=2 loads) + 3 (3 vectorized VF=2 xor) + 1 (1 VF=2 store) + 4 (1 vectorized fshl cost) = 12 // We will need to correct the vector fshl cost as well (it would be lower for a constant rotate).
SLP Benefit = Vector - Scalar = 12 - 18 = -6
However, in the scalar assembly, we have only 2 loads (not 8). So the correct cost for scalar should be 12. Is this a correct understanding?
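To make that arithmetic concrete, here is a small standalone C++ sketch (plain arithmetic, not SLP vectorizer code) that redoes the comparison with the numbers above, once as the model currently adds them up and once under the assumption that the six folded loads are free:

#include <cstdio>

int main() {
  // Per-instruction costs as the SLP/X86 model currently sums them (see above).
  int ScalarLoads = 8, ScalarXors = 6, ScalarStores = 2, ScalarFshl = 2;
  int VectorLoads = 4, VectorXors = 3, VectorStores = 1, VectorFshl = 4;

  int Scalar = ScalarLoads + ScalarXors + ScalarStores + ScalarFshl; // 18
  int Vector = VectorLoads + VectorXors + VectorStores + VectorFshl; // 12
  std::printf("current model:     vector - scalar = %d\n", Vector - Scalar); // -6

  // Hypothetical adjustment: 6 of the 8 scalar loads are folded into the xors
  // in the emitted assembly, so only the 2 standalone loads are charged.
  int ScalarFolded = 2 + ScalarXors + ScalarStores + ScalarFshl; // 12
  std::printf("folded loads free: vector - scalar = %d\n", Vector - ScalarFolded); // 0
  return 0;
}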
Possibly need to estimate that some of the loads are "free" in scalar code?
It depends on the cost kind - folded loads aren't "free" for rthroughput / latency / codesize (uop count on x86) - just instruction count. SLP uses rthroughput.
Plus we do an awful job of adding rthroughput costs together - we shouldn't just accumulate them together, we really need to account for resource usage.
Plus we do an awful job of adding rthroughput costs together - we shouldn't just accumulate them together, we really need to account for resource usage.
This is what I meant, actually. It's just that this work requires lots of time; accounting for the rthroughput of the "folded" loads is much easier to do.
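As a rough illustration of that direction (a sketch only, with invented types - not the Phabricator patch mentioned later in this thread): when walking the scalar side of the tree, a load feeding an x86 ALU op that can take a memory operand could be discounted, with at most one folded load per user.

#include <cstddef>
#include <vector>

// Toy IR stand-ins purely for illustration; a real implementation would work
// on llvm::Instruction inside TTI / the SLP cost model.
enum Opcode { Load, Xor, Store, Fshl };

struct Inst {
  Opcode Op;
  std::vector<std::size_t> Operands; // indices of the operand instructions
};

// Hypothetical scalar cost walk: an x86 integer ALU op such as xor can fold
// one memory operand, so each xor discounts at most one load feeding it
// instead of that load being charged its full cost. (A real check would also
// require the load to be single-use and legal to fold.)
int scalarCost(const std::vector<Inst> &Insts) {
  std::vector<bool> Folded(Insts.size(), false);
  for (const Inst &I : Insts) {
    if (I.Op != Xor)
      continue;
    for (std::size_t OpIdx : I.Operands)
      if (Insts[OpIdx].Op == Load && !Folded[OpIdx]) {
        Folded[OpIdx] = true; // this load becomes a memory operand of the xor
        break;
      }
  }
  int Cost = 0;
  for (std::size_t I = 0; I != Insts.size(); ++I)
    if (!Folded[I])
      Cost += 1; // flat unit cost per remaining instruction, for illustration
  return Cost;
}

On the 8-load / 6-xor / 2-fshl / 2-store tree above this yields 12 instead of 18, matching the two standalone movl loads in the scalar assembly.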
llvm-mca analysis of the scalar vs vector snippets puts the vector version clearly in the lead: https://llvm.godbolt.org/z/rWPqdG1ve - so we need to look further into why the benchmark isn't reflecting this.
It could be that the above code is an over-simplified version (I did reduce it using bugpoint). I will take a look at the final version and see what is going on. I tried with the exact mattr of the machine (apart from the mcpu); it didn't make a difference in the above mca output.
I've done the following experiment:
However, running it through regular perf on Cascade Lake shows the scalar output is clearly better (a higher throughput of 2.55 IPC for scalar versus 1.63 IPC for vector). This possibly points to the scheduler model used in MCA not being accurate enough?
We do see that the SLP model doesn't account for the folded loads, so that is possibly orthogonal to why MCA doesn't reflect the perf output?
With SLP:
26,785.87 msec task-clock # 1.179 CPUs utilized
32,405 context-switches # 0.001 M/sec
233 cpu-migrations # 0.009 K/sec
324,958 page-faults # 0.012 M/sec
101,476,320,179 cycles # 3.788 GHz
165,888,098,787 instructions # 1.63 insn per cycle
20,465,290,828 branches # 764.033 M/sec
245,105,776 branch-misses # 1.20% of all branches
22.717993889 seconds time elapsed
25.694972000 seconds user
1.077673000 seconds sys
Without SLP:
27,514.70 msec task-clock # 1.209 CPUs utilized
32,624 context-switches # 0.001 M/sec
270 cpu-migrations # 0.010 K/sec
327,263 page-faults # 0.012 M/sec
104,237,391,611 cycles # 3.788 GHz
265,523,567,495 instructions # 2.55 insn per cycle
29,650,594,977 branches # 1077.628 M/sec
345,374,111 branch-misses # 1.16% of all branches
22.766190302 seconds time elapsed
26.432187000 seconds user
1.067867000 seconds sys
@annamthomas Am I looking at these number correctly? The SLP (vector) seems faster than non-SLP (scalar)? Or are you just looking at the IPC?
@RKSimon Sorry, the above numbers should be ignored. Perf is running on a Java VM, which means we need to give it sufficient time to warm up before measuring. I analyzed this in detail along with the original benchmark to figure out why llvm-mca doesn't reflect the degradation. I think the reason is that the reciprocal throughput computed by mca is correct only when there are no loop-carried dependencies. However, this entire sequence (source code: https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/sun/security/provider/SHA2.java#L160) in fact has a loop-carried dependency: we write W[16] on the first iteration and read it on the third iteration. The original loop is unrolled by a factor of 2, which generates the scalar code we see in the description: https://github.com/llvm/llvm-project/issues/63980#issue-1814249108
SLP vectorizer decides to vectorize this (unrolled-by-2) loop and generates bad performance.
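For reference, the recurrence behind the reduced IR above (loads at t-3, t-8, t-14, t-16, a rotate-left by 1, and a store at t) looks roughly like the C++ sketch below; the trip count is approximate and only the dependence structure matters here.

#include <cstdint>

static inline uint32_t rotl1(uint32_t X) { return (X << 1) | (X >> 31); }

// Message-schedule style expansion matching the offsets in the reduced IR.
void expand(uint32_t W[80]) {
  for (int T = 16; T <= 79; ++T) // trip count approximate
    W[T] = rotl1(W[T - 3] ^ W[T - 8] ^ W[T - 14] ^ W[T - 16]);
  // W[T] written in one iteration is read again as the W[T' - 3] operand just
  // three iterations later (T' = T + 3), so iterations cannot all overlap -
  // exactly the independence that llvm-mca's reciprocal-throughput numbers
  // assume when analyzing a single iteration in isolation.
}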
I have used perf normalized to benchmark iterations (instead of per second, as was done above) to give an accurate performance comparison:
Without SLP:
SecureRandomBench.nextBytes thrpt 1457544.609 ± 31492.455 ops/s
SecureRandomBench.nextBytes:CPI thrpt 0.658 clks/insn
SecureRandomBench.nextBytes:IPC thrpt 1.520 insns/clk
SecureRandomBench.nextBytes:L1-dcache-load-misses thrpt 58.094 #/op
SecureRandomBench.nextBytes:L1-dcache-loads thrpt 1137.904 #/op
SecureRandomBench.nextBytes:L1-dcache-stores thrpt 639.830 #/op
SecureRandomBench.nextBytes:L1-icache-load-misses thrpt 135.188 #/op
SecureRandomBench.nextBytes:LLC-load-misses thrpt 0.198 #/op
SecureRandomBench.nextBytes:LLC-loads thrpt 0.596 #/op
SecureRandomBench.nextBytes:LLC-store-misses thrpt 0.067 #/op
SecureRandomBench.nextBytes:LLC-stores thrpt 0.287 #/op
SecureRandomBench.nextBytes:branch-misses thrpt 7.700 #/op
SecureRandomBench.nextBytes:branches thrpt 874.864 #/op
SecureRandomBench.nextBytes:cycles thrpt 2692.120 #/op
SecureRandomBench.nextBytes:dTLB-load-misses thrpt 0.199 #/op
SecureRandomBench.nextBytes:dTLB-loads thrpt 1142.688 #/op
SecureRandomBench.nextBytes:dTLB-store-misses thrpt 0.012 #/op
SecureRandomBench.nextBytes:dTLB-stores thrpt 639.797 #/op
SecureRandomBench.nextBytes:iTLB-load-misses thrpt 4.489 #/op
SecureRandomBench.nextBytes:iTLB-loads thrpt 8.200 #/op
SecureRandomBench.nextBytes:instructions thrpt 4090.985 #/op
With SLP:
SecureRandomBench.nextBytes thrpt 909508.096 ± 982.611 ops/s
SecureRandomBench.nextBytes:CPI thrpt 0.659 clks/insn
SecureRandomBench.nextBytes:IPC thrpt 1.517 insns/clk
SecureRandomBench.nextBytes:L1-dcache-load-misses thrpt 95.428 #/op
SecureRandomBench.nextBytes:L1-dcache-loads thrpt 1830.926 #/op
SecureRandomBench.nextBytes:L1-dcache-stores thrpt 1029.726 #/op
SecureRandomBench.nextBytes:L1-icache-load-misses thrpt 220.363 #/op
SecureRandomBench.nextBytes:LLC-load-misses thrpt 0.296 #/op
SecureRandomBench.nextBytes:LLC-loads thrpt 0.903 #/op
SecureRandomBench.nextBytes:LLC-store-misses thrpt 0.112 #/op
SecureRandomBench.nextBytes:LLC-stores thrpt 0.489 #/op
SecureRandomBench.nextBytes:branch-misses thrpt 12.103 #/op
SecureRandomBench.nextBytes:branches thrpt 1402.117 #/op
SecureRandomBench.nextBytes:cycles thrpt 4329.138 #/op
SecureRandomBench.nextBytes:dTLB-load-misses thrpt 0.422 #/op
SecureRandomBench.nextBytes:dTLB-loads thrpt 1819.267 #/op
SecureRandomBench.nextBytes:dTLB-store-misses thrpt 0.027 #/op
SecureRandomBench.nextBytes:dTLB-stores thrpt 1023.054 #/op
SecureRandomBench.nextBytes:iTLB-load-misses thrpt 7.023 #/op
SecureRandomBench.nextBytes:iTLB-loads thrpt 13.154 #/op
SecureRandomBench.nextBytes:instructions thrpt 6567.043 #/op
As we can see, SLP has higher cycles/op (4329.138 compared to 2692.120), while IPC is comparable.
This is what I meant, actually. It's just that this work requires lots of time; accounting for the rthroughput of the "folded" loads is much easier to do.
@alexey-bataev Just to bring this discussion back: we continue to see the huge degradation with SLP vectorizing this code. Any pointers on how we can fix this (without hacky code downstream) are greatly appreciated. Should we consider the folded loads in the scalar cost and just state them as "free"? llvm-mca doesn't show an accurate representation of the actual cost, since we have a loop-carried dependency (which MCA doesn't model).
That's not an issue with the SLP vectorizer; it's definitely an issue with the cost model. I think we need to teach TTI about folded loads for some scalar operations. I had a patch for this some time ago somewhere on Phabricator, but it did not land. Plus, it required some extra tuning, I assume.
@annamthomas Let me take another look at this
@annamthomas Are you still seeing the perf regression fixed when setting slp-threshold=2?
At that threshold the vectorised rotate is still there, but we stop replacing 4 i32 loads with a v4i32 load and 4 extracts which might have been another problem - we're probably going to be better off keeping them scalar: https://godbolt.org/z/qdzTYG9Gv
reverse-ping @annamthomas
I'll check and let you know on Monday @RKSimon (just returned from vacation).
@RKSimon Even with -slp-threshold=2, we see the same degradation. I've confirmed in the IR that we no longer replace the 4 i32 loads with a v4i32 load and 4 extracts, so it doesn't seem that's the problem here.
With the SLP vectorizer, a hot loop with 6 xors + 2 fshl gets reduced to 3 xors + 1 fshl. We vectorize with a VF of 2. The SLP cost model gives it a cost of -8.
This is the loop in question:
When we vectorize it, we get:
We see about a 40% degradation on a benchmark that exercises this hot loop. The assembly for this loop shows we use 3 xor instructions instead of vpxor, with the fshl lowered using xmm registers:
While looking at the cost model for X86 arithmetic instructions, I do not see an entry for v2i32 XOR. Should we actually vectorize this loop?
I will attach the IR reproducer; -slp-threshold=2 shows we only vectorize this tree and still see the 40% degradation.