Open vzakhari opened 1 year ago
Can you at least provide the final LLVM IR?
I am working on it...
Also, the ideal reproducer would be the full IR module before that whole final SROA pass.
I understand this, but I am not sure I can share the full IR module for this benchmark. I am attaching an after-all dump for the resid routine. I will try to upload a suitable reproducer if I can. Thank you for looking into this!
@llvm/issue-subscribers-backend-x86
Oh. Wait. Somehow I thought this was pointing at the SROA patch, not the cost-model one. Oops.
So what would be useful is the full-module IR before the SLP vectorizer pass... CC @alexey-bataev
Sorry for the delay, @LebedevRI.
Please try the attached resid.ll.gz with the following compilation command:
clang -cc1 -triple x86_64-unknown-linux-gnu -emit-obj --mrelax-relocations -disable-free -clear-ast-before-backend -main-file-name mgrid.ll -mrelocation-model pic -pic-level 2 -pic-is-pie -mframe-pointer=none -menable-no-infs -menable-no-nans -fapprox-func -funsafe-math-optimizations -fno-signed-zeros -mreassociate -freciprocal-math -fdenormal-fp-math=preserve-sign,preserve-sign -ffp-contract=fast -fno-rounding-math -ffast-math -ffinite-math-only -mconstructor-aliases -funwind-tables=2 -target-cpu haswell -target-feature -avx512pf -target-feature -tsxldtrk -target-feature +cx16 -target-feature +sahf -target-feature -tbm -target-feature -avx512ifma -target-feature -sha -target-feature +crc32 -target-feature -fma4 -target-feature -vpclmulqdq -target-feature -prfchw -target-feature +bmi2 -target-feature -cldemote -target-feature +fsgsbase -target-feature -avx512bf16 -target-feature -amx-tile -target-feature -raoint -target-feature -uintr -target-feature -gfni -target-feature +popcnt -target-feature -ptwrite -target-feature +aes -target-feature -avx512bitalg -target-feature -movdiri -target-feature -widekl -target-feature -xsaves -target-feature -avx512er -target-feature -avxvnni -target-feature -avx512fp16 -target-feature -avx512vnni -target-feature -amx-bf16 -target-feature -avxvnniint8 -target-feature -avx512vpopcntdq -target-feature -pconfig -target-feature -clwb -target-feature -cmpccxadd -target-feature -avx512f -target-feature -xsavec -target-feature -clzero -target-feature -pku -target-feature -amx-fp16 -target-feature +mmx -target-feature -lwp -target-feature -rdpid -target-feature -xop -target-feature -rdseed -target-feature -waitpkg -target-feature -prefetchi -target-feature -kl -target-feature -movdir64b -target-feature -sse4a -target-feature -avx512bw -target-feature -avxneconvert -target-feature -clflushopt -target-feature +xsave -target-feature -avx512vbmi2 -target-feature +64bit -target-feature -avx512vl -target-feature -serialize -target-feature -hreset -target-feature +invpcid -target-feature -avx512cd -target-feature +avx -target-feature -vaes -target-feature -amx-int8 -target-feature +cx8 -target-feature +fma -target-feature -rtm -target-feature +bmi -target-feature -enqcmd -target-feature +rdrnd -target-feature -mwaitx -target-feature +sse4.1 -target-feature +sse4.2 -target-feature +avx2 -target-feature +fxsr -target-feature -wbnoinvd -target-feature +sse -target-feature +lzcnt -target-feature +pclmul -target-feature -rdpru -target-feature -avxifma -target-feature +f16c -target-feature +ssse3 -target-feature -sgx -target-feature -prefetchwt1 -target-feature +cmov -target-feature -avx512vbmi -target-feature -shstk -target-feature +movbe -target-feature -avx512vp2intersect -target-feature +xsaveopt -target-feature -avx512dq -target-feature +sse2 -target-feature -adx -target-feature +sse3 -mllvm -treat-scalable-fixed-error-as-warning -debugger-tuning=gdb -v -Ofast -ferror-limit 19 -fopenmp -fgnuc-version=4.2.1 -fcolor-diagnostics -vectorize-loops -vectorize-slp -faddrsig -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o mgrid.o -x ir resid.ll
Since this is turning into a game of remote debugging, can you confirm that all problematic loops are in the @resid_ function?
I will look at the profile that I collected when I get access to it (the machine is down but should be back today) and let you know about other functions/loops. Just to clarify, the attached file above holds the full-module IR before the SLP vectorizer pass, as you requested.
Looks like another issue with the SLP cost model; a small reproducer would be appreciated, and I will check when I have time. I am working on another patch that should improve the cost estimation of buildvectors/gathers and the final code in SLP.
@alexey-bataev thank you for taking a look!
I looked at the profile, and I confirm that the two loops I described above are the only contributors to the slowdown. These two loops are actually the vector and scalar portions of the same source loop; they appear due to LoopVectorizePass.
Loop1 (14.155s vs 12.715s): before.dis: 0x4be0 - 0x4d6d after.dis: 0x4f90 - 0x5115
You may find the IR diff in https://github.com/llvm/llvm-project/files/10045562/before.passes.txt.gz and https://github.com/llvm/llvm-project/files/10045563/after.passes.txt.gz by looking at the vector.body block in the "IR Dump After SLPVectorizerPass on resid_" dump. I only see some difference in instruction placement, and I am not sure whether the cost model changes are expected to cause this. I do not see any extra spills/reloads in the "after" version, so the performance change may well be related to the loop header alignment (it was 32 before the change and 16 after), which can affect instruction fetching and DSB usage. It would be great if you could check whether the instruction placement change after SLP is expected in this loop (given that this is a vectorized loop body created by LoopVectorizePass, and SLP has nothing to do for it). Otherwise, I think there is nothing you can do about the performance change in this loop.
Loop2 (3.715s vs 2.55s): before.dis: 0x4ec0 - 0x506f after.dis: 0x5240 - 0x541b
This loop is more interesting. It is the fallback scalar version, vectorized with ZMM/YMM gathers whose results are then unpacked to perform a vector fadd reduction. The shuffles seem to increase pressure on ports 0 and 5 (per the profile), and I guess the critical path for getting all reduction values from the loaded data also increases.
You may find this loop in the scalar.ph blocks in https://github.com/llvm/llvm-project/files/10045562/before.passes.txt.gz and https://github.com/llvm/llvm-project/files/10045563/after.passes.txt.gz
@alexey-bataev, I may not be able to produce a smaller reproducer. The resid_ routine is quite small already. I think you may get the same SLP behavior by extracting just that single routine from https://github.com/llvm/llvm-project/files/10107190/resid.ll.gz
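One possible way to do that extraction (a sketch; file names are assumptions) is with llvm-extract, which pulls a single function out of a module:

```shell
# Hypothetical extraction sketch: pull just the resid_ routine out of the
# full module with llvm-extract (paths are assumptions).
gunzip -k resid.ll.gz
llvm-extract -S --func=resid_ resid.ll -o resid_only.ll
```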
CPU2000/172.mgrid slowed down by 10% on Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz after https://reviews.llvm.org/D137913
This is a Fortran benchmark compiled with Flang. I am using a custom driver because flang-new is not working properly right now, so I am not sure how to share a reproducer. I will try to give as much detail as I can.
The major difference is in the resid_ routine. The VTune profile shows 16.225 seconds before the change and 19.035 seconds after the change for this routine. There are two loops that show a major time difference in the attached files.
before.dis.gz after.dis.gz
Loop1 (14.155s vs 12.715s): before.dis: 0x4be0 - 0x4d6d after.dis: 0x4f90 - 0x5115
Loop2 (3.715s vs 2.55s): before.dis: 0x4ec0 - 0x506f after.dis: 0x5240 - 0x541b
Loop2, in particular, does not look better to me after the change. The number of retired instructions increased from 23960M to 27600M.
@LebedevRI, can you please take a look and see whether there is any obvious issue? I will prepare LLVM IR for the related functions as a reproducer.