llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
28.48k stars 11.77k forks source link

CPU2000/172.mgrid performance regression after D137913 #59075

Open vzakhari opened 1 year ago

vzakhari commented 1 year ago

CPU2000/172.mgrid slowed down by 10% on Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz after https://reviews.llvm.org/D137913

This is a Fortran benchmark compiled with Flang. I am using a custom driver, because flang-new is not properly working right now, so I am not sure how to share a reproducer. I will try to give as much detail as I can.

The major difference is in resid_ routine. VTune profile shows 16.225 seconds before the change, and 19.035 seconds after the change for this routine.

There are two loops that show major time difference in the attached files.

before.dis.gz after.dis.gz

Loop1 (14.155s vs 12.715s): before.dis: 0x4be0 - 0x4d6d after.dis: 0x4f90 - 0x5115

Loop2 (3.715s vs 2.55s): before.dis: 0x4ec0 - 0x506f after.dis: 0x5240 - 0x541b

Loop2, in particular, does not look better to me after the change. The amount of retired instructions increased from 23960M to 27600M.

@LebedevRI, can you please take a look and see if you see any obvious issue? I will prepare LLVM IR for the related functions for reproducer.

LebedevRI commented 1 year ago

Can you at least provide the final LLVM IR?

vzakhari commented 1 year ago

I am working on it...

LebedevRI commented 1 year ago

Also the idea reproducer would be the full IR module before that whole final SROA pass.

vzakhari commented 1 year ago

I understand this, but I am not sure I can share full IR module for this benchmark. I am attaching after-all dump for resid routine. I will try to upload a suitable reproducer, if I can. Thank you for looking into this!

before.passes.txt.gz after.passes.txt.gz

llvmbot commented 1 year ago

@llvm/issue-subscribers-backend-x86

LebedevRI commented 1 year ago

Oh. Wait. Somehow i thought this was pointing at the SROA patch, not the costmodel one. Oops.

LebedevRI commented 1 year ago

So what would be useful, is the full-module IR before SLP vectorizer pass... CC @alexey-bataev

vzakhari commented 1 year ago

Sorry for the delay, @LebedevRI.

Please try the attached resid.ll.gz with the following compilation command:

clang -cc1 -triple x86_64-unknown-linux-gnu -emit-obj --mrelax-relocations -disable-free -clear-ast-before-backend -main-file-name mgrid.ll -mrelocation-model pic -pic-level 2 -pic-is-pie -mframe-pointer=none -menable-no-infs -menable-no-nans -fapprox-func -funsafe-math-optimizations -fno-signed-zeros -mreassociate -freciprocal-math -fdenormal-fp-math=preserve-sign,preserve-sign -ffp-contract=fast -fno-rounding-math -ffast-math -ffinite-math-only -mconstructor-aliases -funwind-tables=2 -target-cpu haswell -target-feature -avx512pf -target-feature -tsxldtrk -target-feature +cx16 -target-feature +sahf -target-feature -tbm -target-feature -avx512ifma -target-feature -sha -target-feature +crc32 -target-feature -fma4 -target-feature -vpclmulqdq -target-feature -prfchw -target-feature +bmi2 -target-feature -cldemote -target-feature +fsgsbase -target-feature -avx512bf16 -target-feature -amx-tile -target-feature -raoint -target-feature -uintr -target-feature -gfni -target-feature +popcnt -target-feature -ptwrite -target-feature +aes -target-feature -avx512bitalg -target-feature -movdiri -target-feature -widekl -target-feature -xsaves -target-feature -avx512er -target-feature -avxvnni -target-feature -avx512fp16 -target-feature -avx512vnni -target-feature -amx-bf16 -target-feature -avxvnniint8 -target-feature -avx512vpopcntdq -target-feature -pconfig -target-feature -clwb -target-feature -cmpccxadd -target-feature -avx512f -target-feature -xsavec -target-feature -clzero -target-feature -pku -target-feature -amx-fp16 -target-feature +mmx -target-feature -lwp -target-feature -rdpid -target-feature -xop -target-feature -rdseed -target-feature -waitpkg -target-feature -prefetchi -target-feature -kl -target-feature -movdir64b -target-feature -sse4a -target-feature -avx512bw -target-feature -avxneconvert -target-feature -clflushopt -target-feature +xsave -target-feature -avx512vbmi2 -target-feature +64bit -target-feature -avx512vl -target-feature -serialize -target-feature -hreset -target-feature +invpcid -target-feature -avx512cd -target-feature +avx -target-feature -vaes -target-feature -amx-int8 -target-feature +cx8 -target-feature +fma -target-feature -rtm -target-feature +bmi -target-feature -enqcmd -target-feature +rdrnd -target-feature -mwaitx -target-feature +sse4.1 -target-feature +sse4.2 -target-feature +avx2 -target-feature +fxsr -target-feature -wbnoinvd -target-feature +sse -target-feature +lzcnt -target-feature +pclmul -target-feature -rdpru -target-feature -avxifma -target-feature +f16c -target-feature +ssse3 -target-feature -sgx -target-feature -prefetchwt1 -target-feature +cmov -target-feature -avx512vbmi -target-feature -shstk -target-feature +movbe -target-feature -avx512vp2intersect -target-feature +xsaveopt -target-feature -avx512dq -target-feature +sse2 -target-feature -adx -target-feature +sse3 -mllvm -treat-scalable-fixed-error-as-warning -debugger-tuning=gdb -v -Ofast -ferror-limit 19 -fopenmp -fgnuc-version=4.2.1 -fcolor-diagnostics -vectorize-loops -vectorize-slp -faddrsig -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o mgrid.o -x ir resid.ll
LebedevRI commented 1 year ago

Since this is turning into a game of remote debugging, can you confirm that all problematic loops are in @resid_ function?

vzakhari commented 1 year ago

I will look at the profile that I collected, when I get access to it (the machine is down, but should be back today), and let you know about other functions/loops. Just to clarify, the attached file above holds full-module IR before SLP vectorizer pass as you requested.

alexey-bataev commented 1 year ago

Looks like another issue with the SLP cost model, small reproducer is appreciated, will check when I have a time. Working on another patch that should improve cost estimation of buildvectors/gathers and final code in SLP.

LebedevRI commented 1 year ago

@alexey-bataev thank you for taking a look!

vzakhari commented 1 year ago

I looked at the profile, and I confirm that the two loops I described above are the only contributors to the slowdown. These two loops are actually a vector and scalar portion of the same loop - they appear due to LoopVectorizePass.

Loop1 (14.155s vs 12.715s): before.dis: 0x4be0 - 0x4d6d after.dis: 0x4f90 - 0x5115

You may find the IR diff in https://github.com/llvm/llvm-project/files/10045562/before.passes.txt.gz and https://github.com/llvm/llvm-project/files/10045563/after.passes.txt.gz by looking at vector.body block in IR Dump After SLPVectorizerPass on resid_ dump. I only see some instructions placement difference, and I am not sure whether the cost model changes are expected to cause this. I do not see any extra spills/reloads in the "after" version, so the performance change may as well be related to the loop header alignment (it was 32 before the change, and it is 16 after the change) that may affect instruction fetching and DSB usage. It would be great if you can check whether the instructions placement change after SLP is expected in this loop (given that this is a vectorized loop body created by LoopVectorizePass, and SLP has nothing to do for it). Otherwise, I think there is nothing you can do about the performance change in this loop.

Loop2 (3.715s vs 2.55s): before.dis: 0x4ec0 - 0x506f after.dis: 0x5240 - 0x541b

This loop is more interesting. It is a fallback scalar version vectorized with ZMM/YMM gathers the results of which are then just unpacked to perform vector fadd-reduction. The shuffles seem to increase pressure on ports 0 & 5 (per profile), and I guess the critical path for getting all reduction values from the loaded data also increases.

You may find this loop in scalar.ph blocks in https://github.com/llvm/llvm-project/files/10045562/before.passes.txt.gz and https://github.com/llvm/llvm-project/files/10045563/after.passes.txt.gz

@alexey-bataev, I may not be able to produce smaller reproducer. resid_ routine is quite small already. I think you may get the same SLP behavior by extracting just the single routine from https://github.com/llvm/llvm-project/files/10107190/resid.ll.gz