CPU2000/171.swim performance regression on aarch64 after D137580

vzakhari commented 1 year ago

With https://reviews.llvm.org/D137580 Flang started propagating all fast-math flags to LLVM (before the change Flang only passed ninf and contract).

The benchmark used to run for about 26 seconds; after the change it takes about 28 seconds on Ampere Altra, a slowdown of about 7.5%.

perf identified the following difference in _QQmain:

before:

```
  Children      Self  Samples  Command  Shared Object  Symbol
+   40.76%    40.55%    42639  swim     swim           [.] _QQmain
+   22.55%    22.55%    23718  swim     swim           [.] calc2_
+   19.14%    19.13%    20115  swim     swim           [.] calc1_
+   17.30%    17.30%    18184  swim     swim           [.] calc3_
```

```
       │310:┌─→ldr   d0, [x14]
 10365 │    │  add   x14, x14, x29
       │    │  ldr   d1, [x15]
 11187 │    │  add   x15, x15, x29
       │    │  ldr   d2, [x16]
 19118 │    │  add   x16, x16, x29
       │    │  fabs  d0, d0
   420 │    │  subs  x13, x13, #0x1
       │    │  fabs  d1, d1
   412 │    │  fabs  d2, d2
       │    │  fadd  d10, d10, d0
   537 │    │  fadd  d9, d9, d1
       │    │  fadd  d8, d8, d2
   539 │    └──b.ne  310
```

after:

```
  Children      Self  Samples  Command  Shared Object  Symbol
    44.37%    44.15%    50484  swim     swim           [.] _QQmain
+   21.32%    21.31%    24375  swim     swim           [.] calc2_
+   17.83%    17.82%    20378  swim     swim           [.] calc1_
    16.27%    16.27%    18601  swim     swim           [.] calc3_
```

```
   181 │3e0:┌─→ldr   d3, [x17]
  3895 │    │  subs  x1, x1, #0x2
   114 │    │  ldr   d4, [x18]
  5771 │    │  add   x18, x18, x25
    61 │    │  ldr   d5, [x0]
 13664 │    │  add   x17, x17, x25
   124 │    │  ldr   d6, [x0, #10680]
 11955 │    │  fabs  d3, d3
   163 │    │  ldr   d7, [x2]
  7542 │    │  fabs  d4, d4
    80 │    │  ldr   d16, [x3]
  5377 │    │  fabs  d5, d5
    46 │    │  fabs  d6, d6
    96 │    │  add   x3, x3, x25
    39 │    │  fabs  d7, d7
   205 │    │  fadd  d10, d3, d10
   135 │    │  fabs  d16, d16
   208 │    │  fadd  d2, d4, d2
    85 │    │  fadd  d9, d5, d9
   141 │    │  add   x2, x2, x25
    56 │    │  fadd  d1, d6, d1
    68 │    │  add   x0, x0, x25
    49 │    │  fadd  d8, d7, d8
   214 │    │  fadd  d0, d16, d0
   163 │    └──b.ne  3e0
```

The difference is caused by LoopVectorizePass, which unrolls the loop by 2 but ends up not vectorizing it.

The attached files provide LLVM IR for _QQmain:

main.ll.gz - original IR with fast
main_nofast.ll.gz - modified IR with fast replaced by ninf contract just for this loop; this restores performance to 26 seconds.

The vectorizer behavior may be reproduced with:

clang -cc1 -triple aarch64-unknown-linux-gnu -emit-obj --mrelax-relocations -disable-free -clear-ast-before-backend -mrelocation-model pic -pic-level 2 -pic-is-pie -mframe-pointer=non-leaf -menable-no-infs -menable-no-nans -fapprox-func -funsafe-math-optimizations -fno-signed-zeros -mreassociate -freciprocal-math -ffp-contract=fast -fno-rounding-math -ffast-math -ffinite-math-only -mconstructor-aliases -funwind-tables=2 -target-cpu generic -target-feature +neon -target-feature +v8.2a -target-abi aapcs -mllvm -treat-scalable-fixed-error-as-warning -debugger-tuning=gdb -v -Ofast -ferror-limit 19 -fopenmp -fno-signed-char -fgnuc-version=4.2.1 -fcolor-diagnostics -vectorize-loops -vectorize-slp -faddrsig -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o main.o -x ir main.ll

@kiranchandramohan, can you please take a look? Is there something obviously wrong with the generated code?

vzakhari commented 1 year ago

FWIW, the original loop nest looks like this:

        DO 3500 ICHECK = 1, MNMIN
         DO 4500 JCHECK = 1, MNMIN
         PCHECK = PCHECK + ABS(PNEW(ICHECK,JCHECK))
         UCHECK = UCHECK + ABS(UNEW(ICHECK,JCHECK))
         VCHECK = VCHECK + ABS(VNEW(ICHECK,JCHECK))
 4500   CONTINUE
        UNEW(ICHECK,ICHECK) = UNEW(ICHECK,ICHECK)
     1  * ( MOD (ICHECK, 100) /100.)
 3500   CONTINUE
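
For readers less familiar with what "unrolled by 2 but not vectorized" means for a reduction like this, here is a rough C sketch of one of the three sums. It is an illustration only, not code taken from the benchmark or from LLVM's output: the function names and the `stride` parameter are made up, with `stride` standing in for the strided access that the inner JCHECK loop makes. The second form keeps two independent partial sums and combines them at the end, which changes the order of the floating-point additions and is therefore only allowed once reassociation is permitted. It is consistent with the "after" listing in the first comment, which decrements the counter by 2 and carries six accumulators instead of three.

```c
#include <math.h>
#include <stddef.h>

/* Scalar form: a single accumulator, additions performed in source order. */
double sum_abs_scalar(const double *a, size_t n, size_t stride) {
    double acc = 0.0;
    for (size_t j = 0; j < n; ++j)
        acc += fabs(a[j * stride]);
    return acc;
}

/* Unrolled (interleaved) by 2: two independent partial sums that are
 * combined after the loop. The additions happen in a different order
 * than in the scalar form, so a compiler may only do this on its own
 * when reassociation is allowed. Names here are hypothetical. */
double sum_abs_interleaved2(const double *a, size_t n, size_t stride) {
    double acc0 = 0.0, acc1 = 0.0;
    size_t j = 0;
    for (; j + 1 < n; j += 2) {
        acc0 += fabs(a[j * stride]);
        acc1 += fabs(a[(j + 1) * stride]);
    }
    if (j < n)                      /* leftover element when n is odd */
        acc0 += fabs(a[j * stride]);
    return acc0 + acc1;
}
```

With inputs whose sums are exactly representable the two functions return the same value; in general the results can differ in the last bits because of the changed summation order.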

kiranchandramohan commented 1 year ago

Apologies @vzakhari, I completely missed this. We will have a look this week.

vzakhari commented 1 year ago

No problem, Kiran! Thank you for the help!

llvm / llvm-project

CPU2000/171.swim performance regression on aarch64 after D137580 #59274