Open vzakhari opened 1 year ago
FWIW, the original loop nest looks like this:
          DO 3500 ICHECK = 1, MNMIN
          DO 4500 JCHECK = 1, MNMIN
          PCHECK = PCHECK + ABS(PNEW(ICHECK,JCHECK))
          UCHECK = UCHECK + ABS(UNEW(ICHECK,JCHECK))
          VCHECK = VCHECK + ABS(VNEW(ICHECK,JCHECK))
     4500 CONTINUE
          UNEW(ICHECK,ICHECK) = UNEW(ICHECK,ICHECK)
         1 * ( MOD (ICHECK, 100) /100.)
     3500 CONTINUE
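For readers less familiar with why the fast-math flags matter here: the inner loop is a floating-point sum reduction, and LLVM may only reorder the additions of such a reduction when reassociation is allowed (the `reassoc` flag, which `fast` implies and `ninf contract` does not). The C sketch below is only a source-level illustration of that idea, not the code Flang or the vectorizer actually emits. The function names are hypothetical, and the two-accumulator form is an assumption about what "unrolled by 2" roughly corresponds to.

```c
#include <math.h>
#include <stddef.h>

/* Strict left-to-right sum of |a[i]|: a single accumulator forms one
   serial dependency chain, which is the order the source prescribes. */
float sum_abs_ordered(const float *a, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc += fabsf(a[i]);
    return acc;
}

/* Same sum with two independent partial accumulators (interleaved by 2).
   This changes the order of the floating-point additions, so a compiler
   may only do it on its own when reassociation is permitted. */
float sum_abs_interleaved2(const float *a, size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += fabsf(a[i]);
        acc1 += fabsf(a[i + 1]);
    }
    if (i < n)              /* odd-length tail element */
        acc0 += fabsf(a[i]);
    return acc0 + acc1;     /* combine the partial sums once at the end */
}
```

For inputs whose partial sums are exactly representable, both functions return the same value. For general inputs they may differ in the last bits, which is why the transformation is gated on the fast-math flags.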
Apologies @vzakhari, I completely missed this. We will have a look this week.
No problem, Kiran! Thank you for the help!
With https://reviews.llvm.org/D137580 Flang started propagating all fast-math flags to LLVM (before the change Flang only passed `ninf` and `contract`). The benchmark used to run for about 26 seconds, and after the change it takes about 28 seconds on Ampere Altra, a slowdown of about 7.5%.

perf identified the following difference in `_QQmain`:

The difference is caused by `LoopVectorizePass`, which unrolls the loop by 2 and ends up not vectorizing it.

The attached files provide LLVM IR for `_QQmain`: one with `fast`, and one with `fast` replaced by `ninf contract` just for this loop; the latter restores performance to 26 seconds.

The vectorizer behavior may be reproduced with:
@kiranchandramohan, can you please take a look? Is there something obviously wrong with the generated code?