Open martong opened 10 months ago
@llvm/issue-subscribers-backend-risc-v
Author: Gabor Marton (martong)
Now, after the offending commit, in rewriteLoopExitValues we discover that this expansion has a high cost (isHighCostExpansion), but since the loop can be deleted we do the expansion anyway. So, the offending commit is logically correct, but it reveals an underlying weakness.
I'm having a little trouble following this. Is the "weakness" here just that deleting a loop might not be profitable if it means rewriting expensive values? Or is there something else going on here?
No, it is something else. Let me try to explain in other words what I mean by "weakness": normally, it is a good heuristic to delete a loop even at a high cost of expansion. The problem is that we do delete the innermost loop, but the overall number of instructions in the whole loop nest increases.
So, indeed, it is not profitable to delete the innermost loop in this case. Regular loop optimizations work on loops in isolation, starting with the innermost loop and then going outward. If the pass took the whole loop nest into consideration, we could get better optimized code. Hence my idea to consider the trip counts of each loop in the nest together with the cost of the expansion. But then there is the problem that the CFG seems to be too complex for SCEV to infer the trip counts.
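To make the trade-off concrete, here is a minimal C++ sketch of what replacing a loop by its exit value means. It is not the reduced test case from this issue: the loop shape and the names `exitValueByLoop` and `exitValueClosedForm` are assumptions made up for illustration. The closed form needs a division, and clamping the divisor with `max(1, step)` loosely mirrors the `1 umax %0` operand mentioned in the report.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical inner loop: advance x by `step` until it reaches `limit`.
// Assumes step >= 1 and that no value overflows 32 bits.
uint32_t exitValueByLoop(uint32_t start, uint32_t limit, uint32_t step) {
  uint32_t x = start;
  while (x < limit)
    x += step;
  return x;
}

// Closed form of the same exit value, as it could be computed once in the
// preheader. The loop disappears, but a division is introduced.
uint32_t exitValueClosedForm(uint32_t start, uint32_t limit, uint32_t step) {
  if (start >= limit)
    return start; // the loop body would never run
  // Clamp the divisor to at least 1 so the division can never trap.
  uint32_t divisor = std::max<uint32_t>(1u, step);
  // Number of iterations: ceil((limit - start) / divisor).
  uint32_t trips = (limit - start + divisor - 1) / divisor;
  return start + trips * step;
}
```

If the inner loop usually runs for only a few iterations, the division in the closed form can cost more than the loop it replaces, and that extra cost is paid on every iteration of the enclosing loops.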
@javedabsar @javedabsar1 Javed, I've found your presentation about SCEV, you might find this interesting
There is a run time regression in Embench's primecount benchmark since LLVM release 15.0. Using git bisect, I found that the offending change is https://reviews.llvm.org/D129710.
I made the measurements with `-target riscv32 -march=rv32imc -mabi=ilp32` on an instruction-accurate simulator that treats every instruction as taking exactly one cycle. The number of cycles is ~50% higher with LLVM 15.0 than with previous releases.
Primecount counts the number of primes up to a certain constant. It contains a loop nest of depth 3, and there are `goto`s to break out from the middle loop.
I could reduce the initial code (linked above) to this simplest case where there is still some meaningful difference in the run time:
which is equivalent to the below code if we get rid of the `goto`:
What happens here is that the offending commit enables an optimization of the induction variable `c[e]` in the innermost loop. During `rewriteLoopExitValues` we move the calculation of the final exit value of the IV into the loop's preheader.
This was not happening before the offending commit because the SCEV expression that divides by `1 umax %0` was considered unsafe to expand (`isSafeToExpand`). Now, after the offending commit, in `rewriteLoopExitValues` we discover that this expansion has a high cost (`isHighCostExpansion`), but since the loop can be deleted we do the expansion anyway. So, the offending commit is logically correct, but it reveals an underlying weakness.
As a possible solution, I was thinking of considering the product of the trip counts of the containing loops and comparing that to the gain we get from the expansion. More formally, do the expansion only if
`I * J * X' < I * J * X`, where `I` and `J` are the trip counts of the outermost and middle loops, `X` is the number of all operations in the innermost loop before the expansion, and `X'` is the same after the expansion; this calculation has to take the trip count of the innermost loop into account. The problem is that SCEV is not able to infer the trip count (nor the backedge-taken count) for any of these loops. This is perhaps because the induction variable of the innermost loop depends on the IV of an outer loop. Any pointers on SCEV's limitations would be helpful at this point. I also experimented with `Attributor`'s `AAPotentialValues` to see if the dataflow framework could discover the value of `c[e]`; unfortunately, it could not. (It reached the top.)
Any feedback is more than welcome. I've assigned people with history in the related components; my intention was solely to draw their attention. Other than this, perhaps discussing the issue and a possible fix would be more appropriate on Discourse, as an RFC?
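The proposed nest-aware check could be sketched roughly as follows. This is a standalone model, not LLVM code: `NestCost` and `shouldExpand` are hypothetical names. Refusing the expansion when the trip counts are unknown is an assumption of this sketch, not something the proposal settles. Operation counts are assumed small enough not to overflow 64 bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical cost summary for an innermost loop inside a depth-3 nest.
struct NestCost {
  bool tripCountsKnown;  // false if any trip count could not be inferred
  uint64_t outerTrips;   // I: trip count of the outermost loop
  uint64_t middleTrips;  // J: trip count of the middle loop
  uint64_t innerTrips;   // trip count of the innermost loop
  uint64_t innerBodyOps; // operations per innermost iteration
  uint64_t expansionOps; // operations of the expanded exit value
};

// Expand only if the total operation count of the nest goes down.
bool shouldExpand(const NestCost &c) {
  if (!c.tripCountsKnown)
    return false; // assumption: be conservative without trip counts
  uint64_t scale = c.outerTrips * c.middleTrips;            // I * J
  uint64_t before = scale * c.innerTrips * c.innerBodyOps;  // I * J * X
  uint64_t after = scale * c.expansionOps;                  // I * J * X'
  return after < before;
}
```

For example, with `I = J = 10`, an inner loop of 2 iterations with 3 operations each, and an expansion costing 20 operations, the check rejects the expansion (2000 operations after versus 600 before). With 50 inner iterations it accepts (2000 versus 15000).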