Adding "minsize" attribute to function increases size of opt output 15 times

mikaelholmen commented 5 years ago


Bugzilla Link	43324
Version	trunk
OS	Linux
CC	@bjope,@fhahn,@rotateright,@yuanfang-chen

Extended Description

If we run opt like this:

opt -disable-basicaa -licm -loop-unroll -O1 -unroll-threshold=0 -o bbi-32266_2.opt.bc bbi-32266_2.ll

and check the size of the output bbi-32266_2.opt.bc I get

-rw-r--r-- 1 uabelho eusers 203308 sep 16 13:09 bbi-32266_2.opt.bc

Then, if I just add a "minsize" attribute to function main in the input and run opt the same way, I instead get this output:

-rw-r--r-- 1 uabelho eusers 3177960 sep 16 13:12 ./bbi-32266_2.opt.bc

So, adding "minsize" makes the size of the program increase around 15 times.

bjope commented 4 years ago

(For the record, I'm working on the same project as Mikael who wrote this PR, and we have just stopped using -unroll-threshold=0 to disable unrolling and we are now using -fno-unroll-loops instead, as well as TTI hooks to set the threshold to 0 in UnrollingPreferences for our target in certain situations.)

Checking "UnrollThreshold.getNumOccurrences() > 0" would make it different to set the threshold to 0 using UnrollingPreferences and when using the command line option. That might be confusing, but on the other hand those settings should have different priorities (right?). By checking getNumOccurrences() we can distinguish where the value comes from (while still using a single UP.Threshold variable), so that might be useful.

One question is if -unroll-threshold=0 on the command line should override forced unroll (e.g. by pragma). Right now it is doing that when !OptForSize, but not with OptForSize (that is weird and hard to grasp for me). Letting -unroll-threshold=0 override forced unroll regardless of OptForSize would make sense to me.

Another question is if UP.Threshold=0 set by TTI hook should override forced unroll (e.g. by pragma). Right now it is doing that when !OptForSize, but not with OptForSize (that is weird and hard to grasp for me). Here I assume that we want the pragma to have higher precedence, regardless of OptForSize. Or we should bail our regardless of OptForSize. I guess that using UP.Threshold=1 (kind of) gives the expected behavior right now, that forced unroll does not depend on OptForSize.

To summarize, I think it is weird that Threshold=0 has different impact depending on OptForSize. It could make a little bit more sense if OptForSize disabled forced unrolling in combination with Threshold=0, but right now it is doing the opposite.

About the test case:

The test case here is a little bit weird and maybe not that realistic (I think that it is a reduced test case created by some fuzzy test case generator).

By having two nestled loops with #pragma unroll count(X) and #pragma unroll count(Y) the test case actually should expect to get unrolling by a factor X*Y. So maybe this PR should ask why that only happens when we add the optsize/minsize attributes (given that UP.Threshold=0).

About the analysis at "2019-12-04 20:06:50 CET":

Sorry for not being that clear about my earlier analysis. That was kind of a follow up to the earlier comment from Florian about "There still might be an issue with the unrolling cost modeling (or further down the pipeline), if it believes the unrolled loop is the same size as the original loop.".

Given that this specific test case has forced unroll by pragma unroll count the user has requested X*Y unrolls. So it is not incorrect in itself. But I think there is an upper threshold also for pragma unroll, and that cost analysis does not work if we end up unrolling the outer loop before the inner loop (so depending on the X and Y values unrolling should be limited, right?).

I actually think the problem related to "unrolling an outer loop first, based on the cost of having a rolled inner loop, and later ending up unrolling all the inner loops and by that breaking the cost analysis done for the outer loop" could happen also without forced unroll, without OptForSize etc. It would make sense to treat that as a different problem, so I can write a separate PR for that.

fhahn commented 4 years ago

Thanks or looking into this. I think the problem is that with minsize/optsize, we do not skip the early exit checking threshold=0. We should probably respect the user provided threshold=0 as a way to disable loop-unrolling.

id you mean 'getNumOccurrences'? diff --git a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp index 4c2b079c6bb..0e9ca841b53 100644 --- a/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp +++ b/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp @@ -1050,8 +1050,7 @@ static LoopUnrollResult tryToUnrollLoop(

// Exit early if unrolling is disabled. For OptForSize, we pick the loop size // as threshold later on.

if (UP.Threshold == 0 && (!UP.Partial || UP.PartialThreshold == 0) &&
!OptForSize)
if ((UnrollThreshold.getNumOccurrences() > 0 && UP.Threshold == 0) && (!UP.Partial || UP.PartialThreshold == 0)) return LoopUnrollResult::Unmodified;

SmallPtrSet<const Value *, 32> EphValues;

But I am not sure if that's actually the right way to handle this case. IMO using -unroll-threshold=0 is kind of a hack. To disable all unrolling at the clang level, -fno-unroll-loops should be used instead.

In the test case, we have unroll counts in the metadata, and that takes precedence over any cost modeling, unless the threshold=0. I would be curious what the original input/use case was.

The observation about the handling of unrolling nested loops is interesting! But I think not directly related to the issue here.

bjope commented 4 years ago

I've been looking a bit at

opt -passes='print,unroll,print,loop(rotate),print,unroll' -unroll-threshold=0 bbi-32266_4_with_minsize.ll -o - -S -debug-only=loop-unroll

where bbi-32266_4_with_minsize.ll is the "really reduced reproducer" with a

0 added on the function to set the minsize attribute.

(1) Before the first unroll we have these loops:

Loop at depth 1 containing: %bb.7
Loop at depth 1 containing: %bb.1
,%bb.2,%bb.3,%bb.4,%bb.5,%bb.6 Loop at depth 2 containing: %bb.2
,%bb.4,%bb.5,%bb.6 Loop at depth 3 containing: %bb.4
,%bb.5,%bb.6 Loop at depth 1 containing: %bb.7

(2) And then the unroller fails to unroll the inner loops on depth 2 and 3:

Loop Unroll: F[main] Loop %bb.4 Loop Size = 7 Can't unroll; loop not terminated by a conditional branch in the latch or header. Loop Unroll: F[main] Loop %bb.2 Loop Size = 12 will not try to unroll loop with runtime trip count -unroll-runtime not given

(3) The outer loop is however unrolled:

Loop Unroll: F[main] Loop %bb.1 Loop Size = 17 Trying runtime unrolling on Loop: Loop at depth 1 containing: %bb.1
,%bb.2,%bb.3,%bb.4,%bb.5,%bb.6,%bb.4.preheader, %bb.2.loopexit,%bb.2.preheader Loop at depth 2 containing: %bb.2
,%bb.4,%bb.5,%bb.6,%bb.4.preheader,%bb.2. loopexit Loop at depth 3 containing: %bb.4
,%bb.5,%bb.6 Using prolog remainder. Loop latch not terminated by a conditional branch. UNROLLING loop %bb.1 by 63!

(4) After the first unrolling we got lots of loops, basically 63 versions of each of the inner loops (so the list below is heavily reduced):

Loop at depth 1 containing: %bb.7
Loop at depth 1 containing: %bb.1
, ............. ....................... Loop at depth 2 containing: %bb.2
,%bb.4,%bb.5,%bb.6,%bb.4.preheader,%bb.2. loopexit Loop at depth 3 containing: %bb.4
,%bb.5,%bb.6 Loop at depth 2 containing: %bb.2.1
,%bb.4.preheader.1,%bb.4.1,%bb.5.1,%bb.2.loopexit. 1,%bb.6.1 Loop at depth 3 containing: %bb.4.1
,%bb.5.1,%bb.6.1 Loop at depth 2 containing: %bb.2.2
,%bb.4.preheader.2,%bb.4.2,%bb.5.2,%bb.2.loopexit. 2,%bb.6.2 Loop at depth 3 containing: %bb.4.2
,%bb.5.2,%bb.6.2 ... ... Loop at depth 2 containing: %bb.2.61
,%bb.4.preheader.61,%bb.4.61,%bb.5.61,%bb.2. loopexit.61,%bb.6.61 Loop at depth 3 containing: %bb.4.61
,%bb.5.61,%bb.6.61 Loop at depth 2 containing: %bb.2.62
,%bb.4.preheader.62,%bb.4.62,%bb.5.62,%bb.2. loopexit.62,%bb.6.62 Loop at depth 3 containing: %bb.4.62
,%bb.5.62,%bb.6.62

Loop at depth 1 containing: %bb.7

(5) When we get to the unroller the second time (after loop rotate) we still got all these 63 loops at depth 2, each with one inner loop so we have 63 depth 3 loops as well. This time unrolling is possible for the depth 2 loops. So we kind of get the following 63 times (but for bb.5.1, %bb.5.2, ..., %bb.5.62):

Loop Unroll: F[main] Loop %bb.4.preheader.1 Loop Size = 327 will not try to unroll loop with runtime trip count -unroll-runtime not given Loop Unroll: F[main] Loop %bb.5.2 Loop Size = 6 Trying runtime unrolling on Loop: Loop at depth 3 containing: %bb.5.2
,%bb.4.2 Using prolog remainder. Multiple exit/exiting blocks in loop and multi-exit unrolling not enabled! UNROLLING loop %bb.5.2 by 53!

Whenever we decide to unroll a loop that decision is partly made based on the size of the inner loop (even for unroll by loop pragma there is one large threshold that should not be exceeded). I guess that is one reason for always unrolling the innermost loop first. But in this case we end up unrolling the outer loop first. Since the decision to unroll the outer loop is based on that we have not unrolled the inner loop (that would possibly give a much larger cost and exceeding our thresholds) it seems wrong to allow unrolling of the inner loops at the second execution of loop-unroll. That seems like a general problem with the unrolling to me.

Wouldn't it be better to mark inner loops as "already unrolled" whenever we decide to unroll an outer loop?

For example adding something like this

// Decisions for unrolling (thresholds etc) have been based on that we do not

// unroll inner loops later (even if they for example have unroll count

// pragmas). So we better mark inner loops as already unrolled. Go ahead and

// do it directly before we start to rewrite L.

for (Loop SubLoop : L)

SubLoop->setLoopAlreadyUnrolled();

in llvm::UnrollLoop at this place: https://github.com/llvm/llvm-project/blob/ 8b0780f795eb58fca0a2456e308adaaa1a0b5013/llvm/lib/Transforms/Utils/ LoopUnroll.cpp#L530

Minor correction, we actually need something like this to mark all nestled sub loops as already unrolled:

for (Loop *InnerLoop : L->getLoopsInPreorder()) if (InnerLoop != L) InnerLoop->setLoopAlreadyUnrolled();

bjope commented 4 years ago

I've been looking a bit at

opt -passes='print,unroll,print,loop(rotate),print,unroll' -unroll-threshold=0 bbi-32266_4_with_minsize.ll -o - -S -debug-only=loop-unroll

where bbi-32266_4_with_minsize.ll is the "really reduced reproducer" with a #0 added on the function to set the minsize attribute.

(1) Before the first unroll we have these loops:

Loop at depth 1 containing: %bb.7

Loop at depth 1 containing: %bb.1

,%bb.2,%bb.3,%bb.4,%bb.5,%bb.6 Loop at depth 2 containing: %bb.2

,%bb.4,%bb.5,%bb.6 Loop at depth 3 containing: %bb.4

,%bb.5,%bb.6 Loop at depth 1 containing: %bb.7

(2) And then the unroller fails to unroll the inner loops on depth 2 and 3:

Loop Unroll: F[main] Loop %bb.4 Loop Size = 7 Can't unroll; loop not terminated by a conditional branch in the latch or header. Loop Unroll: F[main] Loop %bb.2 Loop Size = 12 will not try to unroll loop with runtime trip count -unroll-runtime not given

(3) The outer loop is however unrolled:

Loop Unroll: F[main] Loop %bb.1 Loop Size = 17 Trying runtime unrolling on Loop: Loop at depth 1 containing: %bb.1

,%bb.2,%bb.3,%bb.4,%bb.5,%bb.6,%bb.4.preheader,%bb.2.loopexit,%bb.2.preheader Loop at depth 2 containing: %bb.2

,%bb.4,%bb.5,%bb.6,%bb.4.preheader,%bb.2.loopexit Loop at depth 3 containing: %bb.4

,%bb.5,%bb.6 Using prolog remainder. Loop latch not terminated by a conditional branch. UNROLLING loop %bb.1 by 63!

(4) After the first unrolling we got lots of loops, basically 63 versions of each of the inner loops (so the list below is heavily reduced):

Loop at depth 1 containing: %bb.7

Loop at depth 1 containing: %bb.1

, ............. ....................... Loop at depth 2 containing: %bb.2

,%bb.4,%bb.5,%bb.6,%bb.4.preheader,%bb.2.loopexit Loop at depth 3 containing: %bb.4

,%bb.5,%bb.6 Loop at depth 2 containing: %bb.2.1

,%bb.4.preheader.1,%bb.4.1,%bb.5.1,%bb.2.loopexit.1,%bb.6.1 Loop at depth 3 containing: %bb.4.1

,%bb.5.1,%bb.6.1 Loop at depth 2 containing: %bb.2.2

,%bb.4.preheader.2,%bb.4.2,%bb.5.2,%bb.2.loopexit.2,%bb.6.2 Loop at depth 3 containing: %bb.4.2

,%bb.5.2,%bb.6.2 ... ... Loop at depth 2 containing: %bb.2.61

,%bb.4.preheader.61,%bb.4.61,%bb.5.61,%bb.2.loopexit.61,%bb.6.61 Loop at depth 3 containing: %bb.4.61

,%bb.5.61,%bb.6.61 Loop at depth 2 containing: %bb.2.62

,%bb.4.preheader.62,%bb.4.62,%bb.5.62,%bb.2.loopexit.62,%bb.6.62 Loop at depth 3 containing: %bb.4.62

,%bb.5.62,%bb.6.62

Loop at depth 1 containing: %bb.7

(5) When we get to the unroller the second time (after loop rotate) we still got all these 63 loops at depth 2, each with one inner loop so we have 63 depth 3 loops as well. This time unrolling is possible for the depth 2 loops. So we kind of get the following 63 times (but for bb.5.1, %bb.5.2, ..., %bb.5.62):

Loop Unroll: F[main] Loop %bb.4.preheader.1 Loop Size = 327 will not try to unroll loop with runtime trip count -unroll-runtime not given Loop Unroll: F[main] Loop %bb.5.2 Loop Size = 6 Trying runtime unrolling on Loop: Loop at depth 3 containing: %bb.5.2

,%bb.4.2 Using prolog remainder. Multiple exit/exiting blocks in loop and multi-exit unrolling not enabled! UNROLLING loop %bb.5.2 by 53!

Whenever we decide to unroll a loop that decision is partly made based on the size of the inner loop (even for unroll by loop pragma there is one large threshold that should not be exceeded). I guess that is one reason for always unrolling the innermost loop first. But in this case we end up unrolling the outer loop first. Since the decision to unroll the outer loop is based on that we have not unrolled the inner loop (that would possibly give a much larger cost and exceeding our thresholds) it seems wrong to allow unrolling of the inner loops at the second execution of loop-unroll. That seems like a general problem with the unrolling to me.

Wouldn't it be better to mark inner loops as "already unrolled" whenever we decide to unroll an outer loop?

For example adding something like this

// Decisions for unrolling (thresholds etc) have been based on that we do not
// unroll inner loops later (even if they for example have unroll count
// pragmas). So we better mark inner loops as already unrolled. Go ahead and
// do it directly before we start to rewrite L.
for (Loop SubLoop : L)
SubLoop->setLoopAlreadyUnrolled();

in llvm::UnrollLoop at this place: https://github.com/llvm/llvm-project/blob/8b0780f795eb58fca0a2456e308adaaa1a0b5013/llvm/lib/Transforms/Utils/LoopUnroll.cpp#L530

fhahn commented 5 years ago

Yes, I think we have to account for user-set unrolling thresholds there I think, if -unroll-threshold=0 is used to disable unrolling.

There still might be an issue with the unrolling cost modeling (or further down the pipeline), if it believes the unrolled loop is the same size as the original loop.

mikaelholmen commented 5 years ago

Ok, so without minsize we bail out at

// Exit early if unrolling is disabled. For OptForSize, we pick the loop size // as threshold later on. if (UP.Threshold == 0 && (!UP.Partial || UP.PartialThreshold == 0) && !OptForSize) return LoopUnrollResult::Unmodified;

since we're using -unroll-threshold=0. So we get no unrolling at all.

With minsize, as the comment says we don't bail out but instead we set the threshold to

// When optimizing for size, use LoopSize + 1 as threshold (we use < Threshold // later), to (fully) unroll loops, if it does not increase code size. if (OptForSize) UP.Threshold = std::max(UP.Threshold, LoopSize + 1);

Then we end up in computeUnrollCount where we see that the loop has a loop unroll pragma and we do

// 2nd priority is unroll count set by pragma. unsigned PragmaCount = UnrollCountPragmaValue(L); if (PragmaCount > 0) { UP.Count = PragmaCount; UP.Runtime = true; UP.AllowExpensiveTripCount = true; UP.Force = true; if ((UP.AllowRemainder || (TripMultiple % PragmaCount == 0)) && getUnrolledLoopSize(LoopSize, UP) < PragmaUnrollThreshold) return true;

so then we just set the unroll count (UP.Count) to the number set in the pragma, and we don't consider the UP.Threshold that we previously calculated at all?

As far as I can see, UnrollLoop then just follows orders and unrolls the loop UP.count times, and we get the first step of code expansion.

mikaelholmen commented 5 years ago

really reduced reproducer

mikaelholmen commented 5 years ago

Ok, now really reduced reproducer. The input is just 8 BBs large and contains 8 instructions apart from the branches:

opt -loop-unroll -loop-rotate -loop-unroll -unroll-threshold=0 -o bbi-32266_4.opt.bc bbi-32266_4.ll

gives

-rw-r--r-- 1 uabelho eusers 1784 sep 25 10:15 bbi-32266_4.opt.bc

and when adding minsize to main we get

-rw-r--r-- 1 uabelho eusers 421288 sep 25 10:16 bbi-32266_4.opt.bc

The function tryToUnrollLoop in LoopUnrollPass.cpp contains this:

// When optimizing for size, use LoopSize + 1 as threshold (we use < Threshold // later), to (fully) unroll loops, if it does not increase code size. if (OptForSize) UP.Threshold = std::max(UP.Threshold, LoopSize + 1);

so the "MinSize" indeed has some special handling (which seems to ignore the -unroll-threshold=0 flag), but it seems like it fails to adhere to the "if it does not increase code size" comment above for some reason.

mikaelholmen commented 5 years ago

reduced reproducer

mikaelholmen commented 5 years ago

Slightly reduced reproducer:

opt -loop-unroll -loop-rotate -loop-unroll -unroll-threshold=0 -o bbi-32266_3.opt.bc bbi-32266_3.ll

gives

-rw-r--r-- 1 uabelho eusers 61408 sep 17 15:58 bbi-32266_3.opt.bc

then adding "minsize" to main and rerun gives

-rw-r--r-- 1 uabelho eusers 930088 sep 17 15:58 bbi-32266_3.opt.bc

mikaelholmen commented 5 years ago

Thanks for filing the issue! I do not think r364398 is the real problem, I suspect the loop unrolling cost modeling makes a wrong decision.

Yes, even the 50% increase with "minsize" is wrong to me so I suppose r364398 exposes the problem much more.

mikaelholmen commented 5 years ago

reproducer

fhahn commented 5 years ago

Thanks for filing the issue! I do not think r364398 is the real problem, I suspect the loop unrolling cost modeling makes a wrong decision.

mikaelholmen commented 5 years ago

This seems to have gotten significantly worse with r364398:

[LoopUnroll] Add support for loops with exiting headers and uncond latches.

This patch generalizes the UnrollLoop utility to support loops that exit
from the header instead of the latch. Usually, LoopRotate would take care
of must of those cases, but in some cases (e.g. -Oz), LoopRotate does
not kick in.

Codesize impact looks relatively neutral on ARM64 with -Oz + LTO.

Program                                         master     patch     diff
 External/S.../CFP2006/447.dealII/447.dealII   629060.00  627676.00  -0.2%
 External/SPEC/CINT2000/176.gcc/176.gcc        1245916.00 1244932.00 -0.1%
 MultiSourc...Prolangs-C/simulator/simulator   86100.00   86156.00    0.1%
 MultiSourc...arks/Rodinia/backprop/backprop   66212.00   66252.00    0.1%
 MultiSourc...chmarks/Prolangs-C++/life/life   67276.00   67312.00    0.1%
 MultiSourc...s/Prolangs-C/compiler/compiler   69824.00   69788.00   -0.1%
 MultiSourc...Prolangs-C/assembler/assembler   86672.00   86696.00    0.0%

Reviewers: efriedma, vsk, paquette

Reviewed By: paquette

Differential Revision: https://reviews.llvm.org/D61962

llvm-svn: 364398

If I disable that change the size of the opt output is

-rw-r--r-- 1 uabelho eusers 308556 sep 16 13:33 bbi-32266_2.opt.bc

so minsize still increases the size with 50% compared to not using minsize, but we don't get the 15 times increase.

llvm / llvm-project