Gigantic block frequencies produced by clang in PGO

llvmbot commented 9 years ago


Bugzilla Link	23758
Version	trunk
OS	All
Attachments	CFG with edge frequencies.
Reporter	LLVM Bugzilla Contributor
CC	@vns-mn,@rotateright,@yuanfang-chen

Extended Description

When compiling spec2000 in PGO mode, I noticed many very large block frequencies for some programs. They are too big to be true. One example is 164.gzip in spec2000. I made the profile data from running the binary with options "input.random 60". The produced CFG is attached here, which shows the hottest block has the frequency over 14E12. Note that the frequency of the entry node is only 17, which means that the hottest node runs almost 1E12 times more that the entry.

To find out the actual frequency of that hot node, I profiled the same program using GCC, and also inserted a counter manually to the program. Both showed that the hottest node ran only about 4E7 times more than the entry.

So I think there may be something wrong in LLVM's frequency calculation.

(In the attached CFG, the numbers on edges are frequencies.)

llvmbot commented 9 years ago

The CFG of test.ll after jump-threading+loop-rotate+jump-threading passes.

llvmbot commented 9 years ago

The CFG of the original test.ll.

llvmbot commented 9 years ago

I have produced the CFG for test.ll when !3 = !{!"branch_weights", i32 100, i32 1}. I think this weight makes more sense (although it is not the only valid one). The PDF file is attached here (numbers are edge frequencies).

The loop scales computed after jump-threading + loop-rotate + jump-threading are:

opt test.ll -block-freq -debug-only=block-freq 2>&1 | grep "- scale ="

scale = 101.0
scale = 101.0

opt test.ll -jump-threading -loop-rotate -jump-threading -block-freq -debug-only=block-freq 2>&1 | grep "- scale ="

scale = 1.00990001
scale = 1020201.016

Still quite wrong.

I also attached a CFG after jump-threading + loop-rotate + jump-threading passes. The reason why the inner loop scale is quite incorrect is because the edge from land.end to while.condthread-pre-split.loopexit to while.end.8 is not going back to the inner loop (while it was originally). This is the same issue that the patch http://reviews.llvm.org/D10979 is solving. After I applied this patch the loop scales can now be correctly calculated.

Therefore, this test case cannot reproduce the problem I want to describe. I will synthesize a new one later.

david-xl commented 9 years ago

I think I modified the test case by mistake before I attached it. The result given by me was produced when !3 = !{!"branch_weights", i32 1, i32 100}.

It seems to me that the only possible branch weight here is {0, 100} due to the static conditions.

Given !3 = ! { .... i32 0, i32 100} and run -block-freq only, you can find that the result is quite correct. while.body3 etc blocks should have frequency == 0, instead of 1.0. The different looks small, but it may affect branch probability update later a lot.

For the test case from you, there are two loops instead of three (note that there are only two loop headers: while.body and while.cond.1).

while.body has two backedges though.

The weight of !2 seems incorrect as in this way there is no exit from the loop and we got infinite loops. That is why we got a huge number as the loop scale. If we don't run jump-threading pass on this test case we still get the same scale:

Right. If profile data is updated correctly, the final profile data after the transformation (jump-threading, loop rotation, jump-threading) should look like the program pasted below (The block frequency propagation still has problems here. for instance cond.1 BB should have freq ~100, not ~200).

To correctly update the profile data, one possible way is to first compute block frequency before jump-threading. After jump-threading is done, update bock frequencies. Later using flow-constraint to recompute out going cFG edge's frequencies that are affected. The branch prob data is then updated.

For instance before the second jumpthreading, we have the following:

entry: float = 1.0
while.condthread-pre-split.loopexit: float = 1.0
while.cond.loopexit: float = 100.0
while.body: float = 101.0
while.cond.1: float = 101.0
land.end: float = 1.0
while.end.8: float = 1.0

From above we know that edge cond1->while.cond.loop.exit has frequency 100.0, thus edge while.cond.loopexit -> while.end.8 has frequency 0.

Similarly, entry->while.end.8 has frequency 0 and edge while.condthread-pre-split.loopexit->while.end.8 has frequency 0.

Other solution does not seem to be unique though. The following seems possible:

entry->while.body : 0.5 entry->while.end.8 :0.5 while.condthread-pre-split.loopexit->while.end.8 : 0.5 while.condthread-pre-split.loopexit->while.body : 0.5

but they should later yield same BB frequency ..

// expected IL with the right profile data after jumpthreading+rotation+jumpthreading

target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128" target triple = "x86_64-unknown-linux-gnu"

@lookahead = global i32 0, align 4 @eofile = global i32 0, align 4 @n = global i32 0, align 4 @m = global i32 0, align 4

define void @foo() { entry: %.pr.3 = load i32, i32* @lookahead, align 4 %cmp.2.4 = icmp ne i32 %.pr.3, 0 br i1 %cmp.2.4, label %while.body, label %while.end.8, !prof !0

while.condthread-pre-split.loopexit: ; preds = %land.end %.pr = load i32, i32* @lookahead, align 4 %cmp.2 = icmp ne i32 %.pr, 0 br i1 %cmp.2, label %while.body, label %while.end.8, !prof !3

while.body: ; preds = %while.cond.1, %while.condthread-pre-split.loopexit, %entry %0 = load i32, i32* @m, align 4 %inc = add nsw i32 %0, 1 br label %while.cond.1

while.cond.1: ; preds = %land.end, %while.body %1 = load i32, i32* @lookahead, align 4 %cmp2 = icmp slt i32 %1, 10 br i1 %cmp2, label %land.end, label %while.body, !prof !1

land.end: ; preds = %while.cond.1 %2 = load i32, i32* @eofile, align 4 %tobool = icmp ne i32 %2, 0 %lnot = xor i1 %tobool, true br i1 %lnot, label %while.cond.1, label %while.condthread-pre-split.loopexit, !prof !2

while.end.8: ; preds = %while.condthread-pre-split.loopexit, %entry ret void }

!0 = !{!"branch_weights", i32 1, i32 0} !1 = !{!"branch_weights", i32 1, i32 100} !2 = !{!"branch_weights", i32 0, i32 1} !3 = !{!"branch_weights", i32 0, i32 1}

llvmbot commented 9 years ago

Corrected test case

llvmbot commented 9 years ago

I think I modified the test case by mistake before I attached it. The result given by me was produced when !3 = !{!"branch_weights", i32 1, i32 100}.

For the test case from you, there are two loops instead of three (note that there are only two loop headers: while.body and while.cond.1). The weight of !2 seems incorrect as in this way there is no exit from the loop and we got infinite loops. That is why we got a huge number as the loop scale. If we don't run jump-threading pass on this test case we still get the same scale:

opt test2.ll -jump-threading -loop-rotate -jump-threading -simplifycfg -block-freq -debug-only=block-freq 2>&1 | grep "- scale ="

scale = 1.009999899
scale = 1008208386.5 opt test2.ll -simplifycfg -block-freq -debug-only=block-freq 2>&1 | grep "- scale ="
scale = 1.009999899
scale = 1008208386.5

The attached test case, however, has different scales when we run jump-threading or not:

opt test.ll -jump-threading -loop-rotate -jump-threading -simplifycfg -block-freq -debug-only=block-freq 2>&1 | grep "- scale ="

scale = 1.000098039
scale = 10302.00018 opt test.ll -simplifycfg -block-freq -debug-only=block-freq 2>&1 | grep "- scale ="
scale = 1.000098039
scale = 101.0

david-xl commented 9 years ago

Created attachment 14607 [details] Updated test case.

To match the test case, the branch prob of land.end can not be {100, 100}, but more of {9900,990099}

!3 = !{!"branch_weights", i32 9900, i32 990099}

(The weight values are scaled up to preserve precision of very small probability value -- I think this needs to be done in profile update)

With that fixed, I run though the three passes and manually fixed up/updated the MD_prof data and the probability data looks sane in the final IR dump (see attached file). However block freq analysis still reports wrong loop scale (it also only reports 2 loops instead of 3).

Can you take a look at that? It seems to be a bug in block freq propagation algorithm:

; ModuleID = 'test2.ll' target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128" target triple = "x86_64-unknown-linux-gnu"

@lookahead = global i32 0, align 4 @eofile = global i32 0, align 4 @n = global i32 0, align 4 @m = global i32 0, align 4

define void @foo() { entry: %.pr.3 = load i32, i32* @lookahead, align 4 %cmp.2.4 = icmp ne i32 %.pr.3, 0 br i1 %cmp.2.4, label %while.body, label %while.end.8, !prof !0

while.condthread-pre-split.loopexit: ; preds = %land.end %.pr = load i32, i32* @lookahead, align 4 %cmp.2 = icmp ne i32 %.pr, 0 br i1 %cmp.2, label %while.body, label %while.end.8, !prof !0

while.body: ; preds = %while.cond.1, %while.condthread-pre-split.loopexit, %entry %0 = load i32, i32* @m, align 4 %inc = add nsw i32 %0, 1 br label %while.cond.1

while.cond.1: ; preds = %land.end, %while.body %1 = load i32, i32* @lookahead, align 4 %cmp2 = icmp slt i32 %1, 10 br i1 %cmp2, label %land.end, label %while.body, !prof !1

land.end: ; preds = %while.cond.1 %2 = load i32, i32* @eofile, align 4 %tobool = icmp ne i32 %2, 0 %lnot = xor i1 %tobool, true br i1 %lnot, label %while.cond.1, label %while.condthread-pre-split.loopexit, !prof !2

while.end.8: ; preds = %while.condthread-pre-split.loopexit, %entry ret void }

!0 = !{!"branch_weights", i32 100, i32 1} !1 = !{!"branch_weights", i32 1, i32 100} !2 = !{!"branch_weights", i32 100000, i32 0}

llvmbot commented 9 years ago

Updated test case.

llvmbot commented 9 years ago

I have updated the test case by adding the branch prof data for land.end. Now we could see the loop scale calculated for the inner loop is wrong with jump threading:

opt test.ll -block-freq -debug-only=block-freq 2>&1 | grep "- scale ="

scale = 1.01
scale = 101.0 opt test.ll -jump-threading -loop-rotate -jump-threading -simplifycfg -block-freq -debug-only=block-freq 2>&1 | grep "- scale ="
scale = 1.000098039
scale = 10302.00018

david-xl commented 9 years ago

Created attachment 14571 [details] Test case that shows incorrect loop scales due to JumpThreading.

In the test case, it seems that the there is a missing MD_prof data for the branch at BB land.end -- the inner loop's scale depends on it.

llvmbot commented 9 years ago

Test case that shows incorrect loop scales due to JumpThreading.

llvmbot commented 9 years ago

My first conclusion is that in jump threading pass the branch weight metadata is not updated after CFG modification. This may lead to the loop scale being incorrectly calculated. I proposed a patch to fix this issue: http://reviews.llvm.org/D10979

After deep investigation into this issue, I found in some cases it is difficult to prevent LLVM from calculating a huge loop scale during block frequency analysis. Here is a scenario when this happens:

Say we have a while loop whose only exit is the loop header and it has two backedges, and also assume the weight on the exit edge is 1 and the sum of weights on other out-going edges from the header is 100 (this means this loop has 100 iterations). With loop rotation, we may get two loop exits that are generated from the source nodes of back edges. Then at each exit, the weights on the exit edge and backedge are still 1 and 100 (which makes sense as we don't know from which backedge the loop is more likely to exit previously). Then during jump threading analysis, we found that one of the exit won't be taken at all, and that exit will be removed later. This can lead to larger loop scale being calculated during block frequency analysis. For example, assume those two backedges are equally likely to be taken, then the frequencies (actually should be the mass) of both backedges and the exit edge are roughly 200 and 1, resulting in 200 iterations. When the backedge whose source node is the only exit of the loop is much less likely to be taken than the other backedge, we will get huge iteration number.

I have attached a test case that can reproduce this issue. To run the test, use the following commands:

opt test.ll -block-freq -debug-only=block-freq 2>&1 | grep "- scale ="

scale = 32.0
scale = 101.0 opt test.ll -jump-threading -loop-rotate -jump-threading -simplifycfg -block-freq -debug-only=block-freq 2>&1 | grep "- scale ="
scale = 1.009684474
scale = 323301.0964

In this case the scale for both loop are miscalculated.

llvmbot commented 9 years ago

I found the problem is in the pass "JumpThreading". Once I disabled this pass, the block frequencies can be correctly calculated.

With this pass enabled, the edge weights stored in BasicBlock can be incorrect. I suspect that the JumpThreading pass fails to produce the edge weight for newly generated blocks.

llvmbot commented 9 years ago

Running llvm-profdata on the profile data file gives the correct looking counters as shown below. Note that the function deflate has the largest counter 188695485, which is much smaller than the calculated frequency 14E12. So I suspect there is some defects when calculating block frequencies.

$ llvm-profdata show 4.profdata --function=deflate --counts Counters: deflate.c:fill_window: Hash: 0x000a28434434a28f Counters: 10 Function count: 9595 Block counts: [0, 9595, 314408960, 197836636, 314408960, 114068048, 9595, 5, 9595] deflate: Hash: 0x6b160b21ab876ffc Counters: 21 Function count: 5 Block counts: [2, 188695485, 118720662, 144418146, 144418146, 3, 319026, 391833, 46755, 46755, 1440, 0, 0, 188601972, 5757, 5757, 5757, 6543, 3, 3] deflate.c:deflate_fast: Hash: 0x8fd76a44e4ad645e Counters: 13 Function count: 2 Block counts: [125342756, 78862029, 95930036, 2, 242702, 242702, 243662, 3824, 3784, 3838, 4362, 2] Functions shown: 3 Total functions: 96 Maximum function count: 313991486 Maximum internal block count: 629126405

llvmbot commented 9 years ago

assigned to @dnovillo

llvm / llvm-project

Gigantic block frequencies produced by clang in PGO #24132

Extended Description