llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
29.11k stars 12.01k forks source link

Block Frequencies for loop seem incorrect #28165

Open AlexisPerry opened 8 years ago

AlexisPerry commented 8 years ago
Bugzilla Link 27791
Version trunk
OS All
Attachments source file
CC @vns-mn,@hfinkel,@rotateright

Extended Description

$ clang++ -O2 -fprofile-instr-generate ProfileTester.cpp -o ProfileTester -fno-vectorize -fno-unroll-loops $ LLVM_PROFILE_FILE="profiletester-%p.profraw" ./ProfileTester Number of entries into the if statement: 11 The new x value is: 2.07692 The new c value is: 2.69231 $ llvm-profdata merge -output=profiletester.profdata profiletester-*.profraw $ clang++ -O2 -fprofile-instr-use=profiletester.profdata ProfileTester.cpp -S -emit-llvm -o ProfileTester-with-pd.ll -fno-vectorize -fno-unroll-loops $ opt -analyze -block-freq < ProfileTester-with-pd.ll Printing analysis 'Block Frequency Analysis' for function 'main': block-frequency-info: main

Printing analysis 'Block Frequency Analysis' for function '_GLOBAL__sub_I_ProfileTester.cpp': block-frequency-info: _GLOBAL__sub_I_ProfileTester.cpp

Please note:

This is a top-level loop with a static trip count of 10000. Thus, its loop header should execute 10000 times as frequently as the function entry block. The reported frequency is ~5000 times. This seems off by a factor of 2.

david-xl commented 8 years ago

FE based PGO uses the Laplace rule of succession to 'normalize' the branch weight (see following code).

What happens is that for the loop backedge in the test case, it has the effect of reduce the loop trip count by half when the 'normalized' weight is fed to the BFI.

IR based PGO does not have this rule.

Try

-fprofile-instr-generate -Xclang -fprofile-instrument=llvm to turn on IR based instrumentation.

(the profile use command line is the same as FE based, ie., no additional option is needed to turn it on).

/// According to Laplace's Rule of Succession, it is better to compute the /// weight based on the count plus 1, so universally add 1 to the value. /// /// \pre \c Scale was calculated by \a calculateWeightScale() with a weight no /// greater than \c Weight. static uint32_t scaleBranchWeight(uint64_t Weight, uint64_t Scale) { assert(Scale && "scale by 0?"); uint64_t Scaled = Weight / Scale + 1; assert(Scaled <= UINT32_MAX && "overflow 32-bits"); return Scaled; }