Open rlavaee opened 2 years ago
@llvm/issue-subscribers-bolt
Thanks for letting us know. Are you optimizing the same clang-15 binary as before? Do you have dynostats from previous BOLT where you saw larger gains?
Unfortunately, I no longer have stats from the builds with larger gains. Also, my old perf2bolt (compiled about 1 year ago from the incubator repo) fails to run on this binary.
PERF2BOLT: out of range traces involving unknown regions: 2688310 (12.7%)
perf2bolt: $$$$/bolt/src/BinaryContext.cpp:764: void llvm::bolt::BinaryContext::populateJumpTables(): Assertion `0 && "unclaimed PC-relative relocations left in data\n"' failed.
#0 0x0000559a4c4e3fb0 PrintStackTraceSignalHandler(void*) Signals.cpp:0:0
#1 0x0000559a4c4e1d4e SignalHandler(int) Signals.cpp:0:0
#2 0x00007f7385626200 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x12200)
#3 0x00007f738509b8a1 raise ./signal/../sysdeps/unix/sysv/linux/raise.c:50:1
#4 0x00007f7385085546 abort ./stdlib/abort.c:81:7
#5 0x00007f738508542f get_sysdep_segment_value ./intl/loadmsgcat.c:509:8
#6 0x00007f738508542f _nl_load_domain ./intl/loadmsgcat.c:970:34
#7 0x00007f7385094222 (/lib/x86_64-linux-gnu/libc.so.6+0x31222)
#8 0x0000559a4b6e3593 llvm::bolt::BinaryContext::populateJumpTables() (${HOME}/copt/build/bolt_binaries/perf2bolt+0x231593)
#9 0x0000559a4b7b9831 llvm::bolt::RewriteInstance::disassembleFunctions() (${HOME}/copt/build/bolt_binaries/perf2bolt+0x307831)
#10 0x0000559a4b8121ea llvm::bolt::RewriteInstance::run() (${HOME}/copt/build/bolt_binaries/perf2bolt+0x3601ea)
#11 0x0000559a4b6685e9 main (${HOME}/copt/build/bolt_binaries/perf2bolt+0x1b65e9)
#12 0x00007f73850867fd __libc_start_main ./csu/../csu/libc-start.c:332:16
#13 0x0000559a4b6bf4da _start (${HOME}/copt/build/bolt_binaries/perf2bolt+0x20d4da)
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0. Program arguments: ${HOME}/copt/build/bolt_binaries/perf2bolt -o pgo-labels.fdata -w pgo-labels-compiler.yaml -p pgo-labels.perfdata ${HOME}/copt/source/llvm-project/relwithdeb/pgo-labels/build/bin/clang-15
Can you try -strict=0?
Old llvm-bolt works with -strict=0, but I am getting a regression, and the dyno-stats are consistent with a regression:
10634721 : executed forward branches
1291 : taken forward branches
3094667 : executed backward branches
348 : taken backward branches
1950602 : executed unconditional branches
5528009 : all function calls
1434116 : indirect calls
1177373 : PLT calls
112346938 : executed instructions
26424638 : executed load instructions
12891276 : executed store instructions
56880 : taken jump table branches
0 : taken unknown indirect branches
15679990 : total branches
1952241 : taken branches
13727749 : non-taken conditional branches
1639 : taken conditional branches
13729388 : all conditional branches
11948700 : executed forward branches (+12.4%)
908 : taken forward branches (-29.7%)
1780688 : executed backward branches (-42.5%)
1283 : taken backward branches (+268.7%)
1901836 : executed unconditional branches (-2.5%)
4348783 : all function calls (-21.3%)
1434119 : indirect calls (+0.0%)
0 : PLT calls (-100.0%)
111194037 : executed instructions (-1.0%)
26414195 : executed load instructions (-0.0%)
12891276 : executed store instructions (=)
56880 : taken jump table branches (=)
0 : taken unknown indirect branches (=)
15631224 : total branches (-0.3%)
1904027 : taken branches (-2.5%)
13727197 : non-taken conditional branches (-0.0%)
2191 : taken conditional branches (+33.7%)
13729388 : all conditional branches (=)
The latest dynostats you posted are way worse than the ones from the original post. -2.5% taken branches vs -32.9%. As if the profile was collected on a different run/binary.
Are you running the experiments on the same hardware as the old ones?
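For comparing runs like the ones above, the before/after dyno-stats can be diffed mechanically. The following is a minimal sketch that assumes the `COUNT : name (delta)` line layout seen in the pasted logs; the helper names `parse_dynostats` and `percent_change` are made up for illustration and are not part of BOLT.

```python
import re

# Matches a dyno-stats line such as "1952241 : taken branches" or
# "1904027 : taken branches (-2.5%)"; a trailing "(...)" delta is ignored.
LINE_RE = re.compile(r"^\s*(\d+)\s*:\s*(.+?)(?:\s*\([^)]*\))?\s*$")


def parse_dynostats(text):
    """Map each metric name to its integer count; non-matching lines are skipped."""
    stats = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            stats[m.group(2)] = int(m.group(1))
    return stats


def percent_change(before, after):
    """Relative change in percent for every metric present in both runs."""
    return {
        name: 100.0 * (after[name] - before[name]) / before[name]
        for name in before
        if name in after and before[name] != 0
    }


# Values copied from the old-BOLT run and the upstream run in this thread.
old_before = parse_dynostats("1952241 : taken branches\n1177373 : PLT calls")
old_after = parse_dynostats("1904027 : taken branches (-2.5%)\n0 : PLT calls (-100.0%)")
new_before = parse_dynostats("8055998 : taken branches")
new_after = parse_dynostats("5500506 : taken branches (-31.7%)")

old_delta = percent_change(old_before, old_after)
new_delta = percent_change(new_before, new_after)
```

Recomputing the deltas from the raw counts, rather than reading the printed percentages, also makes it easy to spot that the absolute counts differ a lot between runs, which is a hint that the two profiles are not equivalent.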
I regenerated the results with a new profile, this time making sure the binary has the build id. perf2bolt logs do not suggest any significant profile mismatches.
> perf2bolt -strict=0 -o pgo-labels.fdata -w pgo-labels-compiler.yaml -p pgo-labels.perfdata pgo-labels/build/bin/clang-15
BOLT-INFO: shared object or position-independent executable detected
PERF2BOLT: Starting data aggregation job for pgo-labels.perfdata
PERF2BOLT: spawning perf job to read branch events
PERF2BOLT: spawning perf job to read mem events
PERF2BOLT: spawning perf job to read process events
PERF2BOLT: spawning perf job to read task events
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 27e6ac10524f80dcddf710a1d6bc2e04481a6040
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0x5e00000, offset 0x5e00000
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-INFO: pre-processing profile using perf data aggregator
**BOLT-INFO: binary build-id is: 39c42271602dbdd3**
PERF2BOLT: spawning perf job to read buildid list
PERF2BOLT: matched build-id and file name
PERF2BOLT: waiting for perf mmap events collection to finish...
PERF2BOLT: parsing perf-script mmap events output
PERF2BOLT: waiting for perf task events collection to finish...
PERF2BOLT: parsing perf-script task events output
PERF2BOLT: input binary is associated with 108 PID(s)
PERF2BOLT: waiting for perf events collection to finish...
PERF2BOLT: parse branch events...
PERF2BOLT: read 688275 samples and 21945200 LBR entries
PERF2BOLT: 310 samples (0.0%) were ignored
PERF2BOLT: traces mismatching disassembled function contents: 576263 (2.7%)
PERF2BOLT: out of range traces involving unknown regions: 2718557 (12.8%)
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN5clang13TreeTransformIN12_GLOBAL__N_120TemplateInstantiatorEE25TransformCXXNamedCastExprEPNS_16CXXNamedCastExprE.__uniq.55632760368638704870153814335850836202/1(*2)
BOLT-WARNING: 3 collisions detected while hashing binary objects. Use -v=1 to see the list.
PERF2BOLT: processing branch events...
PERF2BOLT: wrote 332051 objects and 0 memory objects to pgo-labels.fdata
dyno_stats are still similar.
BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 27e6ac10524f80dcddf710a1d6bc2e04481a6040
BOLT-INFO: first alloc address is 0x0
BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-INFO: pre-processing profile using YAML profile reader
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN5clang13TreeTransformIN12_GLOBAL__N_120TemplateInstantiatorEE25TransformCXXNamedCastExprEPNS_16CXXNamedCastExprE.__uniq.55632760368638704870153814335850836202/1(*2)
BOLT-WARNING: 3 collisions detected while hashing binary objects. Use -v=1 to see the list.
BOLT-INFO: 9903 out of 137229 functions in the binary (7.2%) have non-empty execution profile
BOLT-INFO: 480 functions with profile could not be optimized
BOLT-INFO: the input contains 8025 (dynamic count : 231245) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: ICF folded 365 out of 137544 functions in 3 passes. 1 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 97.32 KB of code space. Folded functions were called 472 times based on profile.
BOLT-INFO: simplified 183 out of 4219 loads from a statically computed address.
BOLT-INFO: dynamic loads simplified: 10292
BOLT-INFO: dynamic loads found: 121571
BOLT-INFO: inlined 1834 calls at 95 call sites in 2 iteration(s). Change in binary size: -357 bytes.
BOLT-INFO: 10457 PLT calls in the binary were optimized.
BOLT-INFO: basic block reordering modified layout of 6324 (4.61%) functions
BOLT-INFO: UCE removed 0 blocks and 0 bytes of code.
BOLT-INFO: splitting separates 9654427 hot bytes from 9380539 cold bytes (50.72% of split functions is hot).
BOLT-INFO: 244 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 10018 to 5568
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:
10715216 : executed forward branches
1378 : taken forward branches
3122199 : executed backward branches
320 : taken backward branches
1963736 : executed unconditional branches
5559540 : all function calls
1450560 : indirect calls
1188502 : PLT calls
113164008 : executed instructions
26583888 : executed load instructions
12976293 : executed store instructions
58296 : taken jump table branches
0 : taken unknown indirect branches
15801151 : total branches
1965434 : taken branches
13835717 : non-taken conditional branches
1698 : taken conditional branches
13837415 : all conditional branches
12138932 : executed forward branches (+13.3%)
929 : taken forward branches (-32.6%)
1698483 : executed backward branches (-45.6%)
1178 : taken backward branches (+268.1%)
1963975 : executed unconditional branches (+0.0%)
4369204 : all function calls (-21.4%)
1450560 : indirect calls (=)
0 : PLT calls (-100.0%)
112169363 : executed instructions (-0.9%)
26573596 : executed load instructions (-0.0%)
12976293 : executed store instructions (=)
58296 : taken jump table branches (=)
0 : taken unknown indirect branches (=)
15801390 : total branches (+0.0%)
1966082 : taken branches (+0.0%)
13835308 : non-taken conditional branches (-0.0%)
2107 : taken conditional branches (+24.1%)
13837415 : all conditional branches (=)
BOLT-INFO: SCTC: patched 27 tail calls (24 forward) tail calls (3 backward) from a total of 29 while removing 1 double jumps and removing 22 basic blocks totalling 110 bytes of code. CTCs total execution count is 20 and the number of times CTCs are taken is 14.
BOLT-INFO: setting __hot_start to 0x5e00000
BOLT-INFO: setting __hot_end to 0x6c7573f
BOLT-INFO: patched build-id (flipped last bit)
Command being timed: "bolt_binaries/llvm-bolt -strict=0 pgo-labels/build/bin/clang-15 -o pgo-labels/build/bin/clang-15-bolt -b pgo-labels-compiler.yaml -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -split-all-cold -dyno-stats -icf=1 -use-gnu-stack -inline-small-functions -simplify-rodata-loads -plt=hot"
Next I'll generate results with upstream BOLT.
Thanks, Rahman. How was the input binary built?
Input binary is built with PGO and -Wl,-q, though it does have the extra SHT_LLVM_BB_ADDR_MAP section (generated using -fbasic-block-sections=labels, which we never found to be an issue). The full cmake command is below:
cmake -G Ninja -DLLVM_OPTIMIZED_TABLEGEN=On -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_ENABLE_EH=On -DLLVM_ENABLE_RTTI=On -DLLVM_ENABLE_LLD="On" -DCMAKE_LINKER="lld" -DLLVM_TARGETS_TO_BUILD="X86" -DCMAKE_C_COMPILER="stage1/install/bin/clang" -DCMAKE_CXX_COMPILER="stage1/install/bin/clang++" -DCMAKE_ASM_COMPILER="stage1/install/bin/clang" -DLLVM_PROFDATA_FILE=source/llvm-project/relwithdeb/stage-pgo-labels.profdata -DLLVM_ENABLE_LTO=Thin -DCMAKE_C_FLAGS="-fdebug-compilation-dir=/proc/self/cwd -funique-internal-linkage-names -fbasic-block-sections=labels" -DCMAKE_CXX_FLAGS="-fdebug-compilation-dir=/proc/self/cwd -funique-internal-linkage-names -fbasic-block-sections=labels" -DCMAKE_EXE_LINKER_FLAGS="-fuse-ld=lld -Wl,-gc-sections -Wl,--lto-basic-block-sections=labels -Wl,-z,keep-text-section-prefix -Wl,-q -Wl,-build-id" -DCMAKE_SHARED_LINKER_FLAGS="-fuse-ld=lld -Wl,-gc-sections -Wl,--lto-basic-block-sections=labels -Wl,-z,keep-text-section-prefix -Wl,-q -Wl,-build-id" -DCMAKE_MODULE_LINKER_FLAGS="-fuse-ld=lld -Wl,-gc-sections -Wl,--lto-basic-block-sections=labels -Wl,-z,keep-text-section-prefix -Wl,-q -Wl,-build-id" -DLLVM_ENABLE_PROJECTS="clang;compiler-rt;lld" source/llvm-project/llvm
Upstream bolt results regenerated:
> perf2bolt -o pgo-labels.fdata -w pgo-labels-compiler.yaml -p pgo-labels.perfdata pgo-labels/build/bin/clang-15
PERF2BOLT: Starting data aggregation job for pgo-labels.perfdata
PERF2BOLT: spawning perf job to read branch events
PERF2BOLT: spawning perf job to read mem events
PERF2BOLT: spawning perf job to read process events
PERF2BOLT: spawning perf job to read task events
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 3f028c02ba6a24b7230fd5907a2b7ba076664a8b
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0x5e00000, offset 0x5e00000
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling strict relocation mode for aggregation purposes
BOLT-WARNING: Failed to analyze 4027 relocations
BOLT-INFO: pre-processing profile using perf data aggregator
BOLT-INFO: binary build-id is: 39c42271602dbdd3
PERF2BOLT: spawning perf job to read buildid list
PERF2BOLT: matched build-id and file name
PERF2BOLT: waiting for perf mmap events collection to finish...
PERF2BOLT: parsing perf-script mmap events output
PERF2BOLT: waiting for perf task events collection to finish...
PERF2BOLT: parsing perf-script task events output
PERF2BOLT: input binary is associated with 108 PID(s)
PERF2BOLT: waiting for perf events collection to finish...
PERF2BOLT: parse branch events...
PERF2BOLT: read 688275 samples and 21945200 LBR entries
PERF2BOLT: 310 samples (0.0%) were ignored
PERF2BOLT: traces mismatching disassembled function contents: 6810 (0.0%)
PERF2BOLT: out of range traces involving unknown regions: 2717188 (12.8%)
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN5clang13TreeTransformIN12_GLOBAL__N_120TemplateInstantiatorEE25TransformCXXNamedCastExprEPNS_16CXXNamedCastExprE.__uniq.55632760368638704870153814335850836202/1(*2)
BOLT-WARNING: 4 collisions detected while hashing binary objects. Use -v=1 to see the list.
PERF2BOLT: processing branch events...
PERF2BOLT: wrote 511380 objects and 0 memory objects to pgo-labels.fdata
> llvm-bolt pgo-labels/build/bin/clang-15 -o pgo-labels/build/bin/clang-15-bolt -b pgo-labels-compiler.yaml -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions -split-all-cold -dyno-stats -icf=1 -use-gnu-stack -inline-small-functions -simplify-rodata-loads -plt=hot
BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 3f028c02ba6a24b7230fd5907a2b7ba076664a8b
BOLT-INFO: first alloc address is 0x0
BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: Failed to analyze 4027 relocations
BOLT-INFO: pre-processing profile using YAML profile reader
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN5clang13TreeTransformIN12_GLOBAL__N_120TemplateInstantiatorEE25TransformCXXNamedCastExprEPNS_16CXXNamedCastExprE.__uniq.55632760368638704870153814335850836202/1(*2)
BOLT-WARNING: 2 collisions detected while hashing binary objects. Use -v=1 to see the list.
BOLT-INFO: 10357 out of 137229 functions in the binary (7.5%) have non-empty execution profile
BOLT-INFO: 506 functions with profile could not be optimized
BOLT-INFO: the input contains 7960 (dynamic count : 463293) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 671866 instructions were shortened
BOLT-INFO: removed 1669 empty blocks
BOLT-INFO: ICF folded 725 out of 137544 functions in 4 passes. 1 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 114.55 KB of code space. Folded functions were called 140176 times based on profile.
BOLT-INFO: simplified 181 out of 4488 loads from a statically computed address.
BOLT-INFO: dynamic loads simplified: 4963
BOLT-INFO: dynamic loads found: 86997
BOLT-INFO: inlined 1231 calls at 21 call sites in 2 iteration(s). Change in binary size: -14 bytes.
BOLT-INFO: 9068 PLT calls in the binary were optimized.
BOLT-INFO: basic block reordering modified layout of 7074 (5.17%) functions
BOLT-INFO: UCE removed 1 blocks and 7 bytes of code.
BOLT-INFO: splitting separates 11653450 hot bytes from 7951531 cold bytes (59.44% of split functions is hot).
BOLT-INFO: 180 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 10137 to 1147
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:
26442260 : executed forward branches
3464392 : taken forward branches
5235586 : executed backward branches
2954773 : taken backward branches
1636833 : executed unconditional branches
2158278 : all function calls
736311 : indirect calls
354131 : PLT calls
215407051 : executed instructions
54152504 : executed load instructions
27757842 : executed store instructions
291689 : taken jump table branches
0 : taken unknown indirect branches
33314679 : total branches
8055998 : taken branches
25258681 : non-taken conditional branches
6419165 : taken conditional branches
31677846 : all conditional branches
25133697 : executed forward branches (-4.9%)
1515084 : taken forward branches (-56.3%)
6544149 : executed backward branches (+25.0%)
2831984 : taken backward branches (-4.2%)
1153438 : executed unconditional branches (-29.5%)
1803332 : all function calls (-16.4%)
736311 : indirect calls (=)
0 : PLT calls (-100.0%)
213764106 : executed instructions (-0.8%)
54147887 : executed load instructions (-0.0%)
27757842 : executed store instructions (=)
291689 : taken jump table branches (=)
0 : taken unknown indirect branches (=)
32831284 : total branches (-1.5%)
5500506 : taken branches (-31.7%)
27330778 : non-taken conditional branches (+8.2%)
4347068 : taken conditional branches (-32.3%)
31677846 : all conditional branches (=)
BOLT-INFO: SCTC: patched 25 tail calls (22 forward) tail calls (3 backward) from a total of 27 while removing 2 double jumps and removing 17 basic blocks totalling 85 bytes of code. CTCs total execution count is 1230 and the number of times CTCs are taken is 1203.
BOLT-INFO: setting __hot_start to 0x5e00000
BOLT-INFO: setting __hot_end to 0x6d6979f
BOLT-INFO: patched build-id (flipped last bit)
Let me rebuild with a pure PGO binary built only with -Wl,-q -Wl,-build-id.
Same story with the cleaner relocation-only Release build.
cmake command:
cmake -G Ninja -DLLVM_OPTIMIZED_TABLEGEN=On -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_EH=OFF -DLLVM_ENABLE_RTTI=OFF -DLLVM_ENABLE_LLD="On" -DCMAKE_LINKER="lld" -DLLVM_TARGETS_TO_BUILD="X86" -DCMAKE_C_COMPILER="stage1/install/bin/clang" -DCMAKE_CXX_COMPILER="stage1/install/bin/clang++" -DCMAKE_ASM_COMPILER="stage1/install/bin/clang" -DLLVM_PROFDATA_FILE=stage-pgo-relocs.profdata -DLLVM_ENABLE_LTO=Thin -DCMAKE_C_FLAGS="" -DCMAKE_CXX_FLAGS="" -DCMAKE_EXE_LINKER_FLAGS="-fuse-ld=lld -Wl,-q -Wl,-build-id" -DCMAKE_SHARED_LINKER_FLAGS="-fuse-ld=lld -Wl,-q -Wl,-build-id" -DCMAKE_MODULE_LINKER_FLAGS="-fuse-ld=lld -Wl,-q -Wl,-build-id" -DLLVM_ENABLE_PROJECTS="clang;compiler-rt;lld" copt/source/llvm-project/llvm
And dyno-stats:
> llvm-bolt pgo-relocs/build/bin/clang-15 -o pgo-relocs/build/bin/clang-15-bolt -b pgo-relocs-compiler.yaml -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions -split-all-cold -dyno-stats -icf=1 -use-gnu-stack -inline-small-functions -simplify-rodata-loads -plt=hot
BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 3f028c02ba6a24b7230fd5907a2b7ba076664a8b
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: Failed to analyze 2637 relocations
BOLT-INFO: pre-processing profile using YAML profile reader
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN5clang15StmtVisitorBaseISt11add_pointerN12_GLOBAL__N_117ScalarExprEmitterEPN4llvm5ValueEJEE5VisitEPNS_4StmtE.llvm.14822649050216680576/1(*2)
BOLT-INFO: 6034 out of 137017 functions in the binary (4.4%) have non-empty execution profile
BOLT-INFO: 349 functions with profile could not be optimized
BOLT-WARNING: 1 (0.0% of all profiled) function have invalid (possibly stale) profile. Use -report-stale to see the list.
BOLT-INFO: the input contains 4333 (dynamic count : 279032) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 368635 instructions were shortened
BOLT-INFO: removed 350 empty blocks
BOLT-INFO: ICF folded 439 out of 137323 functions in 3 passes. 1 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 66.58 KB of code space. Folded functions were called 111275 times based on profile.
BOLT-INFO: simplified 102 out of 3567 loads from a statically computed address.
BOLT-INFO: dynamic loads simplified: 4396
BOLT-INFO: dynamic loads found: 61706
BOLT-INFO: inlined 1276 calls at 14 call sites in 2 iteration(s). Change in binary size: 8 bytes.
BOLT-INFO: 4989 PLT calls in the binary were optimized.
BOLT-INFO: basic block reordering modified layout of 3703 (2.71%) functions
BOLT-INFO: UCE removed 1 blocks and 7 bytes of code.
BOLT-INFO: splitting separates 3226141 hot bytes from 7660865 cold bytes (29.63% of split functions is hot).
BOLT-INFO: 110 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 5943 to 699
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:
17239346 : executed forward branches
1936233 : taken forward branches
2894058 : executed backward branches
1779156 : taken backward branches
857096 : executed unconditional branches
1677524 : all function calls
570439 : indirect calls
243989 : PLT calls
163029058 : executed instructions
38333435 : executed load instructions
20638863 : executed store instructions
224046 : taken jump table branches
0 : taken unknown indirect branches
20990500 : total branches
4572485 : taken branches
16418015 : non-taken conditional branches
3715389 : taken conditional branches
20133404 : all conditional branches
16770704 : executed forward branches (-2.7%)
823127 : taken forward branches (-57.5%)
3362700 : executed backward branches (+16.2%)
1641544 : taken backward branches (-7.7%)
596697 : executed unconditional branches (-30.4%)
1432669 : all function calls (-14.6%)
570439 : indirect calls (=)
0 : PLT calls (-100.0%)
162115376 : executed instructions (-0.6%)
38329392 : executed load instructions (-0.0%)
20638863 : executed store instructions (=)
224046 : taken jump table branches (=)
0 : taken unknown indirect branches (=)
20730101 : total branches (-1.2%)
3061368 : taken branches (-33.0%)
17668733 : non-taken conditional branches (+7.6%)
2464671 : taken conditional branches (-33.7%)
20133404 : all conditional branches (=)
BOLT-INFO: SCTC: patched 9 tail calls (9 forward) tail calls (0 backward) from a total of 9 while removing 2 double jumps and removing 10 basic blocks totalling 50 bytes of code. CTCs total execution count is 1320 and the number of times CTCs are taken is 1253.
BOLT-INFO: setting __hot_start to 0x5400000
BOLT-INFO: setting __hot_end to 0x59d1365
BOLT-INFO: patched build-id (flipped last bit)
What do you mean by the "same story"? The latest dynostats look reasonable to me. E.g. taken branches "-33.0%".
Same story as in the first comment. So no regression, but the improvement is smaller than before (5.5%).
Gotcha. Same hardware as before?
Yes. It would be great if you could also redo the build and compare with your previous numbers.
Our previous evaluation was on an older version of Clang and I don't expect it to be different, but I can give it a go. It's entirely possible that smaller gains are seen with Clang-15 for a number of reasons.
I'm still seeing 25%-30% gains with BOLT on baseline (-O3) Clang-15.
For reference, I find that BOLT gives little benefit when optimizing a clang-14 binary when that binary already has IR+CSIR PGO optimization (+ thin LTO).
With the llvm 14 version of BOLT I get about a 4% benefit and with the llvm 15 version (also doing all the PGO compiles with the same compiler) only about 1.5%. In both cases the target is the same: clang-14, but the toolchain used to compile it is either v14 or v15.
One possibility is that clang-15 w/PGO+LTO already does a better job of the optimizations that fall into BOLT's domain, rather than BOLT having regressed between v14 and v15.
To clarify: what exactly does IR+CSIR PGO mean? Did you use two profiles in Clang?
Build clang once with LLVM_BUILD_INSTRUMENTED=IR, run a benchmark and collect the profile, then rebuild clang again with that profile and LLVM_BUILD_INSTRUMENTED=CSIR to collect a context-sensitive profile and run the same benchmark again. Merge the profiles together and build clang a third time pointing to the merged profiles. Then BOLT.
All of this with thin LTO enabled.
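The three-stage build described above can be written down as a plan of commands. This is only a sketch: it returns argument lists and runs nothing. The `.profraw`/`.profdata` file names, the source path and the trimmed-down set of cmake options are placeholders, not the commenter's actual setup. `LLVM_BUILD_INSTRUMENTED`, `LLVM_PROFDATA_FILE`, `LLVM_ENABLE_LTO` and `llvm-profdata merge` are the only real knobs it relies on.

```python
def csir_pgo_plan(src="llvm-project/llvm",
                  ir_prof="ir.profdata",
                  merged_prof="merged.profdata"):
    """Return the IR+CSIR PGO build steps as argument lists (nothing is executed)."""
    base = ["cmake", "-G", "Ninja", "-DCMAKE_BUILD_TYPE=Release",
            "-DLLVM_ENABLE_LTO=Thin", "-DLLVM_ENABLE_PROJECTS=clang;lld"]
    return [
        # Stage 1: IR-instrumented clang; run the benchmark, then index the raw profile.
        base + ["-DLLVM_BUILD_INSTRUMENTED=IR", src],
        ["llvm-profdata", "merge", "-o", ir_prof, "ir.profraw"],
        # Stage 2: rebuild using the IR profile while adding context-sensitive
        # instrumentation; run the same benchmark again and merge both profiles.
        base + ["-DLLVM_BUILD_INSTRUMENTED=CSIR",
                "-DLLVM_PROFDATA_FILE=" + ir_prof, src],
        ["llvm-profdata", "merge", "-o", merged_prof, ir_prof, "csir.profraw"],
        # Stage 3: final optimized build from the merged profile, linked with
        # relocations (-Wl,-q) so that BOLT can process the result afterwards.
        base + ["-DLLVM_PROFDATA_FILE=" + merged_prof,
                "-DCMAKE_EXE_LINKER_FLAGS=-Wl,-q", src],
    ]


plan = csir_pgo_plan()
```

Each cmake step would be followed by an actual build and, for the first two stages, a benchmark run that produces the raw profile; those parts depend on the local setup and are left out here.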
To me, it sounds like CSIR should be doing code layout optimizations in the compiler similar to what BOLT does in the binary, plus better register allocation, at the cost of another profiling run and recompilation. The fact that you are still able to get 1.5% on top of that is actually quite surprising. It would be interesting to compare "IR+BOLT" vs "IR+CSIR+BOLT" to find out how much performance you are gaining from having CSIR in the middle.
Right, I think BOLT is already "context sensitive" in the sense of CSIR, since it works on the final binary, after all inlining: it couldn't really be anything other than context sensitive.
So perhaps a lot of the benefit of BOLT vs vanilla PGO actually comes from this angle: vanilla PGO (as I understand it) only counts statistics at the unexpanded source level, so a function (for example) has one set of statistics, even though it might be inlined into 100 call sites, and those call sites behave wildly differently. BOLT fixes this by its nature: every inlined copy is considered distinctly, and CSIR does a similar thing.
BOLT of course still offers a lot beyond that, since it does optimizations which LLVM doesn't do today.
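The point about one shared set of statistics versus per-copy statistics can be shown with a toy model. This is not how LLVM or BOLT store profiles; the call-site names and sample counts are invented purely to show how merging hides opposite branch biases.

```python
from collections import defaultdict

# Hypothetical samples for one branch inside a helper that is inlined at two
# call sites: (call_site, branch_taken). The two sites behave in opposite ways.
samples = (
    [("site_a", True)] * 99 + [("site_a", False)] * 1
    + [("site_b", True)] * 1 + [("site_b", False)] * 99
)


def taken_ratio_merged(samples):
    """Context-insensitive view: one counter pair shared by all inlined copies."""
    taken = sum(1 for _, was_taken in samples if was_taken)
    return taken / len(samples)


def taken_ratio_per_site(samples):
    """Context-sensitive view: a separate counter pair for each inlined copy."""
    counts = defaultdict(lambda: [0, 0])  # site -> [taken, total]
    for site, was_taken in samples:
        counts[site][0] += int(was_taken)
        counts[site][1] += 1
    return {site: taken / total for site, (taken, total) in counts.items()}


merged = taken_ratio_merged(samples)
per_site = taken_ratio_per_site(samples)
```

In this toy data the merged view reports a 50/50 branch, which gives a layout pass nothing to work with, while the per-site view shows each copy is about 99% biased in one direction.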
My recent experience with LLVM trunk shows a smaller improvement on clang than my prior experience with the incubator repo (https://github.com/facebookincubator/BOLT).
Here is the log for perf2bolt and llvm-bolt:
I am measuring a 5.5% improvement on top of the PGO binary (compared to around 9-10% I was seeing before):