llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

BOLT gives lower improvement on clang-bootstrap than before #56274

Open rlavaee opened 2 years ago

rlavaee commented 2 years ago

With BOLT built from LLVM trunk, I am seeing a smaller improvement on clang than I previously saw with the incubator repo (https://github.com/facebookincubator/BOLT).

Here is the log for perf2bolt and llvm-bolt:

> perf2bolt -o pgo-labels.fdata -w pgo-labels-compiler.yaml -p pgo-labels.perfdata clang-15                                                                                                                                                             
BOLT-INFO: shared object or position-independent executable detected                                                                                                                                                                                                                                                                                                                                                  
PERF2BOLT: Starting data aggregation job for pgo-labels.perfdata                                                                                                                                                                                                                                                                                                                                                      
PERF2BOLT: spawning perf job to read branch events                                                                                                                                                                                                                                                                                                                                                                    
PERF2BOLT: spawning perf job to read mem events                                                                                                                                                                                                                                                                                                                                                                       
PERF2BOLT: spawning perf job to read process events                                                                                                                                                                                                                                                                                                                                                                   
PERF2BOLT: spawning perf job to read task events                                                                                                                                                                                                                                                                                                                                                                      
BOLT-INFO: Target architecture: x86_64                                                                                                                                                                                                                                                                                                                                                                                
BOLT-INFO: BOLT version: 3f028c02ba6a24b7230fd5907a2b7ba076664a8b                                                                                                                                                                                                                                                                                                                                                     
BOLT-INFO: first alloc address is 0x0                                                                                                                                                                                                                                                                                                                                                                                 
BOLT-INFO: creating new program header table at address 0x5400000, offset 0x5400000                                                                                                                                                                                                                                                                                                                                   
BOLT-INFO: enabling relocation mode                                                                                                                                                                                                                                                                                                                                                                                   
BOLT-INFO: enabling strict relocation mode for aggregation purposes                                                                                                                                                                                                                                                                                                                                                   
BOLT-WARNING: Failed to analyze 2529 relocations                                                                                                                                                                                                                                                                                                                                                                      
BOLT-INFO: pre-processing profile using perf data aggregator                                                                                                                                                                                                                                                                                                                                                          
BOLT-WARNING: build-id will not be checked because we could not read one from input binary                                                                                                                                                                                                                                                                                                                            
PERF2BOLT: waiting for perf mmap events collection to finish...                                                                                                                                                                                                                                                                                                                                                       
PERF2BOLT: parsing perf-script mmap events output                                                                                                                                                                                                                                                                                                                                                                     
PERF2BOLT: waiting for perf task events collection to finish...                                                                                                                                                                                                                                                                                                                                                       
PERF2BOLT: parsing perf-script task events output                                                                                                                                                                                                                                                                                                                                                                     
PERF2BOLT: input binary is associated with 100 PID(s)                                                                                                                                                                                                                                                                                                                                                                 
PERF2BOLT: waiting for perf events collection to finish...                                                                                                                                                                                                                                                                                                                                                            
PERF2BOLT: parse branch events...                                                                                                      
PERF2BOLT: read 492075 samples and 15682980 LBR entries                                                                                
PERF2BOLT: 216 samples (0.0%) were ignored                                                                                             
PERF2BOLT: traces mismatching disassembled function contents: 5324 (0.0%)                                                              
PERF2BOLT: out of range traces involving unknown regions: 1618631 (10.7%)                                                              
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN4llvm10BasicBlock28replaceSuccessorsPhiUsesWithEPS0_S1_                                                                                                                                                             
BOLT-WARNING: 4 collisions detected while hashing binary objects. Use -v=1 to see the list.                                                                                                                                                                                    
PERF2BOLT: processing branch events..
> llvm-bolt clang-15 -o clang-15-bolt -b pgo-relocs-compiler.yaml -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions -split-all-cold -dyno-stats -icf=1 -use-gnu-stack -inline-small-functions -simplify-rodata-loads -plt=hot

BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 3f028c02ba6a24b7230fd5907a2b7ba076664a8b
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: Failed to analyze 2529 relocations
BOLT-INFO: pre-processing profile using YAML profile reader
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN4llvm10BasicBlock28replaceSuccessorsPhiUsesWithEPS0_S1_
BOLT-INFO: 6042 out of 136908 functions in the binary (4.4%) have non-empty execution profile
BOLT-INFO: 347 functions with profile could not be optimized
BOLT-INFO: the input contains 4354 (dynamic count : 268784) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 371417 instructions were shortened
BOLT-INFO: removed 344 empty blocks
BOLT-INFO: ICF folded 413 out of 137214 functions in 3 passes. 0 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 59.75 KB of code space. Folded functions were called 113460 times based on profile.
BOLT-INFO: simplified 102 out of 3594 loads from a statically computed address.
BOLT-INFO: dynamic loads simplified: 4317
BOLT-INFO: dynamic loads found: 61577
BOLT-INFO: inlined 1227 calls at 18 call sites in 2 iteration(s). Change in binary size: 4 bytes.
BOLT-INFO: 4879 PLT calls in the binary were optimized.
BOLT-INFO: basic block reordering modified layout of 3729 (2.73%) functions
BOLT-INFO: UCE removed 1 blocks and 7 bytes of code.
BOLT-INFO: splitting separates 3226174 hot bytes from 7737417 cold bytes (29.43% of split functions is hot).
BOLT-INFO: 106 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 5975 to 650
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

            17279782 : executed forward branches
             1942886 : taken forward branches
             2900344 : executed backward branches
             1779625 : taken backward branches
              855760 : executed unconditional branches
             1686232 : all function calls
              571541 : indirect calls
              243850 : PLT calls
           163314338 : executed instructions
            38492046 : executed load instructions
            20762991 : executed store instructions
              224132 : taken jump table branches
                   0 : taken unknown indirect branches
            21035886 : total branches
             4578271 : taken branches
            16457615 : non-taken conditional branches
             3722511 : taken conditional branches
            20180126 : all conditional branches

            16810312 : executed forward branches (-2.7%)
              824937 : taken forward branches (-57.5%)
             3369814 : executed backward branches (+16.2%)
             1647148 : taken backward branches (-7.4%)
              599903 : executed unconditional branches (-29.9%)
             1441570 : all function calls (-14.5%)
              571541 : indirect calls (=)
                   0 : PLT calls (-100.0%)
           162404688 : executed instructions (-0.6%)
            38488076 : executed load instructions (-0.0%)
            20762991 : executed store instructions (=)
              224132 : taken jump table branches (=)
                   0 : taken unknown indirect branches (=)
            20780029 : total branches (-1.2%)
             3071988 : taken branches (-32.9%)
            17708041 : non-taken conditional branches (+7.6%)
             2472085 : taken conditional branches (-33.6%)
            20180126 : all conditional branches (=)

BOLT-INFO: SCTC: patched 8 tail calls (8 forward) tail calls (0 backward) from a total of 8 while removing 0 double jumps and removing 8 basic blocks totalling 40 bytes of code. CTCs total execution count is 1207 and the number of times CTCs are taken is 1164.
BOLT-INFO: setting __hot_start to 0x5400000
BOLT-INFO: setting __hot_end to 0x59d53e5

I am measuring a 5.5% improvement on top of the PGO binary (compared with the roughly 9-10% I was seeing before):

pgo-labels-bolt-compiler -> average(507.406)
pgo-labels-compiler -> average(537.33)
Metric: time
Group 1 mean = 537.330005 ± 1.036598
Group 2 mean = 507.406000 ± 3.630159
P value      = 2.01e-05
Diff mean (95% CI)  = -29.9240 ± 3.5663
Percent   (95% CI) = -5.5690% (± 0.6637%)
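
The percentage in the last line follows directly from the two group means. A minimal sketch that reproduces it, assuming the means and the confidence-interval half-width are copied by hand from the output above (the helper name `percent_change` and the variable names are illustrative, not part of any benchmarking tool):

```python
# Values copied from the benchmark output above (time, lower is better).
baseline_mean = 537.330005   # pgo-labels-compiler (Group 1)
bolt_mean = 507.406000       # pgo-labels-bolt-compiler (Group 2)
ci_half_width = 3.5663       # "Diff mean (95% CI)" half-width


def percent_change(before: float, after: float) -> float:
    """Relative change of `after` with respect to `before`, in percent."""
    return (after - before) / before * 100.0


diff = bolt_mean - baseline_mean
pct = percent_change(baseline_mean, bolt_mean)
pct_ci = ci_half_width / baseline_mean * 100.0

print(f"Diff mean = {diff:.4f}")
print(f"Percent   = {pct:.4f}% (± {pct_ci:.4f}%)")
```

A negative percentage here means the BOLT-optimized binary is faster than the PGO-only baseline.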

maksfb commented 2 years ago

Thanks for letting us know. Are you optimizing the same clang-15 binary as before? Do you have dynostats from the previous BOLT runs where you saw larger gains?

rlavaee commented 2 years ago

Unfortunately, I no longer have stats from the builds with larger gains. Also, my old perf2bolt (compiled about a year ago from the incubator repo) fails to run on this binary.

PERF2BOLT: out of range traces involving unknown regions: 2688310 (12.7%)
perf2bolt: $$$$/bolt/src/BinaryContext.cpp:764: void llvm::bolt::BinaryContext::populateJumpTables(): Assertion `0 && "unclaimed PC-relative relocations left in data\n"' failed.
 #0 0x0000559a4c4e3fb0 PrintStackTraceSignalHandler(void*) Signals.cpp:0:0
 #1 0x0000559a4c4e1d4e SignalHandler(int) Signals.cpp:0:0
 #2 0x00007f7385626200 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x12200)
 #3 0x00007f738509b8a1 raise ./signal/../sysdeps/unix/sysv/linux/raise.c:50:1
 #4 0x00007f7385085546 abort ./stdlib/abort.c:81:7
 #5 0x00007f738508542f get_sysdep_segment_value ./intl/loadmsgcat.c:509:8
 #6 0x00007f738508542f _nl_load_domain ./intl/loadmsgcat.c:970:34
 #7 0x00007f7385094222 (/lib/x86_64-linux-gnu/libc.so.6+0x31222)
 #8 0x0000559a4b6e3593 llvm::bolt::BinaryContext::populateJumpTables() (${HOME}/copt/build/bolt_binaries/perf2bolt+0x231593)
 #9 0x0000559a4b7b9831 llvm::bolt::RewriteInstance::disassembleFunctions() (${HOME}/copt/build/bolt_binaries/perf2bolt+0x307831)
#10 0x0000559a4b8121ea llvm::bolt::RewriteInstance::run() (${HOME}/copt/build/bolt_binaries/perf2bolt+0x3601ea)
#11 0x0000559a4b6685e9 main (${HOME}/copt/build/bolt_binaries/perf2bolt+0x1b65e9)
#12 0x00007f73850867fd __libc_start_main ./csu/../csu/libc-start.c:332:16
#13 0x0000559a4b6bf4da _start (${HOME}/copt/build/bolt_binaries/perf2bolt+0x20d4da)
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.      Program arguments: ${HOME}/copt/build/bolt_binaries/perf2bolt -o pgo-labels.fdata -w pgo-labels-compiler.yaml -p pgo-labels.perfdata ${HOME}/copt/source/llvm-project/relwithdeb/pgo-labels/build/bin/clang-15

maksfb commented 2 years ago

Can you try -strict=0?

rlavaee commented 2 years ago

The old llvm-bolt works with -strict=0, but I am getting a regression, and the dyno-stats are consistent with a regression:

            10634721 : executed forward branches
                1291 : taken forward branches
             3094667 : executed backward branches
                 348 : taken backward branches
             1950602 : executed unconditional branches
             5528009 : all function calls
             1434116 : indirect calls
             1177373 : PLT calls
           112346938 : executed instructions
            26424638 : executed load instructions
            12891276 : executed store instructions
               56880 : taken jump table branches
                   0 : taken unknown indirect branches
            15679990 : total branches
             1952241 : taken branches
            13727749 : non-taken conditional branches
                1639 : taken conditional branches
            13729388 : all conditional branches

            11948700 : executed forward branches (+12.4%)
                 908 : taken forward branches (-29.7%)
             1780688 : executed backward branches (-42.5%)
                1283 : taken backward branches (+268.7%)
             1901836 : executed unconditional branches (-2.5%)
             4348783 : all function calls (-21.3%)
             1434119 : indirect calls (+0.0%)
                   0 : PLT calls (-100.0%)
           111194037 : executed instructions (-1.0%)
            26414195 : executed load instructions (-0.0%)
            12891276 : executed store instructions (=)
               56880 : taken jump table branches (=)
                   0 : taken unknown indirect branches (=)
            15631224 : total branches (-0.3%)
             1904027 : taken branches (-2.5%)
            13727197 : non-taken conditional branches (-0.0%)
                2191 : taken conditional branches (+33.7%)
            13729388 : all conditional branches (=)

maksfb commented 2 years ago

The latest dynostats you posted are much worse than the ones in the original post: -2.5% taken branches versus -32.9%. It looks as if the profile was collected on a different run or binary.

Are you running the experiments on the same hardware as the old ones?
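
The two percentages compared above are the relative change in the "taken branches" counter between the first and second dyno-stats tables of each run. A minimal sketch of that arithmetic, assuming the counts are copied by hand from the two dumps in this thread (the helper name `percent_change` and the dictionary labels are illustrative and not part of BOLT):

```python
def percent_change(before: int, after: int) -> float:
    """Relative change of `after` with respect to `before`, in percent."""
    return (after - before) / before * 100.0


# (before, after) "taken branches" counts copied from the dyno-stats dumps.
taken_branches = {
    "upstream BOLT (original post)": (4578271, 3071988),
    "old BOLT with -strict=0": (1952241, 1904027),
}

changes = {
    label: percent_change(before, after)
    for label, (before, after) in taken_branches.items()
}

for label, change in changes.items():
    print(f"{label}: {change:+.1f}% taken branches")
```

The same helper can be applied to any other counter in the tables, for example "taken forward branches" or "all function calls".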

rlavaee commented 2 years ago

I regenerated the results with a new profile, this time making sure the binary has a build ID. The perf2bolt logs do not suggest any significant profile mismatches.

> perf2bolt -strict=0 -o pgo-labels.fdata -w pgo-labels-compiler.yaml -p pgo-labels.perfdata pgo-labels/build/bin/clang-15
BOLT-INFO: shared object or position-independent executable detected
PERF2BOLT: Starting data aggregation job for pgo-labels.perfdata
PERF2BOLT: spawning perf job to read branch events
PERF2BOLT: spawning perf job to read mem events
PERF2BOLT: spawning perf job to read process events
PERF2BOLT: spawning perf job to read task events
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 27e6ac10524f80dcddf710a1d6bc2e04481a6040
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0x5e00000, offset 0x5e00000
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-INFO: pre-processing profile using perf data aggregator
**BOLT-INFO: binary build-id is:     39c42271602dbdd3**
PERF2BOLT: spawning perf job to read buildid list
PERF2BOLT: matched build-id and file name
PERF2BOLT: waiting for perf mmap events collection to finish...
PERF2BOLT: parsing perf-script mmap events output
PERF2BOLT: waiting for perf task events collection to finish...
PERF2BOLT: parsing perf-script task events output
PERF2BOLT: input binary is associated with 108 PID(s)
PERF2BOLT: waiting for perf events collection to finish...
PERF2BOLT: parse branch events...
PERF2BOLT: read 688275 samples and 21945200 LBR entries
PERF2BOLT: 310 samples (0.0%) were ignored
PERF2BOLT: traces mismatching disassembled function contents: 576263 (2.7%)
PERF2BOLT: out of range traces involving unknown regions: 2718557 (12.8%)
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN5clang13TreeTransformIN12_GLOBAL__N_120TemplateInstantiatorEE25TransformCXXNamedCastExprEPNS_16CXXNamedCastExprE.__uniq.55632760368638704870153814335850836202/1(*2)
BOLT-WARNING: 3 collisions detected while hashing binary objects. Use -v=1 to see the list.
PERF2BOLT: processing branch events...
PERF2BOLT: wrote 332051 objects and 0 memory objects to pgo-labels.fdata

The dyno-stats are still similar.

BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 27e6ac10524f80dcddf710a1d6bc2e04481a6040
BOLT-INFO: first alloc address is 0x0
BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-INFO: pre-processing profile using YAML profile reader
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN5clang13TreeTransformIN12_GLOBAL__N_120TemplateInstantiatorEE25TransformCXXNamedCastExprEPNS_16CXXNamedCastExprE.__uniq.55632760368638704870153814335850836202/1(*2)
BOLT-WARNING: 3 collisions detected while hashing binary objects. Use -v=1 to see the list.
BOLT-INFO: 9903 out of 137229 functions in the binary (7.2%) have non-empty execution profile
BOLT-INFO: 480 functions with profile could not be optimized
BOLT-INFO: the input contains 8025 (dynamic count : 231245) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: ICF folded 365 out of 137544 functions in 3 passes. 1 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 97.32 KB of code space. Folded functions were called 472 times based on profile.
BOLT-INFO: simplified 183 out of 4219 loads from a statically computed address.
BOLT-INFO: dynamic loads simplified: 10292
BOLT-INFO: dynamic loads found: 121571
BOLT-INFO: inlined 1834 calls at 95 call sites in 2 iteration(s). Change in binary size: -357 bytes.
BOLT-INFO: 10457 PLT calls in the binary were optimized.
BOLT-INFO: basic block reordering modified layout of 6324 (4.61%) functions
BOLT-INFO: UCE removed 0 blocks and 0 bytes of code.
BOLT-INFO: splitting separates 9654427 hot bytes from 9380539 cold bytes (50.72% of split functions is hot).
BOLT-INFO: 244 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 10018 to 5568
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

            10715216 : executed forward branches
                1378 : taken forward branches
             3122199 : executed backward branches
                 320 : taken backward branches
             1963736 : executed unconditional branches
             5559540 : all function calls
             1450560 : indirect calls
             1188502 : PLT calls
           113164008 : executed instructions
            26583888 : executed load instructions
            12976293 : executed store instructions
               58296 : taken jump table branches
                   0 : taken unknown indirect branches
            15801151 : total branches
             1965434 : taken branches
            13835717 : non-taken conditional branches
                1698 : taken conditional branches
            13837415 : all conditional branches

            12138932 : executed forward branches (+13.3%)
                 929 : taken forward branches (-32.6%)
             1698483 : executed backward branches (-45.6%)
                1178 : taken backward branches (+268.1%)
             1963975 : executed unconditional branches (+0.0%)
             4369204 : all function calls (-21.4%)
             1450560 : indirect calls (=)
                   0 : PLT calls (-100.0%)
           112169363 : executed instructions (-0.9%)
            26573596 : executed load instructions (-0.0%)
            12976293 : executed store instructions (=)
               58296 : taken jump table branches (=)
                   0 : taken unknown indirect branches (=)
            15801390 : total branches (+0.0%)
             1966082 : taken branches (+0.0%)
            13835308 : non-taken conditional branches (-0.0%)
                2107 : taken conditional branches (+24.1%)
            13837415 : all conditional branches (=)

BOLT-INFO: SCTC: patched 27 tail calls (24 forward) tail calls (3 backward) from a total of 29 while removing 1 double jumps and removing 22 basic blocks totalling 110 bytes of code. CTCs total execution count is 20 and the number of times CTCs are taken is 14.
BOLT-INFO: setting __hot_start to 0x5e00000
BOLT-INFO: setting __hot_end to 0x6c7573f
BOLT-INFO: patched build-id (flipped last bit)
    Command being timed: "bolt_binaries/llvm-bolt -strict=0 pgo-labels/build/bin/clang-15 -o pgo-labels/build/bin/clang-15-bolt -b pgo-labels-compiler.yaml -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions=3 -split-all-cold -dyno-stats -icf=1 -use-gnu-stack -inline-small-functions -simplify-rodata-loads -plt=hot"

Next I'll generate results with upstream BOLT.
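
One more number that differs visibly between the logs posted so far is the hot share reported by function splitting. A minimal sketch that extracts it from the "splitting separates" lines, assuming the two lines are copied verbatim from the llvm-bolt logs in this thread (the labels, the regular expression, and the helper name `hot_share` are illustrative; the two runs used different profiles, so this is only a rough comparison):

```python
import re

# Log lines copied verbatim from the two llvm-bolt runs in this thread.
log_lines = {
    "upstream BOLT (original post)": (
        "BOLT-INFO: splitting separates 3226174 hot bytes from 7737417 "
        "cold bytes (29.43% of split functions is hot)."
    ),
    "old BOLT with -strict=0": (
        "BOLT-INFO: splitting separates 9654427 hot bytes from 9380539 "
        "cold bytes (50.72% of split functions is hot)."
    ),
}

SPLIT_RE = re.compile(r"splitting separates (\d+) hot bytes from (\d+) cold bytes")


def hot_share(line: str) -> float:
    """Return hot bytes as a percentage of all bytes in split functions."""
    match = SPLIT_RE.search(line)
    if match is None:
        raise ValueError("not a splitting summary line")
    hot, cold = (int(group) for group in match.groups())
    return hot / (hot + cold) * 100.0


shares = {label: hot_share(line) for label, line in log_lines.items()}

for label, share in shares.items():
    print(f"{label}: {share:.2f}% of split-function bytes are hot")
```

The computed values match the percentages that BOLT prints in parentheses on the same lines.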

maksfb commented 2 years ago

Thanks, Rahman. How was the input binary built?

rlavaee commented 2 years ago

The input binary is built with PGO and -Wl,-q, though it does have the extra SHT_LLVM_BB_ADDR_MAP section (generated with -fbasic-block-sections=labels, which we have never found to be an issue). The full cmake command is below:

cmake -G Ninja -DLLVM_OPTIMIZED_TABLEGEN=On -DCMAKE_BUILD_TYPE=RelWithDebInfo -DLLVM_ENABLE_EH=On -DLLVM_ENABLE_RTTI=On -DLLVM_ENABLE_LLD="On" -DCMAKE_LINKER="lld" -DLLVM_TARGETS_TO_BUILD="X86" -DCMAKE_C_COMPILER="stage1/install/bin/clang" -DCMAKE_CXX_COMPILER="stage1/install/bin/clang++" -DCMAKE_ASM_COMPILER="stage1/install/bin/clang" -DLLVM_PROFDATA_FILE=source/llvm-project/relwithdeb/stage-pgo-labels.profdata -DLLVM_ENABLE_LTO=Thin -DCMAKE_C_FLAGS="-fdebug-compilation-dir=/proc/self/cwd -funique-internal-linkage-names -fbasic-block-sections=labels" -DCMAKE_CXX_FLAGS="-fdebug-compilation-dir=/proc/self/cwd -funique-internal-linkage-names -fbasic-block-sections=labels" -DCMAKE_EXE_LINKER_FLAGS="-fuse-ld=lld -Wl,-gc-sections -Wl,--lto-basic-block-sections=labels -Wl,-z,keep-text-section-prefix -Wl,-q -Wl,-build-id" -DCMAKE_SHARED_LINKER_FLAGS="-fuse-ld=lld -Wl,-gc-sections -Wl,--lto-basic-block-sections=labels -Wl,-z,keep-text-section-prefix -Wl,-q -Wl,-build-id" -DCMAKE_MODULE_LINKER_FLAGS="-fuse-ld=lld -Wl,-gc-sections -Wl,--lto-basic-block-sections=labels -Wl,-z,keep-text-section-prefix -Wl,-q -Wl,-build-id" -DLLVM_ENABLE_PROJECTS="clang;compiler-rt;lld" source/llvm-project/llvm

Upstream BOLT results, regenerated:

> perf2bolt -o pgo-labels.fdata -w pgo-labels-compiler.yaml -p pgo-labels.perfdata pgo-labels/build/bin/clang-15

PERF2BOLT: Starting data aggregation job for pgo-labels.perfdata                                 
PERF2BOLT: spawning perf job to read branch events
PERF2BOLT: spawning perf job to read mem events        
PERF2BOLT: spawning perf job to read process events
PERF2BOLT: spawning perf job to read task events           
BOLT-INFO: Target architecture: x86_64                                                                                                                                                                                                                                                                                                                                                                                                                                           
BOLT-INFO: BOLT version: 3f028c02ba6a24b7230fd5907a2b7ba076664a8b                          
BOLT-INFO: first alloc address is 0x0                                                         
BOLT-INFO: creating new program header table at address 0x5e00000, offset 0x5e00000
BOLT-INFO: enabling relocation mode                                                                                                                                                                                                     
BOLT-INFO: enabling strict relocation mode for aggregation purposes
BOLT-WARNING: Failed to analyze 4027 relocations  
BOLT-INFO: pre-processing profile using perf data aggregator                               
BOLT-INFO: binary build-id is:     39c42271602dbdd3                                                                                                                                                                                     
PERF2BOLT: spawning perf job to read buildid list                              
PERF2BOLT: matched build-id and file name                                                              
PERF2BOLT: waiting for perf mmap events collection to finish...                                                                                                                                                                         
PERF2BOLT: parsing perf-script mmap events output                                                  
PERF2BOLT: waiting for perf task events collection to finish...
PERF2BOLT: parsing perf-script task events output                          
PERF2BOLT: input binary is associated with 108 PID(s)
PERF2BOLT: waiting for perf events collection to finish...                                                   
PERF2BOLT: parse branch events...                           
PERF2BOLT: read 688275 samples and 21945200 LBR entries           
PERF2BOLT: 310 samples (0.0%) were ignored                                                                                                                                                                                                                                                                                                                                                                                                                                       
PERF2BOLT: traces mismatching disassembled function contents: 6810 (0.0%)                                                                                                                                                                                                                                                                                                                                                                                                        
PERF2BOLT: out of range traces involving unknown regions: 2717188 (12.8%)
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN5clang13TreeTransformIN12_GLOBAL__N_120TemplateInstantiatorEE25TransformCXXNamedCastExprEPNS_16CXXNamedCastExprE.__uniq.55632760368638704870153814335850836202/1(*2)
BOLT-WARNING: 4 collisions detected while hashing binary objects. Use -v=1 to see the list.
PERF2BOLT: processing branch events...                                                                                                                                                                                                                                                                                                                                                                                                                                           
PERF2BOLT: wrote 511380 objects and 0 memory objects to pgo-labels.fdata
> llvm-bolt pgo-labels/build/bin/clang-15 -o pgo-labels/build/bin/clang-15-bolt -b pgo-labels-compiler.yaml -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions -split-all-cold -dyno-stats -icf=1 -use-gnu-stack -inline-small-functions -simplify-rodata-loads -plt=hot
BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 3f028c02ba6a24b7230fd5907a2b7ba076664a8b
BOLT-INFO: first alloc address is 0x0
BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: Failed to analyze 4027 relocations
BOLT-INFO: pre-processing profile using YAML profile reader
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN5clang13TreeTransformIN12_GLOBAL__N_120TemplateInstantiatorEE25TransformCXXNamedCastExprEPNS_16CXXNamedCastExprE.__uniq.55632760368638704870153814335850836202/1(*2)
BOLT-WARNING: 2 collisions detected while hashing binary objects. Use -v=1 to see the list.
BOLT-INFO: 10357 out of 137229 functions in the binary (7.5%) have non-empty execution profile
BOLT-INFO: 506 functions with profile could not be optimized
BOLT-INFO: the input contains 7960 (dynamic count : 463293) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 671866 instructions were shortened
BOLT-INFO: removed 1669 empty blocks
BOLT-INFO: ICF folded 725 out of 137544 functions in 4 passes. 1 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 114.55 KB of code space. Folded functions were called 140176 times based on profile.
BOLT-INFO: simplified 181 out of 4488 loads from a statically computed address.
BOLT-INFO: dynamic loads simplified: 4963
BOLT-INFO: dynamic loads found: 86997
BOLT-INFO: inlined 1231 calls at 21 call sites in 2 iteration(s). Change in binary size: -14 bytes.
BOLT-INFO: 9068 PLT calls in the binary were optimized.
BOLT-INFO: basic block reordering modified layout of 7074 (5.17%) functions
BOLT-INFO: UCE removed 1 blocks and 7 bytes of code.
BOLT-INFO: splitting separates 11653450 hot bytes from 7951531 cold bytes (59.44% of split functions is hot).
BOLT-INFO: 180 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 10137 to 1147
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

            26442260 : executed forward branches
             3464392 : taken forward branches
             5235586 : executed backward branches
             2954773 : taken backward branches
             1636833 : executed unconditional branches
             2158278 : all function calls
              736311 : indirect calls
              354131 : PLT calls
           215407051 : executed instructions
            54152504 : executed load instructions
            27757842 : executed store instructions
              291689 : taken jump table branches
                   0 : taken unknown indirect branches
            33314679 : total branches
             8055998 : taken branches
            25258681 : non-taken conditional branches
             6419165 : taken conditional branches
            31677846 : all conditional branches

            25133697 : executed forward branches (-4.9%)
             1515084 : taken forward branches (-56.3%)
             6544149 : executed backward branches (+25.0%)
             2831984 : taken backward branches (-4.2%)
             1153438 : executed unconditional branches (-29.5%)
             1803332 : all function calls (-16.4%)
              736311 : indirect calls (=)
                   0 : PLT calls (-100.0%)
           213764106 : executed instructions (-0.8%)
            54147887 : executed load instructions (-0.0%)
            27757842 : executed store instructions (=)
              291689 : taken jump table branches (=)
                   0 : taken unknown indirect branches (=)
            32831284 : total branches (-1.5%)
             5500506 : taken branches (-31.7%)
            27330778 : non-taken conditional branches (+8.2%)
             4347068 : taken conditional branches (-32.3%)
            31677846 : all conditional branches (=)

BOLT-INFO: SCTC: patched 25 tail calls (22 forward) tail calls (3 backward) from a total of 27 while removing 2 double jumps and removing 17 basic blocks totalling 85 bytes of code. CTCs total execution count is 1230 and the number of times CTCs are taken is 1203.
BOLT-INFO: setting __hot_start to 0x5e00000
BOLT-INFO: setting __hot_end to 0x6d6979f
BOLT-INFO: patched build-id (flipped last bit)
rlavaee commented 2 years ago

Let me rebuild with a pure PGO binary built only with -Wl,-q -Wl,-build-id.

rlavaee commented 2 years ago

Same story with the cleaner relocation-only Release build. cmake command: cmake -G Ninja -DLLVM_OPTIMIZED_TABLEGEN=On -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_EH=OFF -DLLVM_ENABLE_RTTI=OFF -DLLVM_ENABLE_LLD="On" -DCMAKE_LINKER="lld" -DLLVM_TARGETS_TO_BUILD="X86" -DCMAKE_C_COMPILER="stage1/install/bin/clang" -DCMAKE_CXX_COMPILER="stage1/install/bin/clang++" -DCMAKE_ASM_COMPILER="stage1/install/bin/clang" -DLLVM_PROFDATA_FILE=stage-pgo-relocs.profdata -DLLVM_ENABLE_LTO=Thin -DCMAKE_C_FLAGS="" -DCMAKE_CXX_FLAGS="" -DCMAKE_EXE_LINKER_FLAGS="-fuse-ld=lld -Wl,-q -Wl,-build-id" -DCMAKE_SHARED_LINKER_FLAGS="-fuse-ld=lld -Wl,-q -Wl,-build-id" -DCMAKE_MODULE_LINKER_FLAGS="-fuse-ld=lld -Wl,-q -Wl,-build-id" -DLLVM_ENABLE_PROJECTS="clang;compiler-rt;lld" copt/source/llvm-project/llvm

And dyno-stats:

> llvm-bolt pgo-relocs/build/bin/clang-15 -o pgo-relocs/build/bin/clang-15-bolt -b pgo-relocs-compiler.yaml -reorder-blocks=ext-tsp -reorder-functions=hfsort+ -split-functions -split-all-cold -dyno-stats -icf=1 -use-gnu-stack -inline-small-functions -simplify-rodata-loads -plt=hot

BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 3f028c02ba6a24b7230fd5907a2b7ba076664a8b
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: Failed to analyze 2637 relocations
BOLT-INFO: pre-processing profile using YAML profile reader
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _ZN5clang15StmtVisitorBaseISt11add_pointerN12_GLOBAL__N_117ScalarExprEmitterEPN4llvm5ValueEJEE5VisitEPNS_4StmtE.llvm.14822649050216680576/1(*2)
BOLT-INFO: 6034 out of 137017 functions in the binary (4.4%) have non-empty execution profile
BOLT-INFO: 349 functions with profile could not be optimized
BOLT-WARNING: 1 (0.0% of all profiled) function have invalid (possibly stale) profile. Use -report-stale to see the list.
BOLT-INFO: the input contains 4333 (dynamic count : 279032) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 368635 instructions were shortened
BOLT-INFO: removed 350 empty blocks
BOLT-INFO: ICF folded 439 out of 137323 functions in 3 passes. 1 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 66.58 KB of code space. Folded functions were called 111275 times based on profile.
BOLT-INFO: simplified 102 out of 3567 loads from a statically computed address.
BOLT-INFO: dynamic loads simplified: 4396
BOLT-INFO: dynamic loads found: 61706
BOLT-INFO: inlined 1276 calls at 14 call sites in 2 iteration(s). Change in binary size: 8 bytes.
BOLT-INFO: 4989 PLT calls in the binary were optimized.
BOLT-INFO: basic block reordering modified layout of 3703 (2.71%) functions
BOLT-INFO: UCE removed 1 blocks and 7 bytes of code.
BOLT-INFO: splitting separates 3226141 hot bytes from 7660865 cold bytes (29.63% of split functions is hot).
BOLT-INFO: 110 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 5943 to 699
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

            17239346 : executed forward branches
             1936233 : taken forward branches
             2894058 : executed backward branches
             1779156 : taken backward branches
              857096 : executed unconditional branches
             1677524 : all function calls
              570439 : indirect calls
              243989 : PLT calls
           163029058 : executed instructions
            38333435 : executed load instructions
            20638863 : executed store instructions
              224046 : taken jump table branches
                   0 : taken unknown indirect branches
            20990500 : total branches
             4572485 : taken branches
            16418015 : non-taken conditional branches
             3715389 : taken conditional branches
            20133404 : all conditional branches

            16770704 : executed forward branches (-2.7%)
              823127 : taken forward branches (-57.5%)
             3362700 : executed backward branches (+16.2%)
             1641544 : taken backward branches (-7.7%)
              596697 : executed unconditional branches (-30.4%)
             1432669 : all function calls (-14.6%)
              570439 : indirect calls (=)
                   0 : PLT calls (-100.0%)
           162115376 : executed instructions (-0.6%)
            38329392 : executed load instructions (-0.0%)
            20638863 : executed store instructions (=)
              224046 : taken jump table branches (=)
                   0 : taken unknown indirect branches (=)
            20730101 : total branches (-1.2%)
             3061368 : taken branches (-33.0%)
            17668733 : non-taken conditional branches (+7.6%)
             2464671 : taken conditional branches (-33.7%)
            20133404 : all conditional branches (=)

BOLT-INFO: SCTC: patched 9 tail calls (9 forward) tail calls (0 backward) from a total of 9 while removing 2 double jumps and removing 10 basic blocks totalling 50 bytes of code. CTCs total execution count is 1320 and the number of times CTCs are taken is 1253.
BOLT-INFO: setting __hot_start to 0x5400000
BOLT-INFO: setting __hot_end to 0x59d1365
BOLT-INFO: patched build-id (flipped last bit)
maksfb commented 2 years ago

What do you mean by the "same story"? The latest dynostats look reasonable to me. E.g. taken branches "-33.0%".
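For readers checking the numbers: the percentages in the second dynostats table look like plain relative changes against the first table, and the "hot" percentage in the splitting line looks like hot bytes over hot plus cold bytes. A minimal sketch that recomputes two figures from the pgo-relocs log above; `pct_change` and `hot_share` are hypothetical helper names, not part of BOLT:

```python
def pct_change(before: int, after: int) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def hot_share(hot_bytes: int, cold_bytes: int) -> float:
    """Share of split-function bytes that ended up in the hot part, in percent."""
    return hot_bytes / (hot_bytes + cold_bytes) * 100.0


# Values copied from the pgo-relocs llvm-bolt log in this thread.
taken_before = 4_572_485   # "taken branches" in the first dynostats table
taken_after = 3_061_368    # "taken branches" in the second dynostats table
print(f"taken branches: {pct_change(taken_before, taken_after):+.1f}%")

hot, cold = 3_226_141, 7_660_865  # from the "splitting separates ..." line
print(f"hot share of split functions: {hot_share(hot, cold):.2f}%")
```

Both results match what the log reports (-33.0% and 29.63%), so the dynostats deltas can be compared directly across runs as long as the profiles are comparable.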

rlavaee commented 2 years ago

Same story as in the first comment. So no regression, but the improvement is smaller than before (5.5%).

maksfb commented 2 years ago

Gotcha. Same hardware as before?

rlavaee commented 2 years ago

Yes. It would be great if you could also redo the build and compare with your previous numbers.

maksfb commented 2 years ago

Our previous evaluation was on an older version of Clang and I don't expect it to be different, but I can give it a go. It's entirely possible that smaller gains are seen with Clang-15 for a number of reasons.

maksfb commented 2 years ago

I'm still seeing 25%-30% gains with BOLT on baseline (-O3) Clang-15.

travisdowns commented 1 year ago

For reference, I find that BOLT gives little benefit when optimizing a clang-14 binary when that binary already has IR+CSIR PGO optimization (+ thin LTO).

With the llvm 14 version of BOLT I get about a 4% benefit and with the llvm 15 version (also doing all the PGO compiles with the same compiler) only about 1.5%. In both cases the target is the same: clang-14, but the toolchain used to compile it is either v14 or v15.

One possibility is that clang-15 w/PGO+LTO already does a better job of the optimizations that fall into BOLT's domain, rather than BOLT having regressed between v14 and v15.

aaupov commented 1 year ago

To clarify: what exactly does IR+CSIR PGO mean? Did you use two profiles in Clang?

travisdowns commented 1 year ago

To clarify: what exactly does IR+CSIR PGO mean? Did you use two profiles in Clang?

Build clang once with LLVM_BUILD_INSTRUMENTED=IR, run a benchmark and collect the profile. Then rebuild clang with that profile and LLVM_BUILD_INSTRUMENTED=CSIR, and run the same benchmark again to collect a context-sensitive profile. Merge the profiles together and build clang a third time pointing to the merged profile. Then BOLT.

All of this with thin LTO enabled.
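The three-build sequence described above can be sketched as a dry run that only assembles and prints the command lines. This is a rough outline under assumptions, not the exact commands used: the build directories, profile paths, and source path are hypothetical placeholders, the benchmark runs between steps are omitted, and only `LLVM_BUILD_INSTRUMENTED`, `LLVM_PROFDATA_FILE`, and `LLVM_ENABLE_LTO=Thin` come from this thread.

```python
import shlex


def cmake_cmd(build_dir: str, extra: list[str]) -> list[str]:
    """Assemble one cmake configure command; all paths are hypothetical."""
    base = [
        "cmake", "-G", "Ninja",
        "-S", "llvm-project/llvm",      # hypothetical source checkout
        "-B", build_dir,
        "-DCMAKE_BUILD_TYPE=Release",
        "-DLLVM_ENABLE_LTO=Thin",       # thin LTO in every stage
    ]
    return base + extra


def csir_pgo_plan() -> list[list[str]]:
    """Commands for IR PGO, then CSIR PGO, then the final optimized build."""
    return [
        # 1. Instrumented build; run the benchmark with it afterwards.
        cmake_cmd("build-ir", ["-DLLVM_BUILD_INSTRUMENTED=IR"]),
        # 2. Merge the raw IR profiles (hypothetical directory of .profraw files).
        ["llvm-profdata", "merge", "-output=ir.profdata", "ir-raw-profiles/"],
        # 3. Build using the IR profile, instrumented again for context-sensitive counts.
        cmake_cmd("build-csir", ["-DLLVM_BUILD_INSTRUMENTED=CSIR",
                                 "-DLLVM_PROFDATA_FILE=ir.profdata"]),
        # 4. Merge the IR profile with the raw CSIR profiles from the second benchmark run.
        ["llvm-profdata", "merge", "-output=merged.profdata",
         "ir.profdata", "csir-raw-profiles/"],
        # 5. Final build with the merged profile; this is the binary handed to BOLT.
        cmake_cmd("build-final", ["-DLLVM_PROFDATA_FILE=merged.profdata"]),
    ]


for cmd in csir_pgo_plan():
    print(shlex.join(cmd))
```

The final build would additionally need relocations preserved for BOLT, as in the `-Wl,-q` linker flags shown earlier in the thread.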

maksfb commented 1 year ago

To me, it sounds like CSIR should be doing code layout optimizations in the compiler similar to what BOLT does in the binary, plus better register allocation, at the cost of another profiling run and recompilation. The fact that you are still able to get 1.5% on top of that is actually quite surprising. It would be interesting to compare "IR+BOLT" vs "IR+CSIR+BOLT" to find out how much performance you are gaining from having CSIR in the middle.

travisdowns commented 1 year ago

Right, I think BOLT is already "context sensitive" in the sense of CSIR, since it works on the final binary, after all inlining: it couldn't really be anything other than context sensitive.

So perhaps a lot of the benefit of BOLT vs vanilla PGO actually comes from this angle: vanilla PGO (as I understand it) only counts statistics at the unexpanded source level, so a function (for example) has 1 set of statistics, even though it might be inlined into 100 call sites, and those call sites behave wildly differently. BOLT fixes this by its nature: every inlined copy is considered distinctly, and CSIR does a similar thing.
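A toy illustration of that point, with made-up numbers: one helper with a single branch is inlined at two call sites that behave in opposite ways. This is only a model of the idea and does not reflect how LLVM or BOLT actually store profile data; the site names and counts are invented.

```python
# Each observation is (call_site, branch_taken) for the same helper function.
samples = (
    [("site_a", True)] * 90 + [("site_a", False)] * 10    # mostly taken at site_a
    + [("site_b", True)] * 10 + [("site_b", False)] * 90  # mostly not taken at site_b
)


def taken_ratio(outcomes) -> float:
    """Fraction of observations in which the branch was taken."""
    outcomes = list(outcomes)
    return sum(1 for taken in outcomes if taken) / len(outcomes)


# Context-insensitive view: one set of counters for the function.
merged = taken_ratio(taken for _, taken in samples)

# Context-sensitive view: separate counters for each inlined copy.
per_site = {
    site: taken_ratio(taken for s, taken in samples if s == site)
    for site in ("site_a", "site_b")
}

print(f"merged: {merged:.2f}")
for site, ratio in per_site.items():
    print(f"{site}: {ratio:.2f}")
```

The merged counters say the branch is a coin flip, which gives the layout pass nothing to work with, while the per-site counters show a strongly biased branch in each copy.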

BOLT of course still has a lot more beyond that, since it does optimizations which LLVM doesn't do today.