Open ptr1337 opened 2 years ago
Here is an example with the latest commit from main:
Without --instrumentation-file-append-pid:
Binary which will be used for the example:
/usr/lib/gcc/x86_64-pc-linux-gnu/12/cc1
LD_PRELOAD=/usr/lib/libjemalloc.so ${BOLTPATH}/llvm-bolt \
--instrument \
--instrumentation-file=${FDATA}/${BINARY}.fdata \
${BINARYPATH}/${BINARY} \
-o ${BOLTBIN}/${BINARY} || (echo "Could not create instrumented binary"; exit 1)
Instrument binary with llvm-bolt
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: f3caa98e495188b03ea7f38ffb40f3955d785553
BOLT-INFO: first alloc address is 0x400000
BOLT-INFO: creating new program header table at address 0x2400000, offset 0x2000000
BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
BOLT-INFO: enabling relocation mode
BOLT-INFO: forcing -jump-tables=move for instrumentation
BOLT-INFO: enabling -align-macro-fusion=all since no profile was specified
BOLT-INFO: enabling lite mode
BOLT-WARNING: Failed to analyze 4 relocations
BOLT-INSTRUMENTER: Number of indirect call site descriptors: 45030
BOLT-INSTRUMENTER: Number of indirect call target descriptors: 42195
BOLT-INSTRUMENTER: Number of function descriptors: 42195
BOLT-INSTRUMENTER: Number of branch counters: 574441
BOLT-INSTRUMENTER: Number of ST leaf node counters: 266846
BOLT-INSTRUMENTER: Number of direct call counters: 386
BOLT-INSTRUMENTER: Total number of counters: 841673
BOLT-INSTRUMENTER: Total size of counters: 6733384 bytes (static alloc memory)
BOLT-INSTRUMENTER: Total size of string table emitted: 2186532 bytes in file
BOLT-INSTRUMENTER: Total size of descriptors: 55086932 bytes in file
BOLT-INSTRUMENTER: Profile will be saved to file /home/ptr1337/toolchain/bolt/fdata/cc1.fdata
BOLT-INFO: 0 out of 42578 functions in the binary (0.0%) have non-empty execution profile
BOLT-INFO: the input contains 5885 (dynamic count : 0) opportunities for macro-fusion optimization that are going to be fixed
BOLT-INFO: 437397 instructions were shortened
BOLT-INFO: removed 1342 empty blocks
BOLT-INFO: merged 14 duplicate CFG edges
BOLT-INFO: UCE removed 19763 blocks and 1121388 bytes of code.
BOLT-INFO: SCTC: patched 0 tail calls (0 forward) tail calls (0 backward) from a total of 0 while removing 0 double jumps and removing 0 basic blocks totalling 0 bytes of code. CTCs total execution count is 0 and the number of times CTCs are taken is 0.
BOLT-INFO: output linked against instrumentation runtime library, lib entry point is 0x5c583f0
BOLT-INFO: clear procedure is 0x5c52340
BOLT-INFO: setting _end to 0x210e9d8
BOLT-INFO: setting _end to 0x210e9d8
BOLT-INFO: patched build-id (flipped last bit)
Next I run a workload (compiling gcc) and then optimize the binary with the profile. Watching the profile size while the workload runs, you'll see big jumps between 300 KB and 50+ MB.
When the compile was done, the fdata file was around 8 MB.
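One way to observe the size jumps described above is to poll the profile file while the workload runs. The following is only a minimal sketch and not part of my actual setup; the helper names `profile_size_bytes` and `watch_profile` and the one-second default interval are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helpers (not part of BOLT) for polling the size of a profile file.

# Print the size of a file in bytes, or 0 if it does not exist yet.
profile_size_bytes() {
    if [ -f "$1" ]; then
        wc -c < "$1" | tr -d ' '
    else
        echo 0
    fi
}

# Print a timestamped size once per interval until interrupted with Ctrl-C.
watch_profile() {
    file=$1
    interval=${2:-1}
    while :; do
        printf '%s %s bytes\n' "$(date +%T)" "$(profile_size_bytes "$file")"
        sleep "$interval"
    done
}

# Example (run in a second terminal while the workload is running):
#   watch_profile "${FDATA}/${BINARY}.fdata" 1
```

If the size repeatedly drops and grows again, that would be consistent with each process overwriting the file rather than adding to it.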
LD_PRELOAD=/usr/lib/libjemalloc.so ${BOLTPATH}/llvm-bolt ${BOLTBIN}/${BINARY}.org \
--data ${BOLTBIN}/${BINARY}-combined.fdata \
-o ${BOLTBIN}/${BINARY}.bolt \
-split-functions \
-split-all-cold \
-split-eh \
-dyno-stats \
-reorder-functions=hfsort+ \
-icp-eliminate-loads \
-reorder-blocks=ext-tsp \
-icf || (echo "Could not optimize the binary"; exit 1)
Optimizing binary with generated profile
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: f3caa98e495188b03ea7f38ffb40f3955d785553
BOLT-INFO: first alloc address is 0x400000
BOLT-INFO: creating new program header table at address 0x2400000, offset 0x2000000
BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: Failed to analyze 4 relocations
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function _cpp_parse_expr
BOLT-INFO: 4969 out of 42578 functions in the binary (11.7%) have non-empty execution profile
BOLT-INFO: 44 functions with profile could not be optimized
BOLT-INFO: profile for 1 objects was ignored
BOLT-INFO: the input contains 1305 (dynamic count : 193784) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 108440 instructions were shortened
BOLT-INFO: removed 9942 empty blocks
BOLT-INFO: merged 3 duplicate CFG edges
BOLT-INFO: ICF folded 365 out of 42578 functions in 4 passes. 0 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 14.59 KB of code space. Folded functions were called 5304 times based on profile.
BOLT-INFO: basic block reordering modified layout of 2160 (5.12%) functions
BOLT-INFO: UCE removed 0 blocks and 0 bytes of code.
BOLT-INFO: splitting separates 1106816 hot bytes from 2586495 cold bytes (29.97% of split functions is hot).
BOLT-INFO: 365 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 4672 to 451
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:
7261683 : executed forward branches
915435 : taken forward branches
1237970 : executed backward branches
846764 : taken backward branches
302350 : executed unconditional branches
1292956 : all function calls
298107 : indirect calls
0 : PLT calls
56166826 : executed instructions
12635237 : executed load instructions
4072522 : executed store instructions
82800 : taken jump table branches
0 : taken unknown indirect branches
8802003 : total branches
2064549 : taken branches
6737454 : non-taken conditional branches
1762199 : taken conditional branches
8499653 : all conditional branches
6370405 : executed forward branches (-12.3%)
521763 : taken forward branches (-43.0%)
2129240 : executed backward branches (+72.0%)
770904 : taken backward branches (-9.0%)
250568 : executed unconditional branches (-17.1%)
1292956 : all function calls (=)
298107 : indirect calls (=)
0 : PLT calls (=)
56070842 : executed instructions (-0.2%)
12635237 : executed load instructions (=)
4072522 : executed store instructions (=)
82800 : taken jump table branches (=)
0 : taken unknown indirect branches (=)
8750213 : total branches (-0.6%)
1543235 : taken branches (-25.3%)
7206978 : non-taken conditional branches (+7.0%)
1292667 : taken conditional branches (-26.6%)
8499645 : all conditional branches (-0.0%)
BOLT-INFO: SCTC: patched 92 tail calls (80 forward) tail calls (12 backward) from a total of 92 while removing 11 double jumps and removing 62 basic blocks totalling 310 bytes of code. CTCs total execution count is 17440 and the number of times CTCs are taken is 7426.
BOLT-INFO: setting _end to 0x210e9d8
BOLT-INFO: setting _end to 0x210e9d8
BOLT-INFO: setting __hot_start to 0x2600000
BOLT-INFO: setting __hot_end to 0x27368c7
BOLT-INFO: patched build-id (flipped last bit)
You can find now your optimzed binary at /home/ptr1337/toolchain/bolt/bin
With --instrumentation-file-append-pid:
LD_PRELOAD=/usr/lib/libjemalloc.so ${BOLTPATH}/llvm-bolt \
--instrument \
--instrumentation-file-append-pid \
--instrumentation-file=${FDATA}/${BINARY}.fdata \
${BINARYPATH}/${BINARY} \
-o ${BOLTBIN}/${BINARY} || (echo "Could not create instrumented binary"; exit 1)
Instrument binary with llvm-bolt
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: f3caa98e495188b03ea7f38ffb40f3955d785553
BOLT-INFO: first alloc address is 0x400000
BOLT-INFO: creating new program header table at address 0x2400000, offset 0x2000000
BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
BOLT-INFO: enabling relocation mode
BOLT-INFO: forcing -jump-tables=move for instrumentation
BOLT-INFO: enabling -align-macro-fusion=all since no profile was specified
BOLT-INFO: enabling lite mode
BOLT-WARNING: Failed to analyze 4 relocations
BOLT-INSTRUMENTER: Number of indirect call site descriptors: 45030
BOLT-INSTRUMENTER: Number of indirect call target descriptors: 42195
BOLT-INSTRUMENTER: Number of function descriptors: 42195
BOLT-INSTRUMENTER: Number of branch counters: 574441
BOLT-INSTRUMENTER: Number of ST leaf node counters: 266846
BOLT-INSTRUMENTER: Number of direct call counters: 386
BOLT-INSTRUMENTER: Total number of counters: 841673
BOLT-INSTRUMENTER: Total size of counters: 6733384 bytes (static alloc memory)
BOLT-INSTRUMENTER: Total size of string table emitted: 2186532 bytes in file
BOLT-INSTRUMENTER: Total size of descriptors: 55086932 bytes in file
BOLT-INSTRUMENTER: Profile will be saved to file /home/ptr1337/toolchain/bolt/fdata/cc1.fdata
BOLT-INFO: 0 out of 42578 functions in the binary (0.0%) have non-empty execution profile
BOLT-INFO: the input contains 5885 (dynamic count : 0) opportunities for macro-fusion optimization that are going to be fixed
BOLT-INFO: 437397 instructions were shortened
BOLT-INFO: removed 1342 empty blocks
BOLT-INFO: merged 14 duplicate CFG edges
BOLT-INFO: UCE removed 19763 blocks and 1121388 bytes of code.
BOLT-INFO: SCTC: patched 0 tail calls (0 forward) tail calls (0 backward) from a total of 0 while removing 0 double jumps and removing 0 basic blocks totalling 0 bytes of code. CTCs total execution count is 0 and the number of times CTCs are taken is 0.
BOLT-INFO: output linked against instrumentation runtime library, lib entry point is 0x5c583f0
BOLT-INFO: clear procedure is 0x5c52340
BOLT-INFO: setting _end to 0x210e9d8
BOLT-INFO: setting _end to 0x210e9d8
BOLT-INFO: patched build-id (flipped last bit)
After running the workload for just 3 minutes, I have 20.9 GB of data gathered with --instrumentation-file-append-pid, which I will now merge.
After merging, cc1.fdata is 61 MB.
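The merge step itself is not shown above. A minimal sketch of it, assuming the `merge-fdata` tool that ships with BOLT is on `PATH` and that all per-process profiles share the `--instrumentation-file` path as a prefix. The function name `merge_profiles`, the `MERGE_FDATA` override, and the exact suffix of the per-PID files are assumptions, so check the actual file names before relying on the glob.

```shell
#!/bin/sh
# Tool used for merging; override MERGE_FDATA if merge-fdata is not on PATH.
MERGE_FDATA=${MERGE_FDATA:-merge-fdata}

# merge_profiles PREFIX OUTPUT
# Merges every file whose name starts with PREFIX into OUTPUT.
# OUTPUT should not itself match PREFIX*, or a re-run would merge it as input.
merge_profiles() {
    prefix=$1
    output=$2
    # Expand the glob into the positional parameters of this function.
    set -- "$prefix"*
    if [ ! -e "$1" ]; then
        echo "no profiles found matching ${prefix}*" >&2
        return 1
    fi
    # merge-fdata writes the merged profile to stdout.
    "$MERGE_FDATA" "$@" > "$output"
}

# Example, using the variables from the commands above:
#   merge_profiles "${FDATA}/${BINARY}.fdata" "${BOLTBIN}/${BINARY}-combined.fdata"
```

With many thousands of per-PID files the expanded argument list can exceed the system limit, in which case merging in batches would be needed.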
I used the same options as above to optimize the target, now with the other profile:
Optimizing binary with generated profile
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: f3caa98e495188b03ea7f38ffb40f3955d785553
BOLT-INFO: first alloc address is 0x400000
BOLT-INFO: creating new program header table at address 0x2400000, offset 0x2000000
BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: Failed to analyze 4 relocations
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function get_DW_TAG_name
BOLT-INFO: 14989 out of 42578 functions in the binary (35.2%) have non-empty execution profile
BOLT-INFO: 99 functions with profile could not be optimized
BOLT-INFO: profile for 1 objects was ignored
BOLT-INFO: the input contains 3648 (dynamic count : 1361313899) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 339602 instructions were shortened
BOLT-INFO: removed 17619 empty blocks
BOLT-INFO: merged 4 duplicate CFG edges
BOLT-INFO: ICF folded 1248 out of 42578 functions in 4 passes. 0 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 104.07 KB of code space. Folded functions were called 393930282 times based on profile.
BOLT-INFO: basic block reordering modified layout of 7658 (18.53%) functions
BOLT-INFO: UCE removed 0 blocks and 0 bytes of code.
BOLT-INFO: splitting separates 5337660 hot bytes from 5200983 cold bytes (50.65% of split functions is hot).
BOLT-INFO: 802 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 13856 to 3395
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:
151391987892 : executed forward branches
19476986419 : taken forward branches
19966266912 : executed backward branches
12976486585 : taken backward branches
5805983074 : executed unconditional branches
27161239781 : all function calls
3751089236 : indirect calls
0 : PLT calls
1099387641420 : executed instructions
298270895664 : executed load instructions
139101555672 : executed store instructions
2044189844 : taken jump table branches
0 : taken unknown indirect branches
177164237878 : total branches
38259456078 : taken branches
138904781800 : non-taken conditional branches
32453473004 : taken conditional branches
171358254804 : all conditional branches
147020862489 : executed forward branches (-2.9%)
11684918898 : taken forward branches (-40.0%)
24334482145 : executed backward branches (+21.9%)
9986780158 : taken backward branches (-23.0%)
5838845666 : executed unconditional branches (+0.6%)
27161239781 : all function calls (=)
3751089236 : indirect calls (=)
0 : PLT calls (=)
1099418182516 : executed instructions (+0.0%)
298270895664 : executed load instructions (=)
139101555672 : executed store instructions (=)
2044189844 : taken jump table branches (=)
0 : taken unknown indirect branches (=)
177194190300 : total branches (+0.0%)
27510544722 : taken branches (-28.1%)
149683645578 : non-taken conditional branches (+7.8%)
21671699056 : taken conditional branches (-33.2%)
171355344634 : all conditional branches (-0.0%)
BOLT-INFO: SCTC: patched 229 tail calls (202 forward) tail calls (27 backward) from a total of 229 while removing 15 double jumps and removing 169 basic blocks totalling 845 bytes of code. CTCs total execution count is 1406120689 and the number of times CTCs are taken is 751591871.
BOLT-INFO: setting _end to 0x210e9d8
BOLT-INFO: setting _end to 0x210e9d8
BOLT-INFO: setting __hot_start to 0x2600000
BOLT-INFO: setting __hot_end to 0x2ba3ff7
BOLT-INFO: patched build-id (flipped last bit)
You can find now your optimzed binary at /home/ptr1337/toolchain/bolt/bin
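To compare the two runs without reading the full logs, the profile-coverage line can be pulled out of each saved log. A small sketch, assuming the llvm-bolt output of each optimization run was saved to its own file; the file names in the example and the function name `profile_coverage` are hypothetical.

```shell
#!/bin/sh
# profile_coverage LOGFILE
# Prints "<profiled> <total> <percent>" from the BOLT-INFO coverage line.
profile_coverage() {
    awk '/have non-empty execution profile/ { print $2, $5, $10 }' "$1"
}

# Example:
#   profile_coverage single.log
#   profile_coverage append-pid.log
```

For the two optimization logs above, the relevant lines are 4969 out of 42578 functions (11.7%) without the flag and 14989 out of 42578 (35.2%) with it.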
Just some updated data.
Hey,
As already mentioned some time ago in https://github.com/ClangBuiltLinux/tc-build/issues/155#issuecomment-1106831385, the bug still seems to be present in the main branch.
If instrumenting a binary with:
The profile output does not get merged correctly automatically; only the latest process seems to be profiled correctly. For example, running a workload with the instrumented binary results in a profile that contains very few branches.
Example:
Running the same workload with the following options:
This results in a massive number of profiles, which take up a lot of space. Compiling clang with the instrumented binary produces over 100 GB of fdata files. After merging them with merge-fdata and using the profile with llvm-bolt, I get the following stats:
I think the discussion at https://github.com/ClangBuiltLinux/tc-build/issues/155 could also be helpful, since this bug is discussed there a bit.
The last tests I ran were with commit https://github.com/llvm/llvm-project/commit/96f6ec5090c2f7a1e4804693cbb84c29c574b3de, and there everything seems to be the same:
Regards.