llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

BOLT optimized clang and lld segfault #60045

Open dakkshesh07 opened 1 year ago

dakkshesh07 commented 1 year ago

I tried using BOLT to optimise clang and lld for compiling Linux kernels. However, the BOLT-optimised clang and lld segfault while compiling the Linux kernel (6.1.6), whereas the vanilla binaries compile the kernel without trouble. At the time of compilation, the LLVM source was at https://github.com/llvm/llvm-project/commit/10cdad4065d7d3b53be3e0f03a2d71951c2bacd6

I have attached build configuration bits from my script with logs from llvm-bolt below:

  1. Compile stage 1
    
    $ LLVM_BIN_DIR="/usr/lib/llvm-14/bin"
    $ OPT_FLAGS="-O3 -march=native -mtune=native -ffunction-sections -fdata-sections"
    $ OPT_FLAGS_LD="-Wl,-O3,--sort-common,--as-needed,-z,now -fuse-ld=$LLVM_BIN_DIR/ld.lld"
    $ cmake -G Ninja -Wno-dev --log-level=NOTICE \
    -DLLVM_TARGETS_TO_BUILD="X86" \
    -DLLVM_ENABLE_PROJECTS="clang;lld;compiler-rt;polly;openmp;bolt" \
    -DCMAKE_BUILD_TYPE=Release \
    -DCLANG_ENABLE_ARCMT=OFF \
    -DCLANG_ENABLE_STATIC_ANALYZER=OFF \
    -DCLANG_PLUGIN_SUPPORT=OFF \
    -DLLVM_ENABLE_BINDINGS=OFF \
    -DLLVM_ENABLE_OCAMLDOC=OFF \
    -DLLVM_INCLUDE_EXAMPLES=OFF \
    -DLLVM_EXTERNAL_CLANG_TOOLS_EXTRA_SOURCE_DIR= \
    -DLLVM_INCLUDE_TESTS=OFF \
    -DLLVM_INCLUDE_DOCS=OFF \
    -DLLVM_ENABLE_TERMINFO=OFF \
    -DCOMPILER_RT_BUILD_CRT=OFF \
    -DCOMPILER_RT_BUILD_SANITIZERS=OFF \
    -DCOMPILER_RT_BUILD_XRAY=OFF \
    -DCOMPILER_RT_BUILD_LIBFUZZER=OFF \
    -DLLVM_ENABLE_BACKTRACES=OFF \
    -DLLVM_ENABLE_WARNINGS=OFF \
    -DLLVM_ENABLE_LTO=Thin \
    -DLLVM_CCACHE_BUILD=ON \
    -DCMAKE_C_COMPILER="$LLVM_BIN_DIR"/clang \
    -DCMAKE_CXX_COMPILER="$LLVM_BIN_DIR"/clang++ \
    -DCMAKE_AR="$LLVM_BIN_DIR"/llvm-ar \
    -DCMAKE_NM="$LLVM_BIN_DIR"/llvm-nm \
    -DCMAKE_STRIP="$LLVM_BIN_DIR"/llvm-strip \
    -DLLVM_USE_LINKER="$LLVM_BIN_DIR"/ld.lld \
    -DCMAKE_LINKER="$LLVM_BIN_DIR"/ld.lld \
    -DCMAKE_OBJCOPY="$LLVM_BIN_DIR"/llvm-objcopy \
    -DCMAKE_OBJDUMP="$LLVM_BIN_DIR"/llvm-objdump \
    -DCMAKE_RANLIB="$LLVM_BIN_DIR"/llvm-ranlib \
    -DCMAKE_READELF="$LLVM_BIN_DIR"/llvm-readelf \
    -DCMAKE_ADDR2LINE="$LLVM_BIN_DIR"/llvm-addr2line \
    -DLLVM_PARALLEL_COMPILE_JOBS="$(nproc --all)" \
    -DLLVM_PARALLEL_LINK_JOBS="$(nproc --all)" \
    -DCMAKE_C_FLAGS="$OPT_FLAGS" \
    -DCMAKE_ASM_FLAGS="$OPT_FLAGS" \
    -DCMAKE_CXX_FLAGS="$OPT_FLAGS" \
    -DCMAKE_EXE_LINKER_FLAGS="$OPT_FLAGS_LD" \
    -DCMAKE_MODULE_LINKER_FLAGS="$OPT_FLAGS_LD" \
    -DCMAKE_SHARED_LINKER_FLAGS="$OPT_FLAGS_LD" \
    "../llvm-project/llvm"

    $ ninja -j"$(nproc --all)"

2. Compile stage 2
```bash
$ OPT_FLAGS="-march=x86-64 -mtune=generic -ffunction-sections -fdata-sections -flto=thin -fsplit-lto-unit -O3"
$ OPT_FLAGS_LD="-Wl,-O3,--sort-common,--as-needed,-z,now,--lto-O3 -fuse-ld=$STAGE1/ld.lld"

$ cmake -G Ninja -Wno-dev --log-level=NOTICE \
    -DCLANG_VENDOR="Neutron" \
    -DLLVM_TARGETS_TO_BUILD='AArch64;ARM;X86' \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_ENABLE_WARNINGS=OFF \
    -DLLVM_ENABLE_PROJECTS='clang;lld' \
    -DLLVM_BINUTILS_INCDIR="$BUILDDIR/binutils-gdb/include" \
    -DLLVM_ENABLE_PLUGINS=ON \
    -DCLANG_ENABLE_ARCMT=OFF \
    -DCLANG_ENABLE_STATIC_ANALYZER=OFF \
    -DCLANG_PLUGIN_SUPPORT=OFF \
    -DLLVM_ENABLE_BINDINGS=OFF \
    -DLLVM_ENABLE_OCAMLDOC=OFF \
    -DLLVM_EXTERNAL_CLANG_TOOLS_EXTRA_SOURCE_DIR='' \
    -DLLVM_INCLUDE_DOCS=OFF \
    -DLLVM_INCLUDE_EXAMPLES=OFF \
    -DLLVM_ENABLE_TERMINFO=OFF \
    -DLLVM_ENABLE_LTO=Thin \
    -DCMAKE_C_COMPILER="$STAGE1"/clang \
    -DCMAKE_CXX_COMPILER="$STAGE1"/clang++ \
    -DCMAKE_AR="$STAGE1"/llvm-ar \
    -DCMAKE_NM="$STAGE1"/llvm-nm \
    -DCMAKE_STRIP="$STAGE1"/llvm-strip \
    -DLLVM_USE_LINKER="$STAGE1"/ld.lld \
    -DCMAKE_LINKER="$STAGE1"/ld.lld \
    -DCMAKE_OBJCOPY="$STAGE1"/llvm-objcopy \
    -DCMAKE_OBJDUMP="$STAGE1"/llvm-objdump \
    -DCMAKE_RANLIB="$STAGE1"/llvm-ranlib \
    -DCMAKE_READELF="$STAGE1"/llvm-readelf \
    -DCMAKE_ADDR2LINE="$STAGE1"/llvm-addr2line \
    -DCLANG_TABLEGEN="$STAGE1"/clang-tblgen \
    -DLLVM_TABLEGEN="$STAGE1"/llvm-tblgen \
    -DLLVM_BUILD_INSTRUMENTED=IR \
    -DLLVM_BUILD_RUNTIME=OFF \
    -DLLVM_LINK_LLVM_DYLIB=ON \
    -DLLVM_VP_COUNTERS_PER_SITE=6 \
    -DLLVM_PARALLEL_COMPILE_JOBS="$(nproc --all)" \
    -DLLVM_PARALLEL_LINK_JOBS="$(nproc --all)" \
    -DCMAKE_C_FLAGS="$OPT_FLAGS" \
    -DCMAKE_ASM_FLAGS="$OPT_FLAGS" \
    -DCMAKE_CXX_FLAGS="$OPT_FLAGS" \
    -DCMAKE_EXE_LINKER_FLAGS="$OPT_FLAGS_LD" \
    -DCMAKE_MODULE_LINKER_FLAGS="$OPT_FLAGS_LD" \
    -DCMAKE_SHARED_LINKER_FLAGS="$OPT_FLAGS_LD" \
    -DCMAKE_INSTALL_PREFIX="$OUT/install" \
    "../llvm-project/llvm"

$ ninja install -j"$(nproc --all)"
```

  3. Compile Linux kernel to generate PGO profiles for stage 3
    
    $ export LLD_IN_TEST=1

    $ make distclean defconfig all -sj"$(nproc --all)" \
        LLVM=1 \
        LLVM_IAS=1 \
        CC="$STAGE2"/clang \
        LD="$STAGE2"/ld.lld \
        AR="$STAGE2"/llvm-ar \
        NM="$STAGE2"/llvm-nm \
        STRIP="$STAGE2"/llvm-strip \
        OBJCOPY="$STAGE2"/llvm-objcopy \
        OBJDUMP="$STAGE2"/llvm-objdump \
        READELF="$STAGE2"/llvm-readelf \
        HOSTCC="$STAGE2"/clang \
        HOSTCXX="$STAGE2"/clang++ \
        HOSTAR="$STAGE2"/llvm-ar \
        HOSTLD="$STAGE2"/ld.lld || exit ${?}

    $ make distclean defconfig all -sj"$(nproc --all)" \
        LLVM=1 \
        LLVM_IAS=1 \
        ARCH=arm64 \
        CC="$STAGE2"/clang \
        LD="$STAGE2"/ld.lld \
        AR="$STAGE2"/llvm-ar \
        NM="$STAGE2"/llvm-nm \
        STRIP="$STAGE2"/llvm-strip \
        OBJCOPY="$STAGE2"/llvm-objcopy \
        OBJDUMP="$STAGE2"/llvm-objdump \
        READELF="$STAGE2"/llvm-readelf \
        HOSTCC="$STAGE2"/clang \
        HOSTCXX="$STAGE2"/clang++ \
        HOSTAR="$STAGE2"/llvm-ar \
        HOSTLD="$STAGE2"/ld.lld \
        CROSS_COMPILE=aarch64-linux-gnu- || exit ${?}

    $ unset LLD_IN_TEST

    $ cd "$PROFILES"
    $ "$STAGE2"/llvm-profdata merge -output=clang.profdata ./*
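
Before merging, it can help to confirm that the training builds actually wrote raw profiles into `$PROFILES`. A minimal sketch follows; `count_files` is a hypothetical helper that is not part of the original script, and it only counts regular, non-hidden files directly inside a directory.

```bash
# Hypothetical helper (not in the original script): print the number of
# regular, non-hidden files directly inside the directory given as $1.
count_files() {
    local dir=$1 n=0 f
    for f in "$dir"/*; do
        # An unmatched glob stays literal and fails the -f test, so an
        # empty directory yields 0.
        [ -f "$f" ] && n=$((n + 1))
    done
    printf '%s\n' "$n"
}

# Possible use before the merge step:
#   [ "$(count_files "$PROFILES")" -gt 0 ] || { echo "no profiles in $PROFILES" >&2; exit 1; }
```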

4. Compile stage 3
```bash
$ OPT_FLAGS="-O3 -march=x86-64 -mtune=generic -ffunction-sections -fdata-sections -flto=full -falign-functions=32"
$ OPT_FLAGS_LD="-Wl,-O3,--sort-common,--as-needed,-z,now,--lto-O3 -fuse-ld=$STAGE1/ld.lld"
$ OPT_FLAGS_LD_EXE="$OPT_FLAGS_LD -Wl,-znow -Wl,--emit-relocs"
$ cmake -G Ninja -Wno-dev --log-level=NOTICE \
    -DCLANG_VENDOR="Neutron" \
    -DLLVM_TARGETS_TO_BUILD='AArch64;ARM;X86' \
    -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_ENABLE_WARNINGS=OFF \
    -DLLVM_ENABLE_PROJECTS='clang;lld;compiler-rt;polly;openmp' \
    -DLLVM_BINUTILS_INCDIR="$BUILDDIR/binutils-gdb/include" \
    -DLLVM_ENABLE_PLUGINS=ON \
    -DCLANG_ENABLE_ARCMT=OFF \
    -DCLANG_ENABLE_STATIC_ANALYZER=OFF \
    -DCLANG_PLUGIN_SUPPORT=OFF \
    -DLLVM_ENABLE_BINDINGS=OFF \
    -DLLVM_ENABLE_OCAMLDOC=OFF \
    -DLLVM_EXTERNAL_CLANG_TOOLS_EXTRA_SOURCE_DIR='' \
    -DLLVM_INCLUDE_DOCS=OFF \
    -DLLVM_INCLUDE_EXAMPLES=OFF \
    -DCOMPILER_RT_BUILD_LIBFUZZER=OFF \
    -DCOMPILER_RT_BUILD_CRT=OFF \
    -DCOMPILER_RT_BUILD_XRAY=OFF \
    -DLLVM_ENABLE_TERMINFO=OFF \
    -DLLVM_ENABLE_LTO=Full \
    -DCMAKE_C_COMPILER="$STAGE1"/clang \
    -DCMAKE_CXX_COMPILER="$STAGE1"/clang++ \
    -DCMAKE_AR="$STAGE1"/llvm-ar \
    -DCMAKE_NM="$STAGE1"/llvm-nm \
    -DCMAKE_STRIP="$STAGE1"/llvm-strip \
    -DLLVM_USE_LINKER="$STAGE1"/ld.lld \
    -DCMAKE_LINKER="$STAGE1"/ld.lld \
    -DCMAKE_OBJCOPY="$STAGE1"/llvm-objcopy \
    -DCMAKE_OBJDUMP="$STAGE1"/llvm-objdump \
    -DCMAKE_RANLIB="$STAGE1"/llvm-ranlib \
    -DCMAKE_READELF="$STAGE1"/llvm-readelf \
    -DCMAKE_ADDR2LINE="$STAGE1"/llvm-addr2line \
    -DCLANG_TABLEGEN="$STAGE1"/clang-tblgen \
    -DLLVM_TABLEGEN="$STAGE1"/llvm-tblgen \
    -DLLVM_PROFDATA_FILE="$PROFILES"/clang.profdata \
    -DLLVM_PARALLEL_COMPILE_JOBS="$(nproc --all)" \
    -DLLVM_PARALLEL_LINK_JOBS="$(nproc --all)" \
    -DCMAKE_C_FLAGS="$OPT_FLAGS" \
    -DCMAKE_ASM_FLAGS="$OPT_FLAGS" \
    -DCMAKE_CXX_FLAGS="$OPT_FLAGS" \
    -DCMAKE_EXE_LINKER_FLAGS="$OPT_FLAGS_LD_EXE" \
    -DCMAKE_MODULE_LINKER_FLAGS="$OPT_FLAGS_LD" \
    -DCMAKE_SHARED_LINKER_FLAGS="$OPT_FLAGS_LD" \
    -DCMAKE_INSTALL_PREFIX="$OUT/install" \
    "../llvm-project/llvm"
```

  5. Create BOLT instrumented clang and lld binaries
    
    $ CLANG_SUFFIX="clang-16"
    $ "$STAGE1"/llvm-bolt \
            --instrument \
            --instrumentation-file-append-pid \
            --instrumentation-file="${BOLT_PROFILES}/${CLANG_SUFFIX}.fdata" \
            "${STAGE3}/${CLANG_SUFFIX}" \
            -o "${STAGE3}/${CLANG_SUFFIX}.inst"

    BOLT-INFO: shared object or position-independent executable detected
    BOLT-INFO: Target architecture: x86_64
    BOLT-INFO: BOLT version: 10cdad4065d7d3b53be3e0f03a2d71951c2bacd6
    BOLT-INFO: first alloc address is 0x0
    BOLT-INFO: creating new program header table at address 0x5c00000, offset 0x5c00000
    BOLT-INFO: enabling relocation mode
    BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
    BOLT-INFO: forcing -jump-tables=move for instrumentation
    BOLT-INFO: enabling -align-macro-fusion=all since no profile was specified
    BOLT-INFO: enabling lite mode
    BOLT-WARNING: Failed to analyze 1111 relocations
    BOLT-INFO: 0 out of 106312 functions in the binary (0.0%) have non-empty execution profile
    BOLT-INFO: the input contains 17421 (dynamic count : 0) opportunities for macro-fusion optimization that are going to be fixed
    BOLT-INSTRUMENTER: Number of indirect call site descriptors: 39012
    BOLT-INSTRUMENTER: Number of indirect call target descriptors: 105526
    BOLT-INSTRUMENTER: Number of function descriptors: 105518
    BOLT-INSTRUMENTER: Number of branch counters: 1423216
    BOLT-INSTRUMENTER: Number of ST leaf node counters: 675009
    BOLT-INSTRUMENTER: Number of direct call counters: 0
    BOLT-INSTRUMENTER: Total number of counters: 2098225
    BOLT-INSTRUMENTER: Total size of counters: 16785800 bytes (static alloc memory)
    BOLT-INSTRUMENTER: Total size of string table emitted: 12686413 bytes in file
    BOLT-INSTRUMENTER: Total size of descriptors: 141668176 bytes in file
    BOLT-INSTRUMENTER: Profile will be saved to file /home/ubuntu/clang-build/llvm-build/stage3/bolt-prof/clang-16.fdata
    BOLT-INFO: 1137434 instructions were shortened
    BOLT-INFO: removed 149 empty blocks
    BOLT-INFO: UCE removed 700 blocks and 42368 bytes of code.
    BOLT-INFO: SCTC: patched 0 tail calls (0 forward) tail calls (0 backward) from a total of 0 while removing 0 double jumps and removing 0 basic blocks totalling 0 bytes of code. CTCs total execution count is 0 and the number of times CTCs are taken is 0.
    BOLT-INFO: output linked against instrumentation runtime library, lib entry point is 0xf4294d0
    BOLT-INFO: clear procedure is 0xf424370

    $ mv "${STAGE3}/${CLANG_SUFFIX}" "${STAGE3}/${CLANG_SUFFIX}.org"
    $ mv "${STAGE3}/${CLANG_SUFFIX}.inst" "${STAGE3}/${CLANG_SUFFIX}"

    $ "$STAGE1"/llvm-bolt \
            --instrument \
            --instrumentation-file-append-pid \
            --instrumentation-file="${BOLT_PROFILES_LLD}/lld.fdata" \
            "${STAGE3}/lld" \
            -o "${STAGE3}/lld.inst"

    BOLT-INFO: shared object or position-independent executable detected
    BOLT-INFO: Target architecture: x86_64
    BOLT-INFO: BOLT version: 10cdad4065d7d3b53be3e0f03a2d71951c2bacd6
    BOLT-INFO: first alloc address is 0x0
    BOLT-INFO: creating new program header table at address 0x4000000, offset 0x4000000
    BOLT-INFO: enabling relocation mode
    BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
    BOLT-INFO: forcing -jump-tables=move for instrumentation
    BOLT-INFO: enabling -align-macro-fusion=all since no profile was specified
    BOLT-INFO: enabling lite mode
    BOLT-WARNING: Failed to analyze 1604 relocations
    BOLT-INFO: 0 out of 83941 functions in the binary (0.0%) have non-empty execution profile
    BOLT-INFO: the input contains 12033 (dynamic count : 0) opportunities for macro-fusion optimization that are going to be fixed
    BOLT-INSTRUMENTER: Number of indirect call site descriptors: 35504
    BOLT-INSTRUMENTER: Number of indirect call target descriptors: 83483
    BOLT-INSTRUMENTER: Number of function descriptors: 83475
    BOLT-INSTRUMENTER: Number of branch counters: 1033785
    BOLT-INSTRUMENTER: Number of ST leaf node counters: 483721
    BOLT-INSTRUMENTER: Number of direct call counters: 0
    BOLT-INSTRUMENTER: Total number of counters: 1517506
    BOLT-INSTRUMENTER: Total size of counters: 12140048 bytes (static alloc memory)
    BOLT-INSTRUMENTER: Total size of string table emitted: 8224234 bytes in file
    BOLT-INSTRUMENTER: Total size of descriptors: 101280752 bytes in file
    BOLT-INSTRUMENTER: Profile will be saved to file /home/ubuntu/clang-build/llvm-build/stage3/bolt-prof-lld/lld.fdata
    BOLT-INFO: 728217 instructions were shortened
    BOLT-INFO: removed 121 empty blocks
    BOLT-INFO: UCE removed 442 blocks and 26956 bytes of code.
    BOLT-INFO: SCTC: patched 0 tail calls (0 forward) tail calls (0 backward) from a total of 0 while removing 0 double jumps and removing 0 basic blocks totalling 0 bytes of code. CTCs total execution count is 0 and the number of times CTCs are taken is 0.
    BOLT-INFO: output linked against instrumentation runtime library, lib entry point is 0xab1f4d0
    BOLT-INFO: clear procedure is 0xab1a370

    $ mv "${STAGE3}/lld" "${STAGE3}/lld.org"
    $ mv "${STAGE3}/lld.inst" "${STAGE3}/lld"
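
The two `mv` pairs in this step follow the same pattern, so they could be wrapped in one guarded function. The sketch below is an assumption about how that might look, not what the original script does; `swap_in_instrumented` is a hypothetical name.

```bash
# Hypothetical helper (not in the original script): back up BIN as BIN.org
# and move BIN.inst into its place, but only if BIN.inst exists and is
# non-empty, so a failed llvm-bolt run cannot leave BIN missing.
swap_in_instrumented() {
    local bin=$1
    if [ ! -s "${bin}.inst" ]; then
        echo "missing or empty: ${bin}.inst" >&2
        return 1
    fi
    mv "$bin" "${bin}.org" && mv "${bin}.inst" "$bin"
}

# Possible use, mirroring the mv pairs in this step:
#   swap_in_instrumented "${STAGE3}/${CLANG_SUFFIX}"
#   swap_in_instrumented "${STAGE3}/lld"
```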

6. Compile Linux kernel using BOLT instrumented binaries to generate profiles

Same as the PGO training in step 3, except that the STAGE3 instrumented binaries were used.
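
Since this step reuses the step 3 kernel build with a different toolchain directory, the shared tool overrides could be generated from one place. The sketch below assumes the step 3 arguments carry over unchanged; `kernel_tool_args` is a hypothetical name, not something from the original script.

```bash
# Hypothetical helper: print the tool overrides used in step 3, one per
# line, for the toolchain directory given as $1.
kernel_tool_args() {
    local t=$1
    printf '%s\n' \
        "CC=$t/clang" "LD=$t/ld.lld" "AR=$t/llvm-ar" "NM=$t/llvm-nm" \
        "STRIP=$t/llvm-strip" "OBJCOPY=$t/llvm-objcopy" \
        "OBJDUMP=$t/llvm-objdump" "READELF=$t/llvm-readelf" \
        "HOSTCC=$t/clang" "HOSTCXX=$t/clang++" "HOSTAR=$t/llvm-ar" \
        "HOSTLD=$t/ld.lld"
}

# Possible use (bash only; assumes the directory path contains no newlines):
#   mapfile -t tool_args < <(kernel_tool_args "$STAGE3")
#   make distclean defconfig all -sj"$(nproc --all)" LLVM=1 LLVM_IAS=1 "${tool_args[@]}"
```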

7. Merge clang fdata and optimize clang

    $ CLANG_SUFFIX="clang-16"
    $ cd "$BOLT_PROFILES"
    $ "$STAGE1"/merge-fdata -q ./*.fdata >combined.fdata

    Using legacy profile format.
    Merging data from ./clang-16.fdata.436378.fdata...
    Merging data from ./clang-16.fdata.438421.fdata...
    Merging data from ./clang-16.fdata.438462.fdata...
    ...
    Merging data from ./clang-16.fdata.788287.fdata...
    Merging data from ./clang-16.fdata.788288.fdata...
    Profile from 12331 files merged.

    $ "$STAGE1"/llvm-bolt "${STAGE3}/${CLANG_SUFFIX}.org" \
        --data "${BOLT_PROFILES}/combined.fdata" \
        -o "${STAGE3}/${CLANG_SUFFIX}.bolt" \
        --dyno-stats \
        --eliminate-unreachable \
        --frame-opt=hot \
        --icf=1 \
        --indirect-call-promotion=all \
        --inline-all \
        --inline-ap \
        --jump-tables=aggressive \
        --peepholes=all \
        --plt=hot \
        --reorder-blocks=ext-tsp \
        --reorder-functions-use-hot-size \
        --reorder-functions=hfsort+ \
        --split-all-cold \
        --split-eh \
        --split-functions \
        --tail-duplication=cache \
        --thread-count="$(nproc --all)" \
        --use-gnu-stack

    BOLT-INFO: shared object or position-independent executable detected
    BOLT-INFO: Target architecture: x86_64
    BOLT-INFO: BOLT version: 10cdad4065d7d3b53be3e0f03a2d71951c2bacd6
    BOLT-INFO: first alloc address is 0x0
    BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
    BOLT-INFO: enabling relocation mode
    BOLT-INFO: enabling lite mode
    BOLT-WARNING: Failed to analyze 1111 relocations
    BOLT-INFO: pre-processing profile using branch profile reader
    BOLT-INFO: 18379 out of 106312 functions in the binary (17.3%) have non-empty execution profile
    BOLT-INFO: 302 functions with profile could not be optimized
    BOLT-INFO: profile for 1 objects was ignored
    BOLT-INFO: the input contains 8572 (dynamic count : 29229461783) opportunities for macro-fusion optimization. Will fix instances on a hot path.
    BOLT-WARNING: 15 (0.1% of all profiled) functions have invalid (possibly stale) profile. Use -report-stale to see the list.
    BOLT-WARNING: 144552 out of 2138468131638 samples in the binary (0.0%) belong to functions with invalid (possibly stale) profile.
    BOLT-INFO: 694311 instructions were shortened
    BOLT-INFO: removed 846 empty blocks
    BOLT-INFO: ICF folded 2301 out of 106588 functions in 5 passes. 6 functions had jump tables.
    BOLT-INFO: Removing all identical functions will save 242.15 KB of code space. Folded functions were called 4352016281 times based on profile.
    BOLT-INFO: ICP Total indirect calls = 0, 0 callsites cover 99% of all indirect calls
    BOLT-INFO: ICP total indirect callsites with profile = 0
    BOLT-INFO: ICP total jump table callsites = 0
    BOLT-INFO: ICP total number of calls = 0
    BOLT-INFO: ICP percentage of calls that are indirect = -nan%
    BOLT-INFO: ICP percentage of indirect calls that can be optimized = 0.0%
    BOLT-INFO: ICP percentage of indirect callsites that are optimized = 0.0%
    BOLT-INFO: ICP number of method load elimination candidates = 0
    BOLT-INFO: ICP percentage of method calls candidates that have loads eliminated = 0.0%
    BOLT-INFO: ICP percentage of indirect branches that are optimized = 0.0%
    BOLT-INFO: ICP percentage of jump table callsites that are optimized = 0.0%
    BOLT-INFO: ICP number of jump table callsites that can use hot indices = 0
    BOLT-INFO: ICP percentage of jump table callsites that use hot indices = 0.0%
    BOLT-INFO: inlined 501665876 calls at 18375 call sites in 3 iteration(s). Change in binary size: 2981468 bytes.
    BOLT-INFO: 23796 PLT calls in the binary were optimized.
    BOLT-INFO: basic block reordering modified layout of 9928 functions (54.02% of profiled, 9.52% of total)
    BOLT-INFO: UCE removed 112 blocks and 71 bytes of code.
    BOLT-INFO: splitting separates 12016856 hot bytes from 13208918 cold bytes (47.64% of split functions is hot).
    BOLT-INFO: 233 Functions were reordered by LoopInversionPass
    BOLT-INFO: tail duplication modified 2535 (2.43%) functions; duplicated 3818 blocks (51832 bytes) responsible for 2549059045 dynamic executions (0.12% of all block executions)
    BOLT-INFO: hfsort+ reduced the number of chains from 16380 to 9688
    BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

   1357833952248 : executed forward branches
    180947030358 : taken forward branches
    267079843605 : executed backward branches
    154677391617 : taken backward branches
     64672148570 : executed unconditional branches
     86723229729 : all function calls
     29875752296 : indirect calls
     13865484243 : PLT calls
  10177524634287 : executed instructions
   2401095622828 : executed load instructions
   1046015823201 : executed store instructions
     25535247716 : taken jump table branches
               0 : taken unknown indirect branches
   1689585944423 : total branches
    400296570545 : taken branches
   1289289373878 : non-taken conditional branches
    335624421975 : taken conditional branches
   1624913795853 : all conditional branches

   1293449904292 : executed forward branches (-4.7%)
     74751821933 : taken forward branches (-58.7%)
    331437902583 : executed backward branches (+24.1%)
    151721946307 : taken backward branches (-1.9%)
     44428998543 : executed unconditional branches (-31.3%)
     72362545087 : all function calls (-16.6%)
     29875752271 : indirect calls (+0.0%)
               0 : PLT calls (-100.0%)
  10096329939759 : executed instructions (-0.8%)
   2401067889103 : executed load instructions (-0.0%)
   1046011776511 : executed store instructions (-0.0%)
     25535247716 : taken jump table branches (=)
               0 : taken unknown indirect branches (=)
   1669316805418 : total branches (-1.2%)
    270902766783 : taken branches (-32.3%)
   1398414038635 : non-taken conditional branches (+8.5%)
    226473768240 : taken conditional branches (-32.5%)
   1624887806875 : all conditional branches (-0.0%)

    BOLT-INFO: SCTC: patched 140 tail calls (127 forward) tail calls (13 backward) from a total of 140 while removing 22 double jumps and removing 115 basic blocks totalling 563 bytes of code. CTCs total execution count is 9679585 and the number of times CTCs are taken is 3567626.
    BOLT-INFO: Peephole: 17774 double jumps patched.
    BOLT-INFO: Peephole: 1225 tail call traps inserted.
    BOLT-INFO: Peephole: 0 useless conditional branches removed.
    BOLT-INFO: FOP optimized 44 redundant load(s) and 0 unused store(s)
    BOLT-INFO: Frequency of redundant loads is 731971700 and frequency of unused stores is 0
    BOLT-INFO: Frequency of loads changed to use a register is 731971700 and frequency of loads changed to use an immediate is 0
    BOLT-INFO: FOP deleted 26 load(s) (dyn count: 266998253) and 0 store(s)
    BOLT-INFO: FRAME ANALYSIS: 88209 function(s) were not optimized.
    BOLT-INFO: FRAME ANALYSIS: 8253 function(s) (91.6% dyn cov) could not have its frame indices restored.
    BOLT-INFO: Shrink wrapping moved 196 spills inserting load/stores and 33 spills inserting push/pops
    BOLT-INFO: Shrink wrapping reduced 9719675412 store executions (0.1% total instructions executed, 0.9% store instructions)
    BOLT-INFO: Shrink wrapping failed at reducing 0 store executions (0.0% total instructions executed, 0.0% store instructions)
    BOLT-INFO: Allocation combiner: 311 empty spaces coalesced (dyn count: 9879792604).
    BOLT-INFO: setting __hot_start to 0x5c00000
    BOLT-INFO: setting __hot_end to 0x6bc12e7

    $ rm -rf "${STAGE3}/${CLANG_SUFFIX:?}"
    $ mv "${STAGE3}/${CLANG_SUFFIX}.bolt" "${STAGE3}/${CLANG_SUFFIX}"
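
Because the optimisation command in this step enables many passes at once, one way to narrow down a miscompile is to apply each flag on its own to the `.org` binary and run a quick smoke test on the result. The stack dumps in step 9 show `clang -dM -E -x c /dev/null` already crashing, so that invocation is a natural smoke test. The sketch below is a generic driver under that assumption: `try_flags` and the `build`/`smoke` callbacks are hypothetical names, and flags that only take effect in combination with others may need to be grouped by hand.

```bash
# try_flags BUILD_FN SMOKE_FN FLAG...
# Hypothetical driver: for each FLAG, call BUILD_FN with that single flag,
# then SMOKE_FN; print every flag whose build succeeded but whose smoke
# test failed.
try_flags() {
    local build_fn=$1 smoke_fn=$2 flag
    shift 2
    for flag in "$@"; do
        if "$build_fn" "$flag" && ! "$smoke_fn"; then
            printf '%s\n' "$flag"
        fi
    done
}

# Possible wiring with the variables from this report (untested sketch):
#   out=$(mktemp -d)
#   build() { "$STAGE1"/llvm-bolt "${STAGE3}/${CLANG_SUFFIX}.org" \
#                 --data "${BOLT_PROFILES}/combined.fdata" \
#                 -o "$out/clang" "$1" >/dev/null 2>&1; }
#   smoke() { "$out/clang" -dM -E -x c /dev/null >/dev/null 2>&1; }
#   try_flags build smoke --frame-opt=hot --icf=1 --inline-all --plt=hot
```

A flag printed by `try_flags` is only a lead: the crash may also depend on an interaction between passes that single-flag runs do not exercise.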

8. Merge lld fdata and optimize lld

    $ cd "$BOLT_PROFILES_LLD"
    $ "$STAGE1"/merge-fdata -q ./*.fdata >combined.fdata

    Using legacy profile format.
    Merging data from ./lld.fdata.438648.fdata...
    Merging data from ./lld.fdata.439049.fdata...
    Merging data from ./lld.fdata.439231.fdata...
    ...
    Merging data from ./lld.fdata.787498.fdata...
    Merging data from ./lld.fdata.788215.fdata...
    Merging data from ./lld.fdata.788300.fdata...
    Profile from 1430 files merged.

    $ "$STAGE1"/llvm-bolt "${STAGE3}/lld.org" \
        --data "${BOLT_PROFILES}/combined.fdata" \
        -o "${STAGE3}/lld.bolt" \
        --dyno-stats \
        --eliminate-unreachable \
        --frame-opt=hot \
        --icf=1 \
        --indirect-call-promotion=all \
        --inline-all \
        --inline-ap \
        --jump-tables=aggressive \
        --peepholes=all \
        --plt=hot \
        --reorder-blocks=ext-tsp \
        --reorder-functions-use-hot-size \
        --reorder-functions=hfsort+ \
        --split-all-cold \
        --split-eh \
        --split-functions \
        --tail-duplication=cache \
        --thread-count="$(nproc --all)" \
        --use-gnu-stack

    BOLT-INFO: shared object or position-independent executable detected
    BOLT-INFO: Target architecture: x86_64
    BOLT-INFO: BOLT version: 10cdad4065d7d3b53be3e0f03a2d71951c2bacd6
    BOLT-INFO: first alloc address is 0x0
    BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
    BOLT-INFO: enabling relocation mode
    BOLT-INFO: enabling lite mode
    BOLT-WARNING: Failed to analyze 1604 relocations
    BOLT-INFO: pre-processing profile using branch profile reader
    BOLT-INFO: 2460 out of 83941 functions in the binary (2.9%) have non-empty execution profile
    BOLT-INFO: 7 functions with profile could not be optimized
    BOLT-INFO: profile for 1 objects was ignored
    BOLT-INFO: the input contains 299 (dynamic count : 76750278) opportunities for macro-fusion optimization. Will fix instances on a hot path.
    BOLT-WARNING: 3 (0.1% of all profiled) functions have invalid (possibly stale) profile. Use -report-stale to see the list.
    BOLT-WARNING: 61853 out of 12680030856 samples in the binary (0.0%) belong to functions with invalid (possibly stale) profile.
    BOLT-INFO: 21289 instructions were shortened
    BOLT-INFO: removed 42 empty blocks
    BOLT-INFO: ICF folded 546 out of 84283 functions in 4 passes. 16 functions had jump tables.
    BOLT-INFO: Removing all identical functions will save 113.47 KB of code space. Folded functions were called 3019425 times based on profile.
    BOLT-INFO: ICP Total indirect calls = 0, 0 callsites cover 99% of all indirect calls
    BOLT-INFO: ICP total indirect callsites with profile = 0
    BOLT-INFO: ICP total jump table callsites = 0
    BOLT-INFO: ICP total number of calls = 0
    BOLT-INFO: ICP percentage of calls that are indirect = -nan%
    BOLT-INFO: ICP percentage of indirect calls that can be optimized = 0.0%
    BOLT-INFO: ICP percentage of indirect callsites that are optimized = 0.0%
    BOLT-INFO: ICP number of method load elimination candidates = 0
    BOLT-INFO: ICP percentage of method calls candidates that have loads eliminated = 0.0%
    BOLT-INFO: ICP percentage of indirect branches that are optimized = 0.0%
    BOLT-INFO: ICP percentage of jump table callsites that are optimized = 0.0%
    BOLT-INFO: ICP number of jump table callsites that can use hot indices = 0
    BOLT-INFO: ICP percentage of jump table callsites that use hot indices = 0.0%
    BOLT-INFO: inlined 1261411 calls at 187 call sites in 3 iteration(s). Change in binary size: 52753 bytes.
    BOLT-INFO: 4120 PLT calls in the binary were optimized.
    BOLT-INFO: basic block reordering modified layout of 596 functions (24.23% of profiled, 0.71% of total)
    BOLT-INFO: UCE removed 8 blocks and 134 bytes of code.
    BOLT-INFO: splitting separates 805853 hot bytes from 570071 cold bytes (58.57% of split functions is hot).
    BOLT-INFO: 15 Functions were reordered by LoopInversionPass
    BOLT-INFO: tail duplication modified 79 (0.09%) functions; duplicated 103 blocks (1231 bytes) responsible for 6373412 dynamic executions (0.05% of all block executions)
    BOLT-INFO: hfsort+ reduced the number of chains from 1921 to 884
    BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

      5754874521 : executed forward branches
       587846482 : taken forward branches
      4554252855 : executed backward branches
      2472140715 : taken backward branches
       302702619 : executed unconditional branches
       529677146 : all function calls
       284177322 : indirect calls
       194011090 : PLT calls
     70984080533 : executed instructions
     14605957208 : executed load instructions
      3422235271 : executed store instructions
       124639302 : taken jump table branches
               0 : taken unknown indirect branches
     10611829995 : total branches
      3362689816 : taken branches
      7249140179 : non-taken conditional branches
      3059987197 : taken conditional branches
     10309127376 : all conditional branches

      3830096828 : executed forward branches (-33.4%)
       119124972 : taken forward branches (-79.7%)
      6479030473 : executed backward branches (+42.3%)
      2465978700 : taken backward branches (-0.2%)
       169589420 : executed unconditional branches (-44.0%)
       334404676 : all function calls (-36.9%)
       284177322 : indirect calls (=)
               0 : PLT calls (-100.0%)
     70386024690 : executed instructions (-0.8%)
     14605957185 : executed load instructions (+0.0%)
      3422235261 : executed store instructions (-0.0%)
       124639302 : taken jump table branches (=)
               0 : taken unknown indirect branches (=)
     10478716721 : total branches (-1.3%)
      2754693092 : taken branches (-18.1%)
      7724023629 : non-taken conditional branches (+6.6%)
      2585103672 : taken conditional branches (-15.5%)
     10309127301 : all conditional branches (+0.0%)

    BOLT-INFO: SCTC: patched 16 tail calls (16 forward) tail calls (0 backward) from a total of 16 while removing 1 double jumps and removing 15 basic blocks totalling 75 bytes of code. CTCs total execution count is 21455 and the number of times CTCs are taken is 2.
    BOLT-INFO: Peephole: 13 double jumps patched.
    BOLT-INFO: Peephole: 222 tail call traps inserted.
    BOLT-INFO: Peephole: 0 useless conditional branches removed.
    BOLT-INFO: FOP optimized 8 redundant load(s) and 0 unused store(s)
    BOLT-INFO: Frequency of redundant loads is 78455 and frequency of unused stores is 0
    BOLT-INFO: Frequency of loads changed to use a register is 78455 and frequency of loads changed to use an immediate is 0
    BOLT-INFO: FOP deleted 5 load(s) (dyn count: 0) and 0 store(s)
    BOLT-INFO: FRAME ANALYSIS: 81823 function(s) were not optimized.
    BOLT-INFO: FRAME ANALYSIS: 990 function(s) (91.4% dyn cov) could not have its frame indices restored.
    BOLT-INFO: Shrink wrapping moved 24 spills inserting load/stores and 0 spills inserting push/pops
    BOLT-INFO: Shrink wrapping reduced 13594573 store executions (0.0% total instructions executed, 0.4% store instructions)
    BOLT-INFO: Shrink wrapping failed at reducing 0 store executions (0.0% total instructions executed, 0.0% store instructions)
    BOLT-INFO: Allocation combiner: 26 empty spaces coalesced (dyn count: 7768804).
    BOLT-INFO: padding code to 0x4200000 to accommodate hot text
    BOLT-INFO: setting __hot_start to 0x4000000
    BOLT-INFO: setting __hot_end to 0x40e914c

    $ rm -rf "${STAGE3}/lld"
    $ mv "${STAGE3}/lld.bolt" "${STAGE3}/lld"
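
Note that the lld optimisation command in this step passes `--data "${BOLT_PROFILES}/combined.fdata"`, while the lld profiles were merged in `$BOLT_PROFILES_LLD`. Whether this is a slip in the post or in the script itself cannot be told from the logs. One way to rule out such mix-ups is to derive the profile path from the binary name, as in the sketch below; `profile_for` is a hypothetical helper, and its name patterns are assumptions based on the file names used in this report.

```bash
# Hypothetical helper: print the merged BOLT profile that belongs to the
# binary given as $1, based on its file name.
profile_for() {
    local name
    name=$(basename "$1")
    case "$name" in
        clang-*)   printf '%s\n' "${BOLT_PROFILES}/combined.fdata" ;;
        lld|lld.*) printf '%s\n' "${BOLT_PROFILES_LLD}/combined.fdata" ;;
        *)         echo "no BOLT profile known for $name" >&2; return 1 ;;
    esac
}

# Possible use:
#   "$STAGE1"/llvm-bolt "${STAGE3}/lld.org" --data "$(profile_for "${STAGE3}/lld.org")" ...
```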

9. Compile Linux kernel using stage 3 BOLT-ed binaries

$ make distclean defconfig all -sj"$(nproc --all)" LLVM=1 LLVM_IAS=1

PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:

  1. Program arguments: clang -dM -E -x c /dev/null

    0 0x0000562703dc4d62 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x3dc4d62)

    1 0x0000562703dc4a6a (/home/ubuntu/clang-build/neutron/bin/clang-16+0x3dc4a6a)

    2 0x0000562703d92f15 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x3d92f15)

    3 0x0000562703d92dde (/home/ubuntu/clang-build/neutron/bin/clang-16+0x3d92dde)

    4 0x00007f8366843a00 (/home/ubuntu/glibc/usr/lib/libc.so.6+0x38a00)

    5 0x00007f836696fd0d (/home/ubuntu/glibc/usr/lib/libc.so.6+0x164d0d)

    6 0x0000562705d92183 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x5d92183)

    7 0x000056270618890b (/home/ubuntu/clang-build/neutron/bin/clang-16+0x618890b)

    8 0x0000562706188eaa (/home/ubuntu/clang-build/neutron/bin/clang-16+0x6188eaa)

    9 0x0000562706839d5f (/home/ubuntu/clang-build/neutron/bin/clang-16+0x6839d5f)

    10 0x0000562706839895 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x6839895)

    11 0x00005627064a1166 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x64a1166)

    12 0x000056270649f5f7 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x649f5f7)

    13 0x000056270642e292 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x642e292)

    14 0x0000562706728f40 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x6728f40)

    15 0x0000562706728ded (/home/ubuntu/clang-build/neutron/bin/clang-16+0x6728ded)

    16 0x00005627065f8c93 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x65f8c93)

    17 0x00005627065f8392 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x65f8392)

    18 0x00007f836682e290 (/home/ubuntu/glibc/usr/lib/libc.so.6+0x23290)

    19 0x00007f836682e34a __libc_start_main (/home/ubuntu/glibc/usr/lib/libc.so.6+0x2334a)

    20 0x00005627064b9d25 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x64b9d25)

    clang-16: error: clang frontend command failed with exit code 139 (use -v to see invocation)
    Neutron clang version 16.0.0 (https://github.com/llvm/llvm-project.git 10cdad4065d7d3b53be3e0f03a2d71951c2bacd6)
    Target: x86_64-unknown-linux-gnu
    Thread model: posix
    InstalledDir: /home/ubuntu/clang-build/neutron/bin
    clang-16: error: unable to execute command: Segmentation fault (core dumped)
    clang-16: note: diagnostic msg: Error generating preprocessed source(s).
    PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
    Stack dump:

  2. Program arguments: clang -E -x c -

    0 0x0000565384bc4d62 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x3dc4d62)

    1 0x0000565384bc4a6a (/home/ubuntu/clang-build/neutron/bin/clang-16+0x3dc4a6a)

    2 0x0000565384b92f15 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x3d92f15)

    3 0x0000565384b92dde (/home/ubuntu/clang-build/neutron/bin/clang-16+0x3d92dde)

    4 0x00007f2cb1c43a00 (/home/ubuntu/glibc/usr/lib/libc.so.6+0x38a00)

    5 0x00007f2cb1d6fd0d (/home/ubuntu/glibc/usr/lib/libc.so.6+0x164d0d)

    6 0x0000565386b92183 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x5d92183)

    7 0x0000565386f8890b (/home/ubuntu/clang-build/neutron/bin/clang-16+0x618890b)

    8 0x0000565386f88eaa (/home/ubuntu/clang-build/neutron/bin/clang-16+0x6188eaa)

    9 0x0000565387639d5f (/home/ubuntu/clang-build/neutron/bin/clang-16+0x6839d5f)

    10 0x0000565387639895 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x6839895)

    11 0x00005653872a1166 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x64a1166)

    12 0x000056538729f5f7 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x649f5f7)

    13 0x000056538722e292 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x642e292)

    14 0x0000565387528f40 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x6728f40)

    15 0x0000565387528ded (/home/ubuntu/clang-build/neutron/bin/clang-16+0x6728ded)

    16 0x00005653873f8c93 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x65f8c93)

    17 0x00005653873f8392 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x65f8392)

    18 0x00007f2cb1c2e290 (/home/ubuntu/glibc/usr/lib/libc.so.6+0x23290)

    19 0x00007f2cb1c2e34a __libc_start_main (/home/ubuntu/glibc/usr/lib/libc.so.6+0x2334a)

    20 0x00005653872b9d25 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x64b9d25)

    clang-16: error: clang frontend command failed with exit code 139 (use -v to see invocation)
    Neutron clang version 16.0.0 (https://github.com/llvm/llvm-project.git 10cdad4065d7d3b53be3e0f03a2d71951c2bacd6)
    Target: x86_64-unknown-linux-gnu
    Thread model: posix
    InstalledDir: /home/ubuntu/clang-build/neutron/bin
    clang-16: note: diagnostic msg: Error generating preprocessed source(s) - ignoring input from stdin.
    clang-16: note: diagnostic msg: Error generating preprocessed source(s) - no preprocessable inputs.
    HOSTCC scripts/basic/fixdep
    PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script.
    Stack dump:

  3. Program arguments: /home/ubuntu/clang-build/neutron/bin/clang-16 -cc1 -triple x86_64-unknown-linux-gnu -emit-obj -disable-free -clear-ast-before-backend -disable-llvm-verifier -discard-value-names -main-file-name fixdep.c -mrelocation-model pic -pic-level 2 -pic-is-pie -mframe-pointer=none -fmath-errno -ffp-contract=on -fno-rounding-math -mconstructor-aliases -funwind-tables=2 -target-cpu x86-64 -tune-cpu generic -mllvm -treat-scalable-fixed-error-as-warning -debugger-tuning=gdb -fcoverage-compilation-dir=/home/ubuntu/clang-build/linux-6.1.6 -resource-dir /home/ubuntu/clang-build/neutron/lib/clang/16 -dependency-file scripts/basic/.fixdep.d -MT scripts/basic/fixdep -internal-isystem /home/ubuntu/clang-build/neutron/lib/clang/16/include -internal-isystem /usr/local/include -internal-isystem /usr/lib/gcc/x86_64-linux-gnu/11/../../../../x86_64-linux-gnu/include -internal-externc-isystem /usr/include/x86_64-linux-gnu -internal-externc-isystem /include -internal-externc-isystem /usr/include -O2 -Wall -Wmissing-prototypes -Wstrict-prototypes -Wdeclaration-after-statement -std=gnu11 -fdebug-compilation-dir=/home/ubuntu/clang-build/linux-6.1.6 -ferror-limit 19 -fgnuc-version=4.2.1 -fcolor-diagnostics -vectorize-loops -vectorize-slp -faddrsig -D__GCC_HAVE_DWARF2_CFI_ASM=1 -o /tmp/fixdep-c24c46.o -x c scripts/basic/fixdep.c

    0 0x0000558171dc4d62 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x3dc4d62)

    1 0x000055817150641a (/home/ubuntu/clang-build/neutron/bin/clang-16+0x350641a)

    2 0x00007fbc17356a00 (/home/ubuntu/glibc/usr/lib/libc.so.6+0x38a00)

    3 0x00007fbc17482d0d (/home/ubuntu/glibc/usr/lib/libc.so.6+0x164d0d)

    4 0x0000558173d92183 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x5d92183)

    5 0x000055817418890b (/home/ubuntu/clang-build/neutron/bin/clang-16+0x618890b)

    6 0x0000558174188eaa (/home/ubuntu/clang-build/neutron/bin/clang-16+0x6188eaa)

    7 0x0000558174839d5f (/home/ubuntu/clang-build/neutron/bin/clang-16+0x6839d5f)

    8 0x0000558174839895 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x6839895)

    9 0x00005581744a1166 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x64a1166)

    10 0x000055817449f5f7 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x649f5f7)

    11 0x00005581745f8664 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x65f8664)

    12 0x00007fbc17341290 (/home/ubuntu/glibc/usr/lib/libc.so.6+0x23290)

    13 0x00007fbc1734134a __libc_start_main (/home/ubuntu/glibc/usr/lib/libc.so.6+0x2334a)

    14 0x00005581744b9d25 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x64b9d25)

    clang-16: error: unable to execute command: Segmentation fault (core dumped)
    clang-16: error: clang frontend command failed due to signal (use -v to see invocation)
    Neutron clang version 16.0.0 (https://github.com/llvm/llvm-project.git 10cdad4065d7d3b53be3e0f03a2d71951c2bacd6)
    Target: x86_64-unknown-linux-gnu
    Thread model: posix
    InstalledDir: /home/ubuntu/clang-build/neutron/bin
    clang-16: error: unable to execute command: Segmentation fault (core dumped)
    clang-16: note: diagnostic msg: Error generating preprocessed source(s).
    make[2]: *** [scripts/Makefile.host:111: scripts/basic/fixdep] Error 1
    make[1]: *** [Makefile:633: scripts_basic] Error 2
    make: *** [Makefile:362: __build_one_by_one] Error 2

I also tried vanilla clang with the BOLT-optimized lld:

  HOSTCC  scripts/basic/fixdep
  HOSTCC  scripts/kconfig/conf.o
  HOSTCC  scripts/kconfig/confdata.o
  HOSTCC  scripts/kconfig/expr.o
  LEX     scripts/kconfig/lexer.lex.c
  YACC    scripts/kconfig/parser.tab.[ch]
  HOSTCC  scripts/kconfig/menu.o
  HOSTCC  scripts/kconfig/preprocess.o
  HOSTCC  scripts/kconfig/symbol.o
  HOSTCC  scripts/kconfig/util.o
  HOSTCC  scripts/kconfig/lexer.lex.o
  HOSTCC  scripts/kconfig/parser.tab.o
  HOSTLD  scripts/kconfig/conf
*** Default configuration is based on 'x86_64_defconfig'
#
# configuration written to .config
#
  SYSHDR  arch/x86/include/generated/uapi/asm/unistd_32.h
  SYSHDR  arch/x86/include/generated/uapi/asm/unistd_64.h
  SYSHDR  arch/x86/include/generated/uapi/asm/unistd_x32.h
...
  AR      drivers/acpi/pmic/built-in.a
  CC      drivers/acpi/dptf/int340x_thermal.o
free(): double free detected in tcache 2
  CC      security/selinux/ss/symtab.o
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.  Program arguments: ld.lld -m elf_x86_64 -z noexecstack -m elf_i386 --emit-relocs -T arch/x86/realmode/rm/realmode.lds arch/x86/realmode/rm/header.o arch/x86/realmode/rm/trampoline_64.o arch/x86/realmode/rm/stack.o arch/x86/realmode/rm/reboot.o arch/x86/realmode/rm/wakeup_asm.o arch/x86/realmode/rm/wakemain.o arch/x86/realmode/rm/video-mode.o arch/x86/realmode/rm/copy.o arch/x86/realmode/rm/bioscall.o arch/x86/realmode/rm/regs.o arch/x86/realmode/rm/video-vga.o arch/x86/realmode/rm/video-vesa.o arch/x86/realmode/rm/video-bios.o -o arch/x86/realmode/rm/realmode.elf
  CC      sound/pci/hda/hda_controller.o
  AR      fs/devpts/built-in.a
  CC      drivers/acpi/acpica/dsdebug.o
  CC      sound/pci/hda/hda_proc.o
 #0 0x000055e530d2d5e7 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/ubuntu/clang-build/neutron/bin/lld+0x272d5e7)
 #1 0x000055e530c01b3a (/home/ubuntu/clang-build/neutron/bin/lld+0x2601b3a)
 #2 0x00007fb55ca43a00 (/home/ubuntu/glibc/usr/lib/libc.so.6+0x38a00)
 #3 0x00007fb55ca9364c (/home/ubuntu/glibc/usr/lib/libc.so.6+0x8864c)
 #4 0x00007fb55ca43958 raise (/home/ubuntu/glibc/usr/lib/libc.so.6+0x38958)
 #5 0x00007fb55ca2d53d abort (/home/ubuntu/glibc/usr/lib/libc.so.6+0x2253d)
 #6 0x00007fb55ca877ee (/home/ubuntu/glibc/usr/lib/libc.so.6+0x7c7ee)
 #7 0x00007fb55ca9d3dc (/home/ubuntu/glibc/usr/lib/libc.so.6+0x923dc)
 #8 0x00007fb55ca9f737 (/home/ubuntu/glibc/usr/lib/libc.so.6+0x94737)
 #9 0x00007fb55caa1ba3 cfree (/home/ubuntu/glibc/usr/lib/libc.so.6+0x96ba3)
#10 0x000055e5327669a1 (/home/ubuntu/clang-build/neutron/bin/lld+0x41669a1)
#11 0x000055e5326118a0 lld::elf::LinkerDriver::link(llvm::opt::InputArgList&) (/home/ubuntu/clang-build/neutron/bin/lld+0x40118a0)
#12 0x000055e5326277c9 lld::elf::LinkerDriver::linkerMain(llvm::ArrayRef<char const*>) (/home/ubuntu/clang-build/neutron/bin/lld+0x40277c9)
#13 0x000055e5326bb434 lld::elf::link(llvm::ArrayRef<char const*>, llvm::raw_ostream&, llvm::raw_ostream&, bool, bool) (/home/ubuntu/clang-build/neutron/bin/lld+0x40bb434)
#14 0x000055e53264129e (/home/ubuntu/clang-build/neutron/bin/lld+0x404129e)
#15 0x000055e53272f563 (/home/ubuntu/clang-build/neutron/bin/lld+0x412f563)
#16 0x00007fb55ca2e290 (/home/ubuntu/glibc/usr/lib/libc.so.6+0x23290)
#17 0x00007fb55ca2e34a __libc_start_main (/home/ubuntu/glibc/usr/lib/libc.so.6+0x2334a)
#18 0x000055e532624b65 _start (/home/ubuntu/clang-build/neutron/bin/lld+0x4024b65)
  AS      arch/x86/lib/memset_64.o
  CC      drivers/video/fbdev/core/fb_cmdline.o
...
  CC      drivers/acpi/acpica/dsfield.o
Aborted (core dumped)
make[5]: *** [arch/x86/realmode/rm/Makefile:58: arch/x86/realmode/rm/realmode.elf] Error 134
make[4]: *** [arch/x86/realmode/Makefile:22: arch/x86/realmode/rm/realmode.bin] Error 2
make[3]: *** [scripts/Makefile.build:500: arch/x86/realmode] Error 2
make[3]: *** Waiting for unfinished jobs...

Complete script: https://github.com/Neutron-Toolchains/clang-build/blob/main/build_clang.sh
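
Most frames in the stack dumps above are unsymbolized and only carry a `binary+0xoffset` pair. As a minimal sketch (the helper name `frame_offsets` and the regular expression are illustrative, not part of any LLVM tool), the pairs can be pulled out of a pasted dump so that each one can be handed to a symbolizer. Whether symbolization then yields useful names depends on the symbols left in the BOLTed binaries, which is an assumption here.

```python
import re

# Matches the "(<binary path>+0x<offset>)" suffix of each stack-dump frame.
FRAME_RE = re.compile(r"\((\S+)\+0x([0-9a-fA-F]+)\)")


def frame_offsets(dump: str) -> list[tuple[str, int]]:
    """Return (binary, offset) for every frame found in a stack dump."""
    return [(m.group(1), int(m.group(2), 16)) for m in FRAME_RE.finditer(dump)]


# Two frames copied from the fixdep crash above.
sample = """\
3 0x00007fbc17482d0d (/home/ubuntu/glibc/usr/lib/libc.so.6+0x164d0d)
4 0x0000558173d92183 (/home/ubuntu/clang-build/neutron/bin/clang-16+0x5d92183)
"""

for binary, offset in frame_offsets(sample):
    # Each pair could then be passed to a symbolizer, for example:
    #   llvm-symbolizer --obj=<binary> <offset>
    print(f"{binary} {offset:#x}")
```

Grouping the offsets by binary makes it easier to see that the faulting frames sit in libc (frames 2 and 3 of the fixdep crash) and are reached from clang-16 code.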

llvmbot commented 1 year ago

@llvm/issue-subscribers-bolt

dakkshesh07 commented 1 year ago

Removing `--tail-duplication=cache` from the BOLT flags appears to resolve the segfault for both clang and lld. I've included the llvm-bolt and kernel compilation output below:

  1. clang:
    
    $ "$STAGE1"/llvm-bolt "${STAGE3}/${CLANG_SUFFIX}.org" \
            --data "${BOLT_PROFILES}/combined.fdata" \
            -o "${STAGE3}/${CLANG_SUFFIX}.bolt" \
            --dyno-stats \
            --eliminate-unreachable \
            --frame-opt=hot \
            --icf=1 \
            --indirect-call-promotion=all \
            --inline-all \
            --inline-ap \
            --jump-tables=aggressive \
            --peepholes=all \
            --plt=hot \
            --reorder-blocks=ext-tsp \
            --reorder-functions-use-hot-size \
            --reorder-functions=hfsort+ \
            --split-all-cold \
            --split-eh \
            --split-functions \
            --thread-count="$(nproc --all)" \
            --use-gnu-stack

BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: a518425ddc2ae6c627cd3d99b085063c9a791f34
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: Failed to analyze 1111 relocations
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-INFO: 17423 out of 106191 functions in the binary (16.4%) have non-empty execution profile
BOLT-INFO: 299 functions with profile could not be optimized
BOLT-WARNING: 10 (0.1% of all profiled) functions have invalid (possibly stale) profile. Use -report-stale to see the list.
BOLT-WARNING: 84833 out of 1815214515850 samples in the binary (0.0%) belong to functions with invalid (possibly stale) profile.
BOLT-INFO: profile for 1 objects was ignored
BOLT-INFO: the input contains 8512 (dynamic count : 10549922383) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 684293 instructions were shortened
BOLT-INFO: removed 831 empty blocks
BOLT-INFO: ICF folded 2188 out of 106468 functions in 5 passes. 6 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 228.09 KB of code space. Folded functions were called 3222672052 times based on profile.
BOLT-INFO: ICP Total indirect calls = 0, 0 callsites cover 99% of all indirect calls
BOLT-INFO: ICP total indirect callsites with profile = 0
BOLT-INFO: ICP total jump table callsites = 0
BOLT-INFO: ICP total number of calls = 0
BOLT-INFO: ICP percentage of calls that are indirect = -nan%
BOLT-INFO: ICP percentage of indirect calls that can be optimized = 0.0%
BOLT-INFO: ICP percentage of indirect callsites that are optimized = 0.0%
BOLT-INFO: ICP number of method load elimination candidates = 0
BOLT-INFO: ICP percentage of method calls candidates that have loads eliminated = 0.0%
BOLT-INFO: ICP percentage of indirect branches that are optimized = 0.0%
BOLT-INFO: ICP percentage of jump table callsites that are optimized = 0.0%
BOLT-INFO: ICP number of jump table callsites that can use hot indices = 0
BOLT-INFO: ICP percentage of jump table callsites that use hot indices = 0.0%
BOLT-INFO: inlined 342712951 calls at 17489 call sites in 3 iteration(s). Change in binary size: 2826244 bytes.
BOLT-INFO: 22549 PLT calls in the binary were optimized.
BOLT-INFO: basic block reordering modified layout of 9485 functions (54.44% of profiled, 9.10% of total)
BOLT-INFO: UCE removed 109 blocks and 126 bytes of code.
BOLT-INFO: splitting separates 11375475 hot bytes from 13318689 cold bytes (46.07% of split functions is hot).
BOLT-INFO: 238 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 15534 to 9137
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

   1132711919183 : executed forward branches
    146641320741 : taken forward branches
    233183325443 : executed backward branches
    139369228904 : taken backward branches
     61919616927 : executed unconditional branches
     65280779088 : all function calls
     23870694670 : indirect calls
     11585683218 : PLT calls
   8636967142196 : executed instructions
   2042350553565 : executed load instructions
    940027921335 : executed store instructions
     23023985070 : taken jump table branches
               0 : taken unknown indirect branches
   1427814861553 : total branches
    347930166572 : taken branches
   1079884694981 : non-taken conditional branches
    286010549645 : taken conditional branches
   1365895244626 : all conditional branches

   1083149672002 : executed forward branches (-4.4%)
     65655962029 : taken forward branches (-55.2%)
    282730249318 : executed backward branches (+21.2%)
    132035534661 : taken backward branches (-5.3%)
     38825992478 : executed unconditional branches (-37.3%)
     53356151273 : all function calls (-18.3%)
     23870694650 : indirect calls (+0.0%)
               0 : PLT calls (-100.0%)
   8561887431021 : executed instructions (-0.9%)
   2042335402742 : executed load instructions (-0.0%)
    940026242431 : executed store instructions (-0.0%)
     23023985070 : taken jump table branches (=)
               0 : taken unknown indirect branches (=)
   1404705913798 : total branches (-1.6%)
    236517489168 : taken branches (-32.0%)
   1168188424630 : non-taken conditional branches (+8.2%)
    197691496690 : taken conditional branches (-30.9%)
   1365879921320 : all conditional branches (-0.0%)

BOLT-INFO: SCTC: patched 126 tail calls (113 forward) tail calls (13 backward) from a total of 126 while removing 21 double jumps and removing 101 basic blocks totalling 493 bytes of code. CTCs total execution count is 5427712 and the number of times CTCs are taken is 1958348.
BOLT-INFO: Peephole: 17293 double jumps patched.
BOLT-INFO: Peephole: 1146 tail call traps inserted.
BOLT-INFO: Peephole: 0 useless conditional branches removed.
BOLT-INFO: FOP optimized 49 redundant load(s) and 0 unused store(s)
BOLT-INFO: Frequency of redundant loads is 730163822 and frequency of unused stores is 0
BOLT-INFO: Frequency of loads changed to use a register is 730163822 and frequency of loads changed to use an immediate is 0
BOLT-INFO: FOP deleted 28 load(s) (dyn count: 261658016) and 0 store(s)
BOLT-INFO: FRAME ANALYSIS: 89045 function(s) were not optimized.
BOLT-INFO: FRAME ANALYSIS: 7906 function(s) (92.4% dyn cov) could not have its frame indices restored.
BOLT-INFO: Shrink wrapping moved 180 spills inserting load/stores and 32 spills inserting push/pops
BOLT-INFO: Shrink wrapping reduced 6518071510 store executions (0.1% total instructions executed, 0.7% store instructions)
BOLT-INFO: Shrink wrapping failed at reducing 0 store executions (0.0% total instructions executed, 0.0% store instructions)
BOLT-INFO: Allocation combiner: 234 empty spaces coalesced (dyn count: 6370053754).
BOLT-INFO: setting __hot_start to 0x5c00000
BOLT-INFO: setting __hot_end to 0x6afd7a7


  2. lld:

    $ "$STAGE1"/llvm-bolt "${STAGE3}/lld.org" \
            --data "${BOLT_PROFILES}/combined.fdata" \
            -o "${STAGE3}/lld.bolt" \
            --dyno-stats \
            --eliminate-unreachable \
            --frame-opt=hot \
            --icf=1 \
            --indirect-call-promotion=all \
            --inline-all \
            --inline-ap \
            --jump-tables=aggressive \
            --peepholes=all \
            --plt=hot \
            --reorder-blocks=ext-tsp \
            --reorder-functions-use-hot-size \
            --reorder-functions=hfsort+ \
            --split-all-cold \
            --split-eh \
            --split-functions \
            --thread-count="$(nproc --all)" \
            --use-gnu-stack

BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: a518425ddc2ae6c627cd3d99b085063c9a791f34
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: Failed to analyze 1602 relocations
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-INFO: 2365 out of 83829 functions in the binary (2.8%) have non-empty execution profile
BOLT-INFO: 7 functions with profile could not be optimized
BOLT-WARNING: 3 (0.1% of all profiled) functions have invalid (possibly stale) profile. Use -report-stale to see the list.
BOLT-WARNING: 40101 out of 10963221173 samples in the binary (0.0%) belong to functions with invalid (possibly stale) profile.
BOLT-INFO: profile for 1 objects was ignored
BOLT-INFO: the input contains 249 (dynamic count : 65290823) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 20462 instructions were shortened
BOLT-INFO: removed 40 empty blocks
BOLT-INFO: ICF folded 541 out of 84172 functions in 4 passes. 14 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 113.03 KB of code space. Folded functions were called 2929068 times based on profile.
BOLT-INFO: ICP Total indirect calls = 0, 0 callsites cover 99% of all indirect calls
BOLT-INFO: ICP total indirect callsites with profile = 0
BOLT-INFO: ICP total jump table callsites = 0
BOLT-INFO: ICP total number of calls = 0
BOLT-INFO: ICP percentage of calls that are indirect = -nan%
BOLT-INFO: ICP percentage of indirect calls that can be optimized = 0.0%
BOLT-INFO: ICP percentage of indirect callsites that are optimized = 0.0%
BOLT-INFO: ICP number of method load elimination candidates = 0
BOLT-INFO: ICP percentage of method calls candidates that have loads eliminated = 0.0%
BOLT-INFO: ICP percentage of indirect branches that are optimized = 0.0%
BOLT-INFO: ICP percentage of jump table callsites that are optimized = 0.0%
BOLT-INFO: ICP number of jump table callsites that can use hot indices = 0
BOLT-INFO: ICP percentage of jump table callsites that use hot indices = 0.0%
BOLT-INFO: inlined 510164 calls at 159 call sites in 2 iteration(s). Change in binary size: 44489 bytes.
BOLT-INFO: 4052 PLT calls in the binary were optimized.
BOLT-INFO: basic block reordering modified layout of 555 functions (23.47% of profiled, 0.66% of total)
BOLT-INFO: UCE removed 5 blocks and 162 bytes of code.
BOLT-INFO: splitting separates 786253 hot bytes from 550330 cold bytes (58.83% of split functions is hot).
BOLT-INFO: 13 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 1831 to 915
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

      4889140867 : executed forward branches
       528233056 : taken forward branches
      4055672976 : executed backward branches
      2175298219 : taken backward branches
       291508038 : executed unconditional branches
       481170769 : all function calls
       260705669 : indirect calls
       179518909 : PLT calls
     61695206397 : executed instructions
     12535070163 : executed load instructions
      2834782985 : executed store instructions
       113913041 : taken jump table branches
               0 : taken unknown indirect branches
      9236321881 : total branches
      2995039313 : taken branches
      6241282568 : non-taken conditional branches
      2703531275 : taken conditional branches
      8944813843 : all conditional branches

      3286549655 : executed forward branches (-32.8%)
        80716133 : taken forward branches (-84.7%)
      5658264178 : executed backward branches (+39.5%)
      2176420576 : taken backward branches (+0.1%)
       155348305 : executed unconditional branches (-46.7%)
       301141727 : all function calls (-37.4%)
       260705669 : indirect calls (=)
               0 : PLT calls (-100.0%)
     61150827226 : executed instructions (-0.9%)
     12535070220 : executed load instructions (+0.0%)
      2834782977 : executed store instructions (+0.0%)
       113913041 : taken jump table branches (=)
               0 : taken unknown indirect branches (=)
      9100162138 : total branches (-1.5%)
      2412485014 : taken branches (-19.5%)
      6687677124 : non-taken conditional branches (+7.2%)
      2257136709 : taken conditional branches (-16.5%)
      8944813833 : all conditional branches (+0.0%)

BOLT-INFO: SCTC: patched 11 tail calls (11 forward) tail calls (0 backward) from a total of 11 while removing 0 double jumps and removing 11 basic blocks totalling 55 bytes of code. CTCs total execution count is 25074 and the number of times CTCs are taken is 0.
BOLT-INFO: Peephole: 6 double jumps patched.
BOLT-INFO: Peephole: 221 tail call traps inserted.
BOLT-INFO: Peephole: 0 useless conditional branches removed.
BOLT-INFO: FOP optimized 30 redundant load(s) and 0 unused store(s)
BOLT-INFO: Frequency of redundant loads is 77478 and frequency of unused stores is 0
BOLT-INFO: Frequency of loads changed to use a register is 77478 and frequency of loads changed to use an immediate is 0
BOLT-INFO: FOP deleted 29 load(s) (dyn count: 44) and 0 store(s)
BOLT-INFO: FRAME ANALYSIS: 81807 function(s) were not optimized.
BOLT-INFO: FRAME ANALYSIS: 947 function(s) (94.4% dyn cov) could not have its frame indices restored.
BOLT-INFO: Shrink wrapping moved 22 spills inserting load/stores and 0 spills inserting push/pops
BOLT-INFO: Shrink wrapping reduced 12285829 store executions (0.0% total instructions executed, 0.4% store instructions)
BOLT-INFO: Shrink wrapping failed at reducing 0 store executions (0.0% total instructions executed, 0.0% store instructions)
BOLT-INFO: Allocation combiner: 20 empty spaces coalesced (dyn count: 7116483).
BOLT-INFO: padding code to 0x4200000 to accommodate hot text
BOLT-INFO: setting __hot_start to 0x4000000
BOLT-INFO: setting __hot_end to 0x40e0719


3. Linux kernel compilation using the BOLT-ed binaries:

$ time make distclean defconfig all -j$(nproc --all) LLVM=1 LLVM_IAS=1

  HOSTCC  scripts/basic/fixdep
  HOSTCC  scripts/kconfig/conf.o
  HOSTCC  scripts/kconfig/confdata.o
  HOSTCC  scripts/kconfig/expr.o
  LEX     scripts/kconfig/lexer.lex.c
  YACC    scripts/kconfig/parser.tab.[ch]
  HOSTCC  scripts/kconfig/menu.o
  HOSTCC  scripts/kconfig/preprocess.o
  HOSTCC  scripts/kconfig/symbol.o
  HOSTCC  scripts/kconfig/util.o
  HOSTCC  scripts/kconfig/lexer.lex.o
  HOSTCC  scripts/kconfig/parser.tab.o
  HOSTLD  scripts/kconfig/conf
*** Default configuration is based on 'x86_64_defconfig'
#
# configuration written to .config
#
  SYSHDR  arch/x86/include/generated/uapi/asm/unistd_32.h
  SYSHDR  arch/x86/include/generated/uapi/asm/unistd_64.h
  SYSHDR  arch/x86/include/generated/uapi/asm/unistd_x32.h
  SYSTBL  arch/x86/include/generated/asm/syscalls_32.h
...
  AS      arch/x86/boot/compressed/piggy.o
  LD      arch/x86/boot/compressed/vmlinux
  ZOFFSET arch/x86/boot/zoffset.h
  OBJCOPY arch/x86/boot/vmlinux.bin
  AS      arch/x86/boot/header.o
  LD      arch/x86/boot/setup.elf
  OBJCOPY arch/x86/boot/setup.bin
  BUILD   arch/x86/boot/bzImage
Kernel: arch/x86/boot/bzImage is ready (#1)

real 0m51.930s
user 58m38.684s
sys 5m33.364s



I tried `--tail-duplication=moderate` as well, but the binaries still segfault. I'm not sure why `--tail-duplication` breaks the final binaries; it would be great if someone could point out the cause.

ms178 commented 1 year ago

@dakkshesh07 Just an observation, you use quite a few bolt configuration options. It seems that you hit a bug in BOLT, but I am not qualified to judge that or comment on that further. By the way, in my BOLTed LLVM/Clang I only use these and my last BOLTed Clang-16 revision (8ec0a369675b8406460fac1f94a6f2d84b7c0bb4) is able to compile the Linux Kernel and LLVM/Clang just fine.

Sidenote: Here are my stats, which I got by training on LLVM/Clang itself. They seem to eliminate more taken conditional branches than yours, though your options show advantages in the first three metrics. My CPU is a Xeon E5-2696 V3 (Haswell-EP), which supports LBR sampling.

[image: BOLT_2023-01-04]

Can you tell a bit more about your optimization strategy? I suppose faster kernel compile times are your goal?

dakkshesh07 commented 1 year ago

> @dakkshesh07 Just an observation, you use quite a few bolt configuration options. It seems that you hit a bug in BOLT, but I am not qualified to judge that or comment on that further. By the way, in my BOLTed LLVM/Clang I only use these and my last BOLTed Clang-16 revision (8ec0a36) is able to compile the Linux Kernel and LLVM/Clang just fine.
>
> Sidenote: Here are my stats, which I got by training on LLVM/Clang itself. They seem to eliminate more taken conditional branches than yours, though your options show advantages in the first three metrics. My CPU is a Xeon E5-2696 V3 (Haswell-EP), which supports LBR sampling.
>
> Can you tell a bit more about your optimization strategy? I suppose faster kernel compile times are your goal?

Concerning the flags, I was trying multiple flags to see which combination produced the best results, and that's when I ran into this problem. Some of these flags were taken from the cpython repo, while the rest were taken from the llvm-bolt help list. And yes, my goal is to reduce the compilation time as much as possible.

About your stats, were the profiles generated using instrumented binaries or sampling?

ms178 commented 1 year ago

@dakkshesh07 My PGO profile was gathered with instrumentation (with a training run on LLVM/Clang itself), while the BOLT profile was gathered with LBR sampling. Were your numbers gathered with an AMD CPU? As most AMD CPUs don't support LBR sampling, that would mean less precise BOLT profiles and could be another explanation for your stats. You can have a look at the script I use, which is in the repo under the link that I have given above. I also set some custom compiler flags in the training run to get better profile coverage for the flags that I also use with the Linux kernel.

As I am open to suggestions for improvements, I am also looking forward to advice on which strategies and flags yield the best results. I've taken these BOLT flags from the maintainer of CachyOS, who did his own BOLT experiments, and I only refined the scripts a bit to my liking. From the documentation, I was left with the impression that the taken conditional branches metric at the bottom is the most important one to optimize for the best results, but I might be mistaken here.
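
For comparing that metric between setups, the percentage that `--dyno-stats` prints can be recomputed from the raw counts. The sketch below is only an illustration: `pct_change` is a hypothetical helper, and it assumes that the first dyno-stats table in each log earlier in this thread is the input binary and the second is the optimized one.

```python
def pct_change(before: int, after: int) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# "taken conditional branches" counts copied from the two dyno-stats
# tables of each log (assumed order: input binary, then optimized binary).
taken_cond = {
    "clang": (286010549645, 197691496690),
    "lld": (2703531275, 2257136709),
}

for name, (before, after) in taken_cond.items():
    print(f"{name}: {pct_change(before, after):+.1f}%")
```

Rounded to one decimal place, this reproduces the deltas shown in the logs (-30.9% for clang and -16.5% for lld), so counts taken from different runs can be compared on the same footing.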

aaupov commented 1 year ago

@dakkshesh07 thank you for the report. Tail duplication is an experimental optimization that was added recently. I would recommend disabling it until it's properly tested and recommended for use.

dakkshesh07 commented 1 year ago

> @dakkshesh07 My PGO profile was gathered with instrumentation (with a training run on LLVM/Clang itself), while the BOLT profile was gathered with LBR sampling. Were your numbers gathered with an AMD CPU? As most AMD CPUs don't support LBR sampling, that would mean less precise BOLT profiles and could be another explanation for your stats. You can have a look at the script I use, which is in the repo under the link that I have given above. I also set some custom compiler flags in the training run to get better profile coverage for the flags that I also use with the Linux kernel.
>
> As I am open to suggestions for improvements, I am also looking forward to advice on which strategies and flags yield the best results. I've taken these BOLT flags from the maintainer of CachyOS, who did his own BOLT experiments, and I only refined the scripts a bit to my liking. From the documentation, I was left with the impression that the taken conditional branches metric at the bottom is the most important one to optimize for the best results, but I might be mistaken here.

Your BOLT profiles were acquired by sampling, which might explain the discrepancy in stats. I'll look into your script and try your BOLT configuration with my profiles to compare stats on my end as well, though I am doing more compilation benchmarks than stat comparisons. I'll get back to you as soon as possible.

dakkshesh07 commented 1 year ago

> @dakkshesh07 thank you for the report. Tail duplication is an experimental optimization that was added recently. I would recommend disabling it until it's properly tested and recommended for use.

Ah, I'll disable tail duplication in my script till the issue is resolved.