rui314 / mold

Mold: A Modern Linker 🦠
MIT License
14.17k stars 464 forks source link

Test with BOLT #789

Open rui314 opened 1 year ago

rui314 commented 1 year ago

Facebook/Meta's BOLT is becoming popular. I don't know if BOLT works on mold's output. We need to test it and fix issues if any.

ptr1337 commented 1 year ago

I will test tomorrow to build a clang toolchain with mold as linker, and then instrument and optimize it. I will report it here.

I just tried to BOLT mold itself, compiled it with --emit-relocs and -fno-reorder-blocks-and-partition. When trying to create the instrumented binary, I get a Segmentation fault.

❯ ./bolt-anything.bash
Instrument binary with llvm-bolt
BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 39f01240e76678fa6385d36e7a96670677a26d65
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0x1000000, offset 0x1000000
BOLT-INFO: enabling relocation mode
BOLT-INFO: forcing -jump-tables=move for instrumentation
BOLT-INFO: enabling -align-macro-fusion=all since no profile was specified
BOLT-INFO: enabling lite mode
/home/ptr1337/llvm-bolt-scripts/bolt-anything.bash: line 41: 37593 Segmentation fault      ${BOLTPATH}/llvm-bolt --instrument --instrumentation-file-append-pid --instrumentation-file=${FDATA}/${BINARY}.fdata ${BINARYPATH}/${BINARY} -o ${BOLTBIN}/${BINARY}
Could not create instrumented binary
gcflymoto commented 1 year ago

@rui314 we are also eagerly trying to combine MOLD + CLANG + PGO + LTO + BOLT... it's certainly a journey to get there for big apps.

rui314 commented 1 year ago

@ptr1337 Thanks! I'd start with a "hello world" program though.

rui314 commented 1 year ago

Here is a brain dump of things that might affect the compatibility with BOLT.

rui314 commented 1 year ago

@gcflymoto Is there anything we can help?

ptr1337 commented 1 year ago

@rui314

I think this could help you, I have just tested to bolt clang.

Instrument clang with llvm-bolt
BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 39f01240e76678fa6385d36e7a96670677a26d65
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0x4e00000, offset 0x4e00000
BOLT-INFO: enabling relocation mode
BOLT-INFO: forcing -jump-tables=move for instrumentation
BOLT-INFO: enabling -align-macro-fusion=all since no profile was specified
BOLT-INFO: enabling lite mode
BOLT-WARNING: _etext/1 (0x490955c) does not have any section
BOLT-WARNING: etext/1 (0x490955c) does not have any section
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
/home/ptr1337/llvm-bolt-scripts/build_stage3-bolt-without-sampling.bash: line 19: 69152 Aborted                 ${BOLTPATH}/llvm-bolt --instrument --instrumentation-file-append-pid --instrumentation-file=${TOPLEV}/stage3-without-sampling/intrumentdata
rui314 commented 1 year ago

@ishitatsuyuki Did you want to take this?

ishitatsuyuki commented 1 year ago

@rui314 Yes, I'm tentatively trying to get a Rust CI build to work, which uses BOLT to optimize LLVM (https://github.com/rust-lang/rust/pull/94381). I'll report if it succeeds or not.

ishitatsuyuki commented 1 year ago

OK, it failed:

BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: <unknown>
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0x5000000, offset 0x5000000
BOLT-INFO: enabling relocation mode
BOLT-INFO: forcing -jump-tables=move for instrumentation
BOLT-INFO: enabling -align-macro-fusion=all since no profile was specified
BOLT-INFO: enabling lite mode
BOLT-WARNING: _etext/1 (0x4a4f843) does not have any section
BOLT-WARNING: etext/1 (0x4a4f843) does not have any section
BOLT-WARNING: split function detected on input : _ZNSi6ignoreEl.cold/1. The support is limited in relocation mode.
 #0 0x0000000000958a1f PrintStackTraceSignalHandler(void*) Signals.cpp:0:0
 #1 0x0000000000956814 SignalHandler(int) Signals.cpp:0:0
 #2 0x00007fef06d83db0 __restore_rt (/lib64/libc.so.6+0x59db0)
 #3 0x00007fef06e0ffc0 __memmove_avx_unaligned_erms (/lib64/libc.so.6+0xe5fc0)
 #4 0x000000000099ded9 void std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>::_M_construct<char const*>(char const*, char const*, std::forward_iterator_tag) (.constprop.0) RewriteInstance.cpp:0:0
 #5 0x00000000009d959b llvm::bolt::RewriteInstance::analyzeRelocation(llvm::object::RelocationRef const&, unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>&, bool&, unsigned long&, long&, unsigned long&, bool&) const (/rustroot/bin/llvm-bolt+0x9d959b)
 #6 0x00000000009d9e6b llvm::bolt::RewriteInstance::readRelocations(llvm::object::SectionRef const&) (/rustroot/bin/llvm-bolt+0x9d9e6b)
 #7 0x00000000009dbdcd llvm::bolt::RewriteInstance::processRelocations() (.part.0) RewriteInstance.cpp:0:0
 #8 0x00000000009fa56a llvm::bolt::RewriteInstance::discoverFileObjects() (/rustroot/bin/llvm-bolt+0x9fa56a)
 #9 0x00000000009fbf49 llvm::bolt::RewriteInstance::run() (/rustroot/bin/llvm-bolt+0x9fbf49)
#10 0x0000000000419bab main (/rustroot/bin/llvm-bolt+0x419bab)
#11 0x00007fef06d6ee50 __libc_start_call_main (/lib64/libc.so.6+0x44e50)
#12 0x00007fef06d6eefc __libc_start_main@GLIBC_2.2.5 (/lib64/libc.so.6+0x44efc)
#13 0x000000000047baa5 _start (/rustroot/bin/llvm-bolt+0x47baa5)
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: llvm-bolt -instrument /checkout/obj/build/x86_64-unknown-linux-gnu/llvm/build/lib/libLLVM-15-rust-1.66.0-nightly.so --instrumentation-file-append-pid -o /tmp/instrumented.so

I'll try to gather more notes for reproduction later, but let me know if you need something in particular.

ishitatsuyuki commented 1 year ago

Instrument runs fine with llvm main so it's some fun bug that got fixed between the 15.0 and 16.0 cycle...

rui314 commented 1 year ago

If we use LLVM 16, can we use BOLT with mold without doing anything special?

ishitatsuyuki commented 1 year ago

I still haven't tested the optimization pass (the part that is not instrumentation) yet.

ishitatsuyuki commented 1 year ago

I think I spoke too soon. Bisection resulted in an all-pass so this might be an environment dependent problem (build flags? silent memory errors?).

ishitatsuyuki commented 1 year ago

This line is failing, even in main: https://github.com/llvm/llvm-project/blob/3ee58e2f355f8fdb8e0fe29dc366c8833fafa7d3/bolt/lib/Rewrite/RewriteInstance.cpp#L1885

Lesson to me is to always turn on LLVM assertions when debugging. Without assertions this turned into a heisenbug.

I'll try to see why.

rui314 commented 1 year ago

I usually build LLVM (and mold) with -DCMAKE_RELEASE_TYPE=Debug unless I'm doing performance optimization.

ishitatsuyuki commented 1 year ago

Looks like this was the real problem:

llvm-bolt: /home/ishitatsuyuki/llvm-project/bolt/lib/Core/BinaryContext.cpp:753: llvm::bolt::BinaryFunction *llvm::bolt::BinaryContext::createBinaryFunction(const std::string &, llvm::bolt::BinarySection &, uint64_t, uint64_t, uint64_t, uint16_t): Assertion `Result.second == true && "unexpected duplicate function"' failed.
 #0 0x000055839b75737a llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /home/ishitatsuyuki/llvm-project/llvm/lib/Support/Unix/Signals.inc:569:11
 #1 0x000055839b75752b PrintStackTraceSignalHandler(void*) /home/ishitatsuyuki/llvm-project/llvm/lib/Support/Unix/Signals.inc:636:1
 #2 0x000055839b755b76 llvm::sys::RunSignalHandlers() /home/ishitatsuyuki/llvm-project/llvm/lib/Support/Signals.cpp:104:5
 #3 0x000055839b757c55 SignalHandler(int) /home/ishitatsuyuki/llvm-project/llvm/lib/Support/Unix/Signals.inc:407:1
 #4 0x00007fded65a2520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #5 0x00007fded65f6a7c __pthread_kill_implementation ./nptl/pthread_kill.c:44:76
 #6 0x00007fded65f6a7c __pthread_kill_internal ./nptl/pthread_kill.c:78:10
 #7 0x00007fded65f6a7c pthread_kill ./nptl/pthread_kill.c:89:10
 #8 0x00007fded65a2476 gsignal ./signal/../sysdeps/posix/raise.c:27:6
 #9 0x00007fded65887f3 abort ./stdlib/abort.c:81:7
#10 0x00007fded658871b _nl_load_domain ./intl/loadmsgcat.c:1177:9
#11 0x00007fded6599e96 (/lib/x86_64-linux-gnu/libc.so.6+0x39e96)
#12 0x000055839d0e9105 llvm::bolt::BinaryContext::createBinaryFunction(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>> const&, llvm::bolt::BinarySection&, unsigned long, unsigned long, unsigned long, unsigned short) /home/ishitatsuyuki/llvm-project/bolt/lib/Core/BinaryContext.cpp:754:32
#13 0x000055839b872746 llvm::bolt::RewriteInstance::createPLTBinaryFunction(unsigned long, unsigned long, unsigned long) /home/ishitatsuyuki/llvm-project/bolt/lib/Rewrite/RewriteInstance.cpp:1268:6
#14 0x000055839b873002 llvm::bolt::RewriteInstance::disassemblePLTSectionX86(llvm::bolt::BinarySection&, unsigned long) /home/ishitatsuyuki/llvm-project/bolt/lib/Rewrite/RewriteInstance.cpp:1386:3
#15 0x000055839b873220 llvm::bolt::RewriteInstance::disassemblePLT()::$_9::operator()(llvm::bolt::BinarySection&, unsigned long) const /home/ishitatsuyuki/llvm-project/bolt/lib/Rewrite/RewriteInstance.cpp:1394:3
#16 0x000055839b871838 llvm::bolt::RewriteInstance::disassemblePLT() /home/ishitatsuyuki/llvm-project/bolt/lib/Rewrite/RewriteInstance.cpp:1401:5
#17 0x000055839b86c4dc llvm::bolt::RewriteInstance::discoverFileObjects() /home/ishitatsuyuki/llvm-project/bolt/lib/Rewrite/RewriteInstance.cpp:1173:3
#18 0x000055839b868952 llvm::bolt::RewriteInstance::run() /home/ishitatsuyuki/llvm-project/bolt/lib/Rewrite/RewriteInstance.cpp:720:3
#19 0x000055839ac9b76b main /home/ishitatsuyuki/llvm-project/bolt/tools/driver/llvm-bolt.cpp:244:17
#20 0x00007fded6589d90 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#21 0x00007fded6589e40 call_init ./csu/../csu/libc-start.c:128:20
#22 0x00007fded6589e40 __libc_start_main ./csu/../csu/libc-start.c:379:5
#23 0x000055839ac9a6d5 _start (./build/bin/llvm-bolt+0x2326d5)
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: ./build/bin/llvm-bolt -instrument /home/ishitatsuyuki/rust/obj/build/x86_64-unknown-linux-gnu/llvm/build/lib/libLLVM-15-rust-1.66.0-nightly.so -o /tmp/out.so

Looks PLT related.

ishitatsuyuki commented 1 year ago

(Would love Rui's thoughts on this since I don't really know how mold's PLT differs from the psABI. FYI, the BOLT PLT scanner is here).

rui314 commented 1 year ago

Here is where we create a PLT header and PLT entries: https://github.com/rui314/mold/blob/main/elf/arch-x86-64.cc#L34-L75

The biggest difference compared to the one in the psABI is that our PLT entry has only one entry point. The standard PLT entry looks like this:

jmp *GOTPLT_ETNRY_FOR_THE_SYMBOL
push PLT_INDEX
jmp PLT_HDR

Initially, the .got.plt entry for symbol foo is set to the middle of foo's PLT entry, so that when the PLT entry is executed for the first time, it jumps back to the push instruction. The push instruction sets a PLT index and jump to the beginning of the PLT header for lazy symbol resolution.

Our PLT looks like this:

endbr64
mov PLT_INDEX, %r11
jmp *GOTPLT_ENTRY_FOR_THE_SYMBOL

Initially, unlike the standard PLT entries, all .got.plt entries point to the beginning of the PLT section. We unconditionally set %r11 to the PLT index. If the PLT entry is called for the first time, the PLT header uses %r11's value to know which PLT entry to fix. If a PLT entry is already resolved, it's a dead-store to %r11, but it should be harmless.

ishitatsuyuki commented 1 year ago

After a close look it seems that the PLT format is actually fine since BOLT sequentially looks for the first jmp, which has the same target.

It looks like this conflicts with preexisting local symbols like foo$plt though, which the assertion is complaining about: I need to understand why these exist and what are these supposed to be.

ishitatsuyuki commented 1 year ago

urgh, the foo$plt thing is a regression after 1.5.1. I need to bisect.

ishitatsuyuki commented 1 year ago

I'm dumb, it was literally implemented in a7f1e20594ac6f86982ae22979e58f8fcf810900 as a feature...

ishitatsuyuki commented 1 year ago

So, the PLT-related assertion is a bug in BOLT, but we might want to add a flag to disable the synthesized symbols, to allow workarounding this issue.

There seems to be an unrelated regression from a7f1e20 too: some symbol's visibility gets filled with garbage value like below:

Symbol table '.dynsym' contains 7 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND fprintf@GLIBC_2.2.5 (2)
     2: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND strtol@GLIBC_2.2.5 (2)
     3: 0000000000000000     0 OBJECT  GLOBAL DEFAULT  UND stdin@GLIBC_2.2.5 (2)
     4: 0000000000000000     0 OBJECT  GLOBAL DEFAULT  UND stderr@GLIBC_2.2.5 (2)
     5: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND fread@GLIBC_2.2.5 (2)
     6: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND feof@GLIBC_2.2.5 (2)

Symbol table '.symtab' contains 62 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000200238     0 SECTION LOCAL  DEFAULT    1 .hash
     2: 0000000000200278     0 SECTION LOCAL  DEFAULT    2 .gnu.hash
     3: 0000000000200298     0 SECTION LOCAL  DEFAULT    3 .dynsym
     4: 0000000000200340     0 SECTION LOCAL  DEFAULT    4 .dynstr
     5: 00000000002003ba     0 SECTION LOCAL  DEFAULT    5 .gnu.version
     6: 00000000002003c8     0 SECTION LOCAL  DEFAULT    6 .gnu.version_r
     7: 00000000002003e8     0 SECTION LOCAL  DEFAULT    7 .rela.dyn
     8: 0000000000200418     0 SECTION LOCAL  DEFAULT    8 .rela.plt
     9: 0000000000200478     0 SECTION LOCAL  DEFAULT    9 .eh_frame
    10: 00000000002004e4     0 SECTION LOCAL  DEFAULT   10 .eh_frame_hdr
    11: 0000000000200500     0 SECTION LOCAL  DEFAULT   11 .rodata.cst
    12: 0000000000200520     0 SECTION LOCAL  DEFAULT   12 .rodata.str
    13: 00000000002015a0     0 SECTION LOCAL  DEFAULT   13 .plt
    14: 0000000000201600     0 SECTION LOCAL  DEFAULT   14 .text
    15: 0000000000202800     0 SECTION LOCAL  DEFAULT   15 .dynamic
    16: 0000000000202a20     0 SECTION LOCAL  DEFAULT   16 .got
    17: 0000000000202a30     0 SECTION LOCAL  DEFAULT   17 .relro_padding
    18: 0000000000203a30     0 SECTION LOCAL  DEFAULT   18 .got.plt
    19: 0000000000202a20     0 OBJECT  LOCAL  DEFAULT   16 stdin@got
    20: 0000000000202a28     0 OBJECT  LOCAL  DEFAULT   16 stderr@got
    21: 00000000002015c0     0 FUNC    LOCAL  DEFAULT   13 fprintf@plt
    22: 00000000002015d0     0 FUNC    LOCAL  DEFAULT   13 strtol@plt
    23: 00000000002015e0     0 FUNC    LOCAL  DEFAULT   13 fread@plt
    24: 00000000002015f0     0 FUNC    LOCAL  DEFAULT   13 feof@plt
    25: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS count.c
    26: 0000000000200510     0 NOTYPE  LOCAL  DEFAULT   11 .LCPI0_0
    27: 0000000000200500     0 NOTYPE  LOCAL  DEFAULT   11 .LCPI0_1
    28: 0000000000200576     0 OBJECT  LOCAL  DEFAULT   12 .L.str
    29: 0000000000200520     0 OBJECT  LOCAL  DEFAULT   12 .L.str.2
    30: 000000000020055a     0 OBJECT  LOCAL  DEFAULT   12 .L.str.3
    31: 0000000000200539     0 OBJECT  LOCAL  DEFAULT   12 .L.str.1
    32: 0000000000201600   508 FUNC    LOCAL  DEFAULT   14 main
    33: 0000000000200000     0 NOTYPE  LOCAL  DEFAULT    1 __ehdr_start
    34: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT   21 __init_array_start
    35: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT   21 __init_array_end
    36: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT   21 __fini_array_start
    37: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT   21 __fini_array_end
    38: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT   21 __preinit_array_start
    39: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT   21 __preinit_array_end
    40: 0000000000202800     0 NOTYPE  LOCAL  DEFAULT   15 _DYNAMIC
    41: 0000000000203a30     0 NOTYPE  LOCAL  DEFAULT   18 _GLOBAL_OFFSET_TABLE_
    42: 00000000002015a0     0 NOTYPE  LOCAL  DEFAULT   13 _PROCEDURE_LINKAGE_TABLE_
    43: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT   21 __bss_start
    44: 0000000000203a68     0 NOTYPE  LOCAL  DEFAULT   18 _end
    45: 00000000002017fc     0 NOTYPE  LOCAL  DEFAULT   14 _etext
    46: 0000000000203a68     0 NOTYPE  LOCAL  DEFAULT   18 _edata
    47: 0000000000200000     0 NOTYPE  LOCAL  DEFAULT    1 __executable_start
    48: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT   21 __rela_iplt_start
    49: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT   21 __rela_iplt_end
    50: 00000000002004e4     0 NOTYPE  LOCAL  DEFAULT   10 __GNU_EH_FRAME_HDR
    51: 0000000000203a68     0 NOTYPE  LOCAL  DEFAULT   18 end
    52: 00000000002017fc     0 NOTYPE  LOCAL  DEFAULT   14 etext
    53: 0000000000203a68     0 NOTYPE  LOCAL  DEFAULT   18 edata
    54: 0000000000200000     0 NOTYPE  LOCAL  DEFAULT    1 __dso_handle
    55: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT    1 _TLS_MODULE_BASE_
    56: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT [<other>: 20]   UND fprintf
    57: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT [<other>: 34]   UND strtol
    58: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND stdin
    59: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND stderr
    60: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND fread
    61: 0000000000000000     0 NOTYPE  GLOBAL DEFAULT  UND feof

(the [<other>: xx] part)

A repro suite is attached. repro.tar.zip

rui314 commented 1 year ago

Instead of adding a command line option to mold to suppress the linker-synthesized symbols, I'd fix BOLT. BOLT should not be affected by these symbols.

rui314 commented 1 year ago

I can't reproduce the symbol visibility issue with git head, so maybe it's already fixed?

ishitatsuyuki commented 1 year ago

Instead of adding a command line option to mold to suppress the linker-synthesized symbols, I'd fix BOLT. BOLT should not be affected by these symbols.

Ack. Will file a bug report later.

I can't reproduce the symbol visibility issue with git head, so maybe it's already fixed?

Bisect says it's gone after https://github.com/rui314/mold/commit/d2b7b04be30b2459eceb75c3797e9725e1605f2c. But that commit doesn't really touch visibility, and it looks like the exact corrupted symbols differs between builds, so I wonder if there's some nondeterminism involved here.


After switching to v1.5.1 which doesn't emit the $plt symbols, BOLT is now complaining values calculated from relocations doesn't match the ones inside the executable:

llvm-bolt: /home/ishitatsuyuki/llvm-project/bolt/lib/Rewrite/RewriteInstance.cpp:1972: bool llvm::bolt::RewriteInstance::analyzeRelocation(const llvm::object::RelocationRef &, uint64_t, std::string &, bool &, uint64_t &, int64_t &, uint64_t &, bool &) const: Assertion `verifyExtractedValue() && "mismatched extracted relocation value"' failed.

readelf also complains about invalid symbol indices, so I'm suspecting --emit-relocs isn't working quite as expected. I'll report when I have a reduced reproduction sample.

readelf: Error:  bad symbol index: 0002f885 in reloc
readelf: Error:  bad symbol index: 0002f886 in reloc
readelf: Error:  bad symbol index: 0002f887 in reloc
readelf: Error:  bad symbol index: 0002f888 in reloc
readelf: Error:  bad symbol index: 0002f889 in reloc
readelf: Error:  bad symbol index: 0002f888 in reloc
readelf: Error:  bad symbol index: 0002f88a in reloc
readelf: Error:  bad symbol index: 0002f888 in reloc
readelf: Error:  bad symbol index: 0002f88b in reloc
...
ishitatsuyuki commented 1 year ago

https://github.com/rui314/mold/commit/1124322ce85a17bc137f804c63eace3b794aa31d helps: BOLT now progress through a few symbols before hitting assert, and readelf errors are down to 5 entries. However something still seems to be wrong.

Edit: I probably know why: https://github.com/rui314/mold/blob/288cd5b4007a0a20da9aab52ab56067ad46866e8/elf/passes.cc#L1063 seems to need an update too.

ishitatsuyuki commented 1 year ago

With #798 and a hack to disable $plt-like symbols (in https://github.com/ishitatsuyuki/mold/tree/bolt) BOLT is working flawlessly :tada: :tada: :tada:

I used Rust's PGO build pipeline, which has full ThinLTO+PGO+BOLT for its LLVM component. I had to do a few quick hacks, namely changing the base distribution to get some decent support of C++20. The diff is available at https://github.com/ishitatsuyuki/rust/tree/moldBOLT.

The resulting binary should also be running fine since the pipeline uses the just-built compiler to compile the ecosystem tools (Cargo, rust-analyzer etc.), but I'll just verify with a icache heatmap on my Intel laptop later, just in case.

rui314 commented 1 year ago

Great work! It's impressive that you identified and resolved the issue so quickly. Thank you!

ptr1337 commented 1 year ago

@rui314 @ishitatsuyuki

Great work! I just tested to instrument clang, and it has worked. Gathering now the profile, then the optimization process will be done. So far it seems to be working well.

BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 39f01240e76678fa6385d36e7a96670677a26d65
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0x4e00000, offset 0x4e00000
BOLT-INFO: enabling relocation mode
BOLT-INFO: forcing -jump-tables=move for instrumentation
BOLT-INFO: enabling -align-macro-fusion=all since no profile was specified
BOLT-INFO: enabling lite mode
BOLT-WARNING: _etext/1 (0x490955c) does not have any section
BOLT-WARNING: etext/1 (0x490955c) does not have any section
BOLT-WARNING: Failed to analyze 470109 relocations
BOLT-WARNING: 6 collisions detected while hashing binary objects. Use -v=1 to see the list.
BOLT-INFO: 0 out of 97833 functions in the binary (0.0%) have non-empty execution profile
BOLT-INFO: the input contains 17357 (dynamic count : 0) opportunities for macro-fusion optimization that are going to be fixed
BOLT-INSTRUMENTER: Number of indirect call site descriptors: 28238
BOLT-INSTRUMENTER: Number of indirect call target descriptors: 94233
BOLT-INSTRUMENTER: Number of function descriptors: 94233
BOLT-INSTRUMENTER: Number of branch counters: 904843
BOLT-INSTRUMENTER: Number of ST leaf node counters: 443070
BOLT-INSTRUMENTER: Number of direct call counters: 0
BOLT-INSTRUMENTER: Total number of counters: 1347913
BOLT-INSTRUMENTER: Total size of counters: 10783304 bytes (static alloc memory)
BOLT-INSTRUMENTER: Total size of string table emitted: 11389554 bytes in file
BOLT-INSTRUMENTER: Total size of descriptors: 96355088 bytes in file
BOLT-INSTRUMENTER: Profile will be saved to file /home/ptr1337/toolchain/llvm/stage3-without-sampling/intrumentdata/clang-15.fdata
BOLT-INFO: 1134971 instructions were shortened
BOLT-INFO: removed 109 empty blocks
BOLT-INFO: merged 1 duplicate CFG edge
BOLT-INFO: UCE removed 838 blocks and 51318 bytes of code.
BOLT-INFO: SCTC: patched 0 tail calls (0 forward) tail calls (0 backward) from a total of 0 while removing 0 double jumps and removing 0 basic blocks totalling 0 bytes of code. CTCs total execution count is 0 and the number of times CTCs are taken is 0.
BOLT-INFO: output linked against instrumentation runtime library, lib entry point is 0xc3944c0
BOLT-INFO: clear procedure is 0xc38f360
BOLT-INFO: setting _end to 0xc514c9c

The only thing, which seems different then with lld to me is:

BOLT-WARNING: _etext/1 (0x490955c) does not have any section
BOLT-WARNING: etext/1 (0x490955c) does not have any section

Same as this, seems to me a bit high:

BOLT-WARNING: Failed to analyze 470109 relocations

Thanks!

edit:

Instrumentation, merging fdata and the optimization was successful! Here the log:

BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 39f01240e76678fa6385d36e7a96670677a26d65
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-WARNING: _etext/1 (0x490955c) does not have any section
BOLT-WARNING: etext/1 (0x490955c) does not have any section
BOLT-WARNING: Failed to analyze 470109 relocations
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-WARNING: 1 collisions detected while hashing binary objects. Use -v=1 to see the list.
BOLT-INFO: 10210 out of 97833 functions in the binary (10.4%) have non-empty execution profile
BOLT-INFO: 649 functions with profile could not be optimized
BOLT-WARNING: 5 (0.0% of all profiled) functions have invalid (possibly stale) profile. Use -report-stale to see the list.
BOLT-WARNING: 460262 out of 243184719344 samples in the binary (0.0%) belong to functions with invalid (possibly stale) profile.
BOLT-INFO: profile for 1 objects was ignored
BOLT-INFO: the input contains 4674 (dynamic count : 3071330706) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 354969 instructions were shortened
BOLT-INFO: removed 851 empty blocks
BOLT-INFO: ICF folded 1478 out of 98109 functions in 4 passes. 0 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 166.21 KB of code space. Folded functions were called 3122835745 times based on profile.
BOLT-INFO: 31975 PLT calls in the binary were optimized.
BOLT-INFO: basic block reordering modified layout of 4922 functions (48.21% of profiled, 5.09% of total)
BOLT-INFO: UCE removed 1 blocks and 6 bytes of code.
BOLT-INFO: splitting separates 4610217 hot bytes from 3731975 cold bytes (55.26% of split functions is hot).
BOLT-INFO: 129 Functions were reordered by LoopInversionPass
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

        138489556757 : executed forward branches
         20942286787 : taken forward branches
         46989998144 : executed backward branches
         28605377103 : taken backward branches
          9641011040 : executed unconditional branches
         14576806007 : all function calls
          7293536017 : indirect calls
          3769079744 : PLT calls
       1222503556165 : executed instructions
        298324511206 : executed load instructions
        120845030651 : executed store instructions
                   0 : taken jump table branches
                   0 : taken unknown indirect branches
        195120565941 : total branches
         59188674930 : taken branches
        135931891011 : non-taken conditional branches
         49547663890 : taken conditional branches
        185479554901 : all conditional branches

        124400786448 : executed forward branches (-10.2%)
          8624897609 : taken forward branches (-58.8%)
         61078768453 : executed backward branches (+30.0%)
         28657226155 : taken backward branches (+0.2%)
          6073827710 : executed unconditional branches (-37.0%)
         10807726263 : all function calls (-25.9%)
          7293536017 : indirect calls (=)
                   0 : PLT calls (-100.0%)
       1209826144728 : executed instructions (-1.0%)
        298324511206 : executed load instructions (=)
        120845030651 : executed store instructions (=)
                   0 : taken jump table branches (=)
                   0 : taken unknown indirect branches (=)
        191553382611 : total branches (-1.8%)
         43355951474 : taken branches (-26.7%)
        148197431137 : non-taken conditional branches (+9.0%)
         37282123764 : taken conditional branches (-24.8%)
        185479554901 : all conditional branches (=)

BOLT-INFO: SCTC: patched 39 tail calls (37 forward) tail calls (2 backward) from a total of 39 while removing 4 double jumps and removing 31 basic blocks totalling 155 bytes of code. CTCs total execution count is 2168704 and the number of times CTCs are taken is 1081892.
BOLT-INFO: setting _end to 0x619ea10
BOLT-INFO: setting __hot_start to 0x4e00000
BOLT-INFO: setting __hot_end to 0x57411d6
rui314 commented 1 year ago

@ishitatsuyuki You may want to test BOLT with a program that uses exceptions, as clang and (I believe) rustc don't use exceptions. That part might not be tested yet.

maksfb commented 1 year ago

Our .eh_frame section does not have a relocation section even if --emit-relocs is given. We synthesize .eh_frame, and we probably need to synthesize .rela.eh_frame too

BOLT doesn't need .rela.eh_frame as it creates internal relocations while parsing the section, i.e. kinda synthesizes .rela.eh_frame internally.

rui314 commented 1 year ago

BOLT doesn't need .rela.eh_frame as it creates internal relocations while parsing the section, i.e. kinda synthesizes .rela.eh_frame internally.

Oh cool, then we don't need to synthsize .rela.eh_frame. Exceptions could fail for other reasons, so it's nice to test it though.

ishitatsuyuki commented 1 year ago

The BOLT-WARNING: Failed to analyze 470109 relocations is because BOLT don't know how to read the GOT32 relocations; it's probably relaxed by bfd but not mold. It seems harmless so far but I haven't checked in detail.

rui314 commented 1 year ago

@ishitatsuyuki I'd file that to BOLT even if it's harmless.

ishitatsuyuki commented 1 year ago

@ishitatsuyuki You may want to test BOLT with a program that uses exceptions, as clang and (I believe) rustc don't use exceptions. That part might not be tested yet.

The real-life example of BOLT seems really scarce, so I don't have a good idea what to test. I'll run BOLT's own test suite in LLVM with mold which has some CFI tests.

@ishitatsuyuki I'd file that to BOLT even if it's harmless.

Ack.

rui314 commented 1 year ago

Maybe @maksfb has an idea as to what programs are suitable for testing BOLT?

maksfb commented 1 year ago

BOLT test suite has several runnable x86 tests for C++ exceptions. E.g. https://github.com/llvm/llvm-project/blob/main/bolt/test/runtime/X86/exceptions-run.test

For large open-source project with C++ exceptions, I can recommend trying HHVM (hhvm.com). I've also heard that ScyllaDB uses BOLT, but I have no first-hand experience with it.

ishitatsuyuki commented 1 year ago

I ran the BOLT test suite with mold (-DBOLT_LLD_EXE=/path/to/mold):

Failed Tests (10):

Logs for X86/dynrelocs.s:

BOLT-WARNING: _etext/1 (0x20139d) does not have any section
BOLT-WARNING: etext/1 (0x20139d) does not have any section
llvm-bolt: /home/ishitatsuyuki/llvm-project/bolt/lib/Rewrite/RewriteInstance.cpp:2595: void llvm::bolt::RewriteInstance::handleRelocation(const llvm::object::SectionRef &, const llvm::object::RelocationRef &): Assertion `(IsAArch64 || IsSectionRelocation || BD->nameStartsWith(SymbolName) || BD->nameStartsWith("PG" + SymbolName) || (BD->nameStartsWith("ANONYMOUS") && (BD->getSectionName().startswith(".plt") || BD->getSectionName().endswith(".plt")))) && "BOLT symbol names of all non-section relocations must match up " "with symbol names referenced in the relocation"' failed.
 #0 0x000056123b12837a llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /home/ishitatsuyuki/llvm-project/llvm/lib/Support/Unix/Signals.inc:569:11
 #1 0x000056123b12852b PrintStackTraceSignalHandler(void*) /home/ishitatsuyuki/llvm-project/llvm/lib/Support/Unix/Signals.inc:636:1
 #2 0x000056123b126b76 llvm::sys::RunSignalHandlers() /home/ishitatsuyuki/llvm-project/llvm/lib/Support/Signals.cpp:104:5
 #3 0x000056123b128c55 SignalHandler(int) /home/ishitatsuyuki/llvm-project/llvm/lib/Support/Unix/Signals.inc:407:1
 #4 0x00007ff3d0334520 (/lib/x86_64-linux-gnu/libc.so.6+0x42520)
 #5 0x00007ff3d0388a7c __pthread_kill_implementation ./nptl/./nptl/pthread_kill.c:44:76
 #6 0x00007ff3d0388a7c __pthread_kill_internal ./nptl/./nptl/pthread_kill.c:78:10
 #7 0x00007ff3d0388a7c pthread_kill ./nptl/./nptl/pthread_kill.c:89:10
 #8 0x00007ff3d0334476 gsignal ./signal/../sysdeps/posix/raise.c:27:6
 #9 0x00007ff3d031a7f3 abort ./stdlib/./stdlib/abort.c:81:7
#10 0x00007ff3d031a71b _nl_load_domain ./intl/./intl/loadmsgcat.c:1177:9
#11 0x00007ff3d032be96 (/lib/x86_64-linux-gnu/libc.so.6+0x39e96)
#12 0x000056123b1f03b4 llvm::bolt::RewriteInstance::handleRelocation(llvm::object::SectionRef const&, llvm::object::RelocationRef const&) /home/ishitatsuyuki/llvm-project/bolt/lib/Rewrite/RewriteInstance.cpp:2588:7
#13 0x000056123b1ed645 llvm::bolt::RewriteInstance::readRelocations(llvm::object::SectionRef const&) /home/ishitatsuyuki/llvm-project/bolt/lib/Rewrite/RewriteInstance.cpp:2329:33
#14 0x000056123b1ea978 llvm::bolt::RewriteInstance::processRelocations() /home/ishitatsuyuki/llvm-project/bolt/lib/Rewrite/RewriteInstance.cpp:2007:3
#15 0x000056123b1e507f llvm::bolt::RewriteInstance::discoverFileObjects() /home/ishitatsuyuki/llvm-project/bolt/lib/Rewrite/RewriteInstance.cpp:1240:1
#16 0x000056123b1e0ec2 llvm::bolt::RewriteInstance::run() /home/ishitatsuyuki/llvm-project/bolt/lib/Rewrite/RewriteInstance.cpp:720:3
#17 0x000056123a7340fb main /home/ishitatsuyuki/llvm-project/bolt/tools/driver/llvm-bolt.cpp:244:17
#18 0x00007ff3d031bd90 __libc_start_call_main ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
#19 0x00007ff3d031be40 call_init ./csu/../csu/libc-start.c:128:20
#20 0x00007ff3d031be40 __libc_start_main ./csu/../csu/libc-start.c:379:5
#21 0x000056123a733065 _start (/home/ishitatsuyuki/llvm-project/build/bin/llvm-bolt+0x260065)
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0.      Program arguments: /home/ishitatsuyuki/llvm-project/build/bin/llvm-bolt -data /home/ishitatsuyuki/llvm-project/build/tools/bolt/test/X86/Output/dynrelocs.s.tmp.fdata /home/ishitatsuyuki/llvm-project/build/tools/bolt/test/X86/Output/dynrelocs.s.tmp.exe -relocs -o /home/ishitatsuyuki/llvm-project/build/tools/bolt/test/X86/Output/dynrelocs.s.tmp.out -lite=0 -jump-tables=move
/home/ishitatsuyuki/llvm-project/build/tools/bolt/test/X86/Output/dynrelocs.s.script: line 8: 749269 Aborted                 (core dumped) /home/ishitatsuyuki/llvm-project/build/bin/llvm-bolt -data /home/ishitatsuyuki/llvm-project/build/tools/bolt/test/X86/Output/dynrelocs.s.tmp.fdata /home/ishitatsuyuki/llvm-project/build/tools/bolt/test/X86/Output/dynrelocs.s.tmp.exe -relocs -o /home/ishitatsuyuki/llvm-project/build/tools/bolt/test/X86/Output/dynrelocs.s.tmp.out -lite=0 -jump-tables=move
Testing Time: 22.61s
  Unsupported:  26
  Passed     : 194
  Failed     :  10
ishitatsuyuki commented 1 year ago

@rui314 Any idea where a GOT32 reloc can come from? You can see it in e.g. the emit-relocs test.

I readelf'd all the input .o and .a files but they didn't seem to contain any. A quick grep of mold code don't show any uses that assigns the value of GOT32, so I don't have a good clue here.

In bfd, the symbols seems to get assigned a PC64 reloc instead, although I'm not 100% sure it's pointing to the same thing.

Finally there's a chance that GOT32 is just a bogus value we assign due to some bug. I see a bunch of these for .debug_lines too.

rui314 commented 1 year ago

It looks like mold creates GOT32 relocations for section symbols. We discard section symbols in input files and re-create new section symbols for output sections, as we want to keep only one section symbol for each section.

I don't know why it lead to creating GOT32 relocations, though. But the relocation type looks bogus.

ishitatsuyuki commented 1 year ago

OK, this is the culprit then :sweat_smile:

buf[j].r_type = STT_SECTION;

(STT_SECTION = R_X86_64_GOT32)

rui314 commented 1 year ago

LOL

ishitatsuyuki commented 1 year ago

With #833 BOLT :: X86/data-to-data-pcrel.s is fixed. The remaining 9 still fails and there still seems to be "relocation parse failed" messages (Update: gone after rebuild). I'll check what BOLT is complaning about this time.

ishitatsuyuki commented 1 year ago

There are still 9 failures but I think we're mostly done for it now. Reasons it's failing seems to be due to one of: 1. mold defaulting to LOCAL for executables (need --export-dynamic), 2. mold generating more symbols that have the same address as others (probably harmless, just a different name), and 3. straight out hardcoded expectations. I'll test HHVM next time.