llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org

LLVM-BOLT and glibc on x86_64 #85567

Open romanovj opened 6 months ago

romanovj commented 6 months ago

First error: no DT_FINI || DT_FINI_ARRAY. Fixed by creating one:

static void fini(void) {}
__attribute__((section(".fini_array"), used)) static typeof(fini) *fini_p = fini;
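
One way to confirm that the workaround took effect is to inspect the dynamic section of the rebuilt library. The helper below is a minimal sketch, not part of BOLT or glibc: `has_fini_entry` is a hypothetical name, and it assumes the usual `readelf -d` layout in which the tag name is printed in parentheses.

```shell
# Hypothetical helper: reads `readelf -d` output on stdin and succeeds
# if a DT_FINI or DT_FINI_ARRAY entry is present.
has_fini_entry() {
    grep -Eq '\((FINI|FINI_ARRAY)\)'
}

# Usage (assumes binutils readelf is installed):
#   readelf -d libc.so | has_fini_entry && echo "fini entry present"
```

Matching on the closing parenthesis keeps related tags such as FINI_ARRAYSZ from counting as a hit on their own.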

Second error: BOLT can't disassemble function __longjmp_chk (the asm code of this function is in sysdeps/x86_64/longjmp.S). Fixed by replacing the asm code with nop =)

Third error: BOLT-ERROR: Offset overflow for dynamic relocation

BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 8c3b2f419a0eb5d5b4702568beee561b740e0b08
BOLT-INFO: first alloc address is 0x0
BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-ERROR: function __restore_rt/1 is in conflict with FDE [398ff, 39909). Skipping.
BOLT-WARNING: sizes differ for function __setcontext/1. FDE : 125; symbol table : 145. Using max size.
BOLT-WARNING: sizes differ for function setcontext. FDE : 125; symbol table : 145. Using max size.
BOLT-WARNING: sizes differ for function __GI___clone/1. FDE : 48; symbol table : 91. Using max size.
BOLT-WARNING: sizes differ for function clone. FDE : 48; symbol table : 91. Using max size.
BOLT-WARNING: sizes differ for function __clone. FDE : 48; symbol table : 91. Using max size.
BOLT-WARNING: sizes differ for function __clone3/1. FDE : 23; symbol table : 67. Using max size.
BOLT-WARNING: sizes differ for function clone3/1. FDE : 23; symbol table : 67. Using max size.
BOLT-WARNING: sizes differ for function __GI___clone3/1. FDE : 23; symbol table : 67. Using max size.
BOLT-WARNING: FDE [0x3df4d, 0x3df61) conflicts with function __setcontext/1(*2)
BOLT-WARNING: FDE [0xfcc1a, 0xfcc2a) conflicts with function __GI___clone/1(*3)
BOLT-WARNING: FDE [0xfcc2a, 0xfcc3b) conflicts with function __GI___clone/1(*3)
BOLT-WARNING: FDE [0xfcd81, 0xfcd92) conflicts with function __clone3/1(*3)
BOLT-WARNING: FDE [0xfcd92, 0xfcda3) conflicts with function __clone3/1(*3)
BOLT-ERROR: symbol seen in the middle of the function __BOLT_FDE_FUNCat398ff. Skipping.
BOLT-INFO: pre-processing profile using branch profile reader
BOLT-INFO: forcing -jump-tables=move as PIC jump table was detected in function __printf_buffer_flush/1
BOLT-INFO: 321 out of 3749 functions in the binary (8.6%) have non-empty execution profile
BOLT-INFO: 10 functions with profile could not be optimized
BOLT-WARNING: 270 (84.1% of all profiled) functions have invalid (possibly stale) profile. Use -report-stale to see the list.
BOLT-WARNING: 1108082 out of 1109415 samples in the binary (99.9%) belong to functions with invalid (possibly stale) profile.
BOLT-INFO: profile for 1 objects was ignored
BOLT-INFO: the input contains 60 (dynamic count : 0) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 4846 instructions were shortened
BOLT-INFO: removed 1226 empty blocks
BOLT-INFO: removed 1 'repz' prefixes with estimated execution count of 0 times.
BOLT-INFO: basic block reordering modified layout of 13 functions (4.05% of profiled, 0.35% of total)
BOLT-INFO: UCE removed 1 blocks and 1 bytes of code
BOLT-INFO: splitting separates 3231 hot bytes from 11944 cold bytes (21.29% of split functions is hot).
BOLT-INFO: 1 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 330 to 306
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

                 325 : executed forward branches
                  34 : taken forward branches
                 197 : executed backward branches
                 133 : taken backward branches
                  20 : executed unconditional branches
               43301 : all function calls
                  16 : indirect calls
                   0 : PLT calls
              177808 : executed instructions
                1931 : executed load instructions
                 985 : executed store instructions
                   0 : taken jump table branches
                   0 : taken unknown indirect branches
                 542 : total branches
                 187 : taken branches
                 355 : non-taken conditional branches
                 167 : taken conditional branches
                 522 : all conditional branches

                 374 : executed forward branches (+15.1%)
                   5 : taken forward branches (-85.3%)
                 148 : executed backward branches (-24.9%)
                  99 : taken backward branches (-25.6%)
                  25 : executed unconditional branches (+25.0%)
               43301 : all function calls (=)
                  16 : indirect calls (=)
                   0 : PLT calls (=)
              177820 : executed instructions (+0.0%)
                1931 : executed load instructions (=)
                 985 : executed store instructions (=)
                   0 : taken jump table branches (=)
                   0 : taken unknown indirect branches (=)
                 547 : total branches (+0.9%)
                 129 : taken branches (-31.0%)
                 418 : non-taken conditional branches (+17.7%)
                 104 : taken conditional branches (-37.7%)
                 522 : all conditional branches (=)

BOLT-INFO: SCTC: patched 8 tail calls (8 forward) tail calls (0 backward) from a total of 8 while removing 0 double jumps and removing 8 basic blocks totalling 40 bytes of code. CTCs total execution count is 0 and the number of times CTCs are taken is 0
BOLT-WARNING: failed to patch entries in __memmove_sse2_unaligned_erms/1(*2). The function will not be optimized.
BOLT-INFO: padding code to 0x400000 to accommodate hot text
BOLT-INFO: setting __hot_start to 0x200000
BOLT-INFO: setting __hot_end to 0x22d38c
BOLT-INFO: patched build-id (flipped last bit)
BOLT-ERROR: Offset overflow for dynamic relocation

I don't know how to fix it.

ms178 commented 6 months ago

I guess you use the highly experimental azanella/clang branch of glibc that works towards compatibility with clang. I've tested that some weeks ago and found some bugs when using it, it isn't there just yet.

I can't tell myself if the issue you described here is a Clang, a BOLT or a Glibc problem, but maybe @zatrazz can; I pinged him for awareness, and he might have tried using BOLT on Glibc himself.

romanovj commented 6 months ago

@ms178 glibc 2.39, gcc 13.2.1, llvm-bolt 18.1.1

more details about second error:

BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: `8c3b2f419a0eb5d5b4702568beee561b740e0b08`
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0x200000, offset 0x200000
BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
BOLT-INFO: enabling relocation mode
BOLT-INFO: forcing -jump-tables=move for instrumentation
BOLT-INFO: enabling -align-macro-fusion=all since no profile was specified
BOLT-INFO: enabling lite mode
BOLT-WARNING: sizes differ for function __clone3/1. FDE : 27; symbol table : 71. Using max size.
BOLT-WARNING: sizes differ for function clone3/1. FDE : 27; symbol table : 71. Using max size.
BOLT-WARNING: sizes differ for function __GI___clone3/1. FDE : 27; symbol table : 71. Using max size.
BOLT-WARNING: sizes differ for function __GI___clone/1. FDE : 52; symbol table : 95. Using max size.
BOLT-WARNING: sizes differ for function clone. FDE : 52; symbol table : 95. Using max size.
BOLT-WARNING: sizes differ for function __clone. FDE : 52; symbol table : 95. Using max size.
BOLT-ERROR: function __restore_rt/1 is in conflict with FDE [9322f, 93239). Skipping.
BOLT-WARNING: sizes differ for function __setcontext/1. FDE : 332; symbol table : 352. Using max size.
BOLT-WARNING: sizes differ for function setcontext. FDE : 332; symbol table : 352. Using max size.
BOLT-WARNING: FDE [0x324e5, 0x324f6) conflicts with function __clone3/1(*3)
BOLT-WARNING: FDE [0x324f6, 0x32507) conflicts with function __clone3/1(*3)
BOLT-WARNING: FDE [0x3263e, 0x3264e) conflicts with function __GI___clone/1(*3)
BOLT-WARNING: FDE [0x3264e, 0x3265f) conflicts with function __GI___clone/1(*3)
BOLT-WARNING: FDE [0xe8eec, 0xe8f00) conflicts with function __setcontext/1(*2)
BOLT-ERROR: symbol seen in the middle of the function __BOLT_FDE_FUNCat9322f. Skipping.
BOLT-ERROR: cannot find BB containing branch destination.
=======================================
BOLT is unable to proceed because it couldn't properly understand this function.
If you are running the most recent version of BOLT, you may want to report this and paste this dump.
Please check that there is no sensitive contents being shared in this dump.

Offending function: ____longjmp_chk/1

Function contents (
  0000: F30F1EFA 4C8B4730 4C8B4F08 488B5738  |....L.G0L.O.H.W8|
  0010: 49C1C811 644C3304 25300000 0049C1C9  |I...dL3.%0...I..|
  0020: 11644C33 0C253000 000048C1 CA116448  |.dL3.%0...H...dH|
  0030: 33142530 0000004C 39C47651 4989FA89  |3.%0...L9.vQI...|
  0040: F331FF48 8D7424E8 B8830000 000F0585  |.1.H.t$.........|
  0050: C07535F7 4424F001 00000074 14488B44  |.u5.D$.....t.H.D|
  0060: 24E84803 4424F84C 29C0483B 4424F873  |$.H.D$.L).H;D$.s|
  0070: 174883EC 08488D3D 058D0E00 E88FEFFB  |.H...H.=........|
  0080: FF050505 05050505 4C89D789 DE64F704  |........L....d..|
  0090: 25480000 00020000 00745EF3 480F1EC8  |%H.......t^.H...|
  00A0: 4989C248 8B4F5848 29C8744D 4989CB48  |I..H.OXH).tMI..H|
  00B0: 8B59F848 83E3F848 39CB740B 4883E908  |.Y.H...H9.t.H...|
  00C0: 4939CA75 EAEB11F3 0F0169F8 F30F01EA  |I9.u......i.....|
  00D0: F3480F1E C84C29D8 48F7D848 C1E80348  |.H...L).H..H...H|
  00E0: 83C001BB FF000000 4839D848 0F42D8F3  |........H9.H.B..|
  00F0: 480FAEEB 4829D877 EF90488B 1F4C8B67  |H...H).w..H..L.g|
  0100: 104C8B6F 184C8B77 204C8B7F 2889F04C  |.L.o.L.w L..(..L|
  0110: 89C44C89 CD90FFE2 0F1F8400 00000000  |..L.............|
)

Binary Function "____longjmp_chk/1"  {
  Number      : 1596
  State       : disassembled
  Address     : 0x97450
  Size        : 0x118
  MaxSize     : 0x120
  Offset      : 0x97450
  Section     : .text
  Orc Section : .local.text.____longjmp_chk/1
  LSDA        : 0x0
  IsSimple    : 1
  IsMultiEntry: 0
  IsSplit     : 0
  BB Count    : 14
  CFI Instrs  : 16
}
DWARF CFI Instructions:
    0000003f:   OpRegister Reg5 Reg10
    00000041:   OpRegister Reg4 Reg3
    00000075:   OpRememberState
    00000075:   OpDefCfaOffset 16
    00000081:   OpRestoreState
    0000008b:   OpRestore Reg5
    0000008d:   OpRestore Reg4
    000000f9:   OpDefCfa Reg5 0
    000000f9:   OpRegister Reg7 Reg8
    000000f9:   OpRegister Reg6 Reg9
    000000f9:   OpRegister Reg16 Reg1
    000000f9:   OpOffset Reg3 0
    000000f9:   OpOffset Reg12 16
    000000f9:   OpOffset Reg13 24
    000000f9:   OpOffset Reg14 32
    000000f9:   OpOffset Reg15 40
End of Function "____longjmp_chk/1"

ERROR: disassembly failed - inconsistent branch found.
=======================================
LLVM ERROR: pthread_join failed: Resource deadlock avoided
 #0 0x000057fcfab0d5c0 (/usr/bin/llvm-bolt+0x16ce5c0)
 #1 0x000057fcfab0b240 (/usr/bin/llvm-bolt+0x16cc240)
 #2 0x000057fcfab0de7b (/usr/bin/llvm-bolt+0x16cee7b)
 #3 0x00007754026ed980 (/usr/bin/../lib/libc.so.6+0x39980)
 #4 0x000077540273b0ec (/usr/bin/../lib/libc.so.6+0x870ec)
 #5 0x00007754026ed8e4 __GI_raise (/usr/bin/../lib/libc.so.6+0x398e4)
 #6 0x00007754026d830b __GI_abort (/usr/bin/../lib/libc.so.6+0x2430b)
 #7 0x000057fcfaab9f3c (/usr/bin/llvm-bolt+0x167af3c)
 #8 0x000057fcfab0eb05 (/usr/bin/llvm-bolt+0x16cfb05)
 #9 0x000057fcfab0eb31 (/usr/bin/llvm-bolt+0x16cfb31)
#10 0x000057fcfabe56d6 (/usr/bin/llvm-bolt+0x17a66d6)
#11 0x000057fcfb13625c (/usr/bin/llvm-bolt+0x1cf725c)
#12 0x00007754026ef924 __run_exit_handlers (/usr/bin/../lib/libc.so.6+0x3b924)
#13 0x00007754026efa6a (/usr/bin/../lib/libc.so.6+0x3ba6a)
#14 0x000057fcfb0d32aa (/usr/bin/llvm-bolt+0x1c942aa)
#15 0x000057fcfb0f4716 (/usr/bin/llvm-bolt+0x1cb5716)
#16 0x000057fcfabb7871 (/usr/bin/llvm-bolt+0x1778871)
#17 0x000057fcfb13715f (/usr/bin/llvm-bolt+0x1cf815f)
#18 0x000057fcfabdccb6 (/usr/bin/llvm-bolt+0x179dcb6)
#19 0x000057fcfabe514f (/usr/bin/llvm-bolt+0x17a614f)
#20 0x000057fcfabe5b3b (/usr/bin/llvm-bolt+0x17a6b3b)
#21 0x00007754027394ca (/usr/bin/../lib/libc.so.6+0x854ca)
#22 0x00007754027b0e08 __GI___clone3 (/usr/bin/../lib/libc.so.6+0xfce08)
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.

romanovj commented 6 months ago

after editing glibc/sysdeps/x86_64/longjmp.S (replacing asm code for longjmp with nop)

llvm-bolt libc.so -o libc.so.bolt --instrument --instrumentation-file=/tmp/libc.so --instrumentation-file-append-pid

BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 8c3b2f419a0eb5d5b4702568beee561b740e0b08
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0x200000, offset 0x200000
BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
BOLT-INFO: enabling relocation mode
BOLT-INFO: forcing -jump-tables=move for instrumentation
BOLT-INFO: enabling -align-macro-fusion=all since no profile was specified
BOLT-INFO: enabling lite mode
BOLT-ERROR: function __restore_rt/1 is in conflict with FDE [3b19f, 3b1a9). Skipping.
BOLT-WARNING: sizes differ for function __setcontext/1. FDE : 332; symbol table : 352. Using max size.
BOLT-WARNING: sizes differ for function setcontext. FDE : 332; symbol table : 352. Using max size.
BOLT-WARNING: sizes differ for function __GI___clone/1. FDE : 52; symbol table : 95. Using max size.
BOLT-WARNING: sizes differ for function clone. FDE : 52; symbol table : 95. Using max size.
BOLT-WARNING: sizes differ for function __clone. FDE : 52; symbol table : 95. Using max size.
BOLT-WARNING: sizes differ for function __clone3/1. FDE : 27; symbol table : 71. Using max size.
BOLT-WARNING: sizes differ for function clone3/1. FDE : 27; symbol table : 71. Using max size.
BOLT-WARNING: sizes differ for function __GI___clone3/1. FDE : 27; symbol table : 71. Using max size.
BOLT-WARNING: FDE [0x4000c, 0x40020) conflicts with function __setcontext/1(*2)
BOLT-WARNING: FDE [0x106ece, 0x106ede) conflicts with function __GI___clone/1(*3)
BOLT-WARNING: FDE [0x106ede, 0x106eef) conflicts with function __GI___clone/1(*3)
BOLT-WARNING: FDE [0x107055, 0x107066) conflicts with function __clone3/1(*3)
BOLT-WARNING: FDE [0x107066, 0x107077) conflicts with function __clone3/1(*3)
BOLT-ERROR: symbol seen in the middle of the function __BOLT_FDE_FUNCat3b19f. Skipping.
BOLT-INFO: 0 out of 3757 functions in the binary (0.0%) have non-empty execution profile
BOLT-INFO: the input contains 468 (dynamic count : 0) opportunities for macro-fusion optimization that are going to be fixed
BOLT-INSTRUMENTER: Number of indirect call site descriptors: 2833
BOLT-INSTRUMENTER: Number of indirect call target descriptors: 3733
BOLT-INSTRUMENTER: Number of function descriptors: 3691
BOLT-INSTRUMENTER: Number of branch counters: 40615
BOLT-INSTRUMENTER: Number of ST leaf node counters: 28597
BOLT-INSTRUMENTER: Number of direct call counters: 352
BOLT-INSTRUMENTER: Total number of counters: 69564
BOLT-INSTRUMENTER: Total size of counters: 556512 bytes (static alloc memory)
BOLT-INSTRUMENTER: Total size of string table emitted: 71639 bytes in file
BOLT-INSTRUMENTER: Total size of descriptors: 3586856 bytes in file
BOLT-INSTRUMENTER: Profile will be saved to file /tmp/libc.so
BOLT-INFO: 27956 instructions were shortened
BOLT-INFO: removed 551 empty blocks
BOLT-INFO: merged 3 duplicate CFG edges
BOLT-INFO: removed 1 'repz' prefixes with estimated execution count of 0 times.
BOLT-INFO: UCE removed 8425 blocks and 550438 bytes of code
BOLT-INFO: padding code to 0x800000 to accommodate hot text
BOLT-INFO: output linked against instrumentation runtime library, lib entry point is 0x8dd9d0
BOLT-INFO: clear procedure is 0x8d9350
BOLT-INFO: setting __bolt_runtime_start to 0x8dd980
BOLT-INFO: setting __bolt_runtime_fini to 0x8dd9d0
BOLT-INFO: setting __hot_start to 0x400000
BOLT-INFO: setting __hot_end to 0x702e88
BOLT-INFO: patched build-id (flipped last bit)
BOLT-ERROR: Offset overflow for dynamic relocation

libc.so.gz

ms178 commented 6 months ago

@romanovj I am just an interested user with no programming skills, so I might not be of much help in debugging this further; I'll leave that to the experts.

But you mentioned using GCC 13.2.1, llvm-bolt 18.1.1 and glibc 2.39. You could try to get further with the azanella/clang branch mentioned above. If you are on Arch Linux, the source section in the PKGBUILD needs to be changed to: source=("git+https://sourceware.org/git/glibc.git#branch=azanella/clang"

With that tree you could even try to use Clang as the main compiler. But be aware that there are still some issues with that branch.

aaupov commented 6 months ago

I would suggest using the bughunter script to try to track down the function that has this dynamic relocation (see example in https://llvm.org/devmtg/2024-03/slides/practical-use-of-bolt.pdf). If successful, you can skip that function with -skip-funcs=<funcname>.*.

If bughunter in function searching mode doesn't help, you can try -max-data-relocations, but I'm not entirely sure if that dynamic relocation is for code or data. But please check and let us know.
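
The `-skip-funcs` suggestion above can be sketched as a small command builder. `bolt_skip_cmd` is a hypothetical helper, not part of BOLT; it only prints the command line so the result can be reviewed before running it, and it assumes the option accepts a comma-separated list of name patterns.

```shell
# Hypothetical helper: print an llvm-bolt command that skips the given functions.
# $1 = input binary, $2 = output binary, remaining args = function names.
bolt_skip_cmd() {
    input=$1
    output=$2
    shift 2
    # Join the names with commas, appending ".*" to each so that local
    # aliases such as "name/1" are covered as well.
    list=
    for f in "$@"; do
        list="${list:+$list,}$f.*"
    done
    printf 'llvm-bolt %s -o %s -skip-funcs=%s\n' "$input" "$output" "$list"
}

# Example: bolt_skip_cmd libc.so libc.so.bolt ____longjmp_chk
```

Any flags needed for the actual run (for example the instrumentation options used earlier in the thread) would have to be added to the printed command by hand.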

romanovj commented 6 months ago

@aaupov Nothing was found with bughunter.

This didn't help either: -max-funcs=0 -max-data-relocations=0

zatrazz commented 6 months ago

I guess you use the highly experimental azanella/clang branch of glibc that works towards compatibility with clang. I've tested that some weeks ago and found some bugs when using it, it isn't there just yet.

I would be interested to know what kind of bugs you have found using my branch. Are these bolt, clang, or glibc related?

I can't tell myself if the issue you described here is a Clang, a BOLT or a Glibc problem, but maybe @zatrazz can; I pinged him for awareness, and he might have tried using BOLT on Glibc himself.

I haven't tried to use it yet, and I am not sure how to interpret the potential issues the log is showing. The 'Offset overflow for dynamic relocation' error for __longjmp.S should not be affected by the compiler, so I am not sure what kind of error BOLT is reporting here. We did have some tests that stress the symbols and the dynamic loader to catch potential overflows in dynamic relocations.

The reporter seems to be using a quite recent glibc version; is he using the new '-z mark-plt'? I recall that we recently fixed some displacement overflow issues, albeit only for x32.

ms178 commented 6 months ago

I guess you use the highly experimental azanella/clang branch of glibc that works towards compatibility with clang. I've tested that some weeks ago and found some bugs when using it, it isn't there just yet.

I would be interested to know what kind of bugs you have found using my branch. Are these bolt, clang, or glibc related?

I will try to reproduce it with a recent snapshot. It could have been a mold issue actually: https://github.com/rui314/mold/issues/1213. I also use some Clear Linux and Mandriva patches on top, and one patch did not apply cleanly on your glibc-git based branch, which might have caused some differences.

ms178 commented 6 months ago

Indeed, it seems to be a mold 2.4.1 specific issue, with a clang-compiled glibc, I saw:

mold: warning: /usr/lib/crt1.o: ignoring .llvm_addrsig section without sh_link; was the file processed by strip or objcopy -r?
mold: warning: /usr/lib/libc_nonshared.a(atexit.oS): ignoring .llvm_addrsig section without sh_link; was the file processed by strip or objcopy -r?
mold: warning: /usr/lib/libc_nonshared.a(pthread_atfork.oS): ignoring .llvm_addrsig section without sh_link; was the file processed by strip or objcopy -r?
mold: warning: /usr/lib/libc_nonshared.a(stack_chk_fail_local.oS): ignoring .llvm_addrsig section without sh_link; was the file processed by strip or objcopy -r?
mold: warning: /usr/lib/libc_nonshared.a(at_quick_exit.oS): ignoring .llvm_addrsig section without sh_link; was the file processed by strip or objcopy -r?

But with mold 2.30 these warnings were silenced and don't occur any longer.

The other bugs which I've noticed were build issues due to me using some fancy compiler flags with LLVM/Clang. If you want I could list some of the offending flags here (as I don't have an account for glibc bugzilla).

zatrazz commented 6 months ago

Right, I don't have mold in my loop; I constantly test my branch with binutils and lld. And yes, it would be helpful to know which flags might interfere with the glibc build. Keep in mind that we have some strict compiler flag requirements, and we use/filter some flags depending on the TU (for instance, -fstack-protector, where some loader TUs cannot be built with it because the SSP cookie is not initialized yet).

ms178 commented 6 months ago

@zatrazz Here are some examples: -fno-semantic-interposition and -Wl,-Bsymbolic-functions lead to segfaults and runtime issues but that's also the case when using GCC.

-fdata-sections -ffunction-sections and -Wl,--gc-sections on the linker side lead to an error during the configure stage: "configure: error: --enable-multi-arch support requires assembler and linker support". Here is a link to the full glibc configuration that I use (with some OpenMandriva and Clear Linux patches on top): https://github.com/ms178/archpkgbuilds/blob/main/toolchain-stable/glibc/PKGBUILD.clang

The following flags...

export CC=clang
export CXX=clang++
export CC_LD=lld
export CXX_LD=lld
export AR=llvm-ar
export NM=llvm-nm
export STRIP=llvm-strip
export OBJCOPY=llvm-objcopy
export OBJDUMP=llvm-objdump
export READELF=llvm-readelf
export RANLIB=llvm-ranlib
export HOSTCC=clang
export HOSTCXX=clang++
export HOSTAR=llvm-ar
export CPPFLAGS="-D_FORTIFY_SOURCE=0"
export CFLAGS="-O3 -march=native -mtune=native -mllvm -inline-threshold=1500 -mllvm -extra-vectorizer-passes -mllvm -enable-cond-stores-vec -mllvm -slp-vectorize-hor-store -mllvm -enable-loopinterchange -mllvm -enable-loop-distribute -mllvm -enable-unroll-and-jam -mllvm -enable-loop-flatten -mllvm -unroll-runtime-multi-exit -mllvm -aggressive-ext-opt -mllvm -enable-interleaved-mem-accesses -mllvm -enable-masked-interleaved-mem-accesses -fno-math-errno -fno-trapping-math -falign-functions=32 -funroll-loops -fcf-protection=none -mharden-sls=none -fomit-frame-pointer -mprefer-vector-width=256 -mllvm -adce-remove-loops -mllvm -enable-ext-tsp-block-placement -mllvm -enable-gvn-hoist -mllvm -enable-dfa-jump-thread -Wno-error -ffp-contract=fast -fsplit-machine-functions -fgnuc-version=6.5.0 -w"
export CXXFLAGS="${CFLAGS} -Wp,-U_GLIBCXX_ASSERTIONS"
export LDFLAGS="-Wl,--lto-CGO3 -Wl,--icf=all -Wl,--lto-O3,-O3,--as-needed -fcf-protection=none -mharden-sls=none -Wl,-mllvm -Wl,-extra-vectorizer-passes -Wl,-mllvm -Wl,-enable-cond-stores-vec -Wl,-mllvm -Wl,-slp-vectorize-hor-store -Wl,-mllvm -Wl,-enable-loopinterchange -Wl,-mllvm -Wl,-enable-loop-distribute -Wl,-mllvm -Wl,-enable-unroll-and-jam -Wl,-mllvm -Wl,-enable-loop-flatten -Wl,-mllvm -Wl,-unroll-runtime-multi-exit -Wl,-mllvm -Wl,-aggressive-ext-opt -Wl,-mllvm -Wl,-enable-interleaved-mem-accesses -Wl,-mllvm -Wl,-enable-masked-interleaved-mem-accesses -march=native -maes -mbmi2 -mpclmul -fuse-ld=lld -Wl,-zmax-page-size=0x200000 -Wl,-mllvm -Wl,-adce-remove-loops -Wl,-mllvm -Wl,-enable-ext-tsp-block-placement -Wl,-mllvm -Wl,-enable-gvn-hoist -Wl,-mllvm -Wl,-enable-dfa-jump-thread -Wl,--push-state -Wl,-whole-archive -ljemalloc_pic -Wl,--pop-state -lpthread -lstdc++ -lm -ldl"
export CCLDFLAGS="$LDFLAGS"
export CXXLDFLAGS="$LDFLAGS"
export ASFLAGS="-D__AVX__=1 -D__AVX2__=1 -D__FMA__=1"

... lead to a configure warning:

checking for assembler and linker STT_GNU_IFUNC support... llvm-readelf: warning: 'conftest': unable to parse DT_JMPREL: virtual address is not in any segment: 0x0
llvm-readelf: warning: 'conftest': unable to parse DT_JMPREL: virtual address is not in any segment: 0x0
yes

On the other hand, I am able to use fairly aggressive flags for the math part (see https://github.com/ms178/archpkgbuilds/blob/main/toolchain-stable/glibc/mathlto.patch.clang) and the following set of flags in /etc/makepkg.conf:

export CC=clang
export CXX=clang++
export CC_LD=lld
export CXX_LD=lld
export AR=llvm-ar
export NM=llvm-nm
export STRIP=llvm-strip
export OBJCOPY=llvm-objcopy
export OBJDUMP=llvm-objdump
export READELF=llvm-readelf
export RANLIB=llvm-ranlib
export HOSTCC=clang
export HOSTCXX=clang++
export HOSTAR=llvm-ar
export CPPFLAGS="-D_FORTIFY_SOURCE=0"
export CFLAGS="-O3 -march=native -mtune=native -maes -mbmi2 -mpclmul -mllvm -inline-threshold=1500 -mllvm -extra-vectorizer-passes -mllvm -enable-cond-stores-vec -mllvm -slp-vectorize-hor-store -mllvm -enable-loopinterchange -mllvm -enable-loop-distribute -mllvm -enable-unroll-and-jam -mllvm -enable-loop-flatten -mllvm -unroll-runtime-multi-exit -mllvm -aggressive-ext-opt -mllvm -enable-interleaved-mem-accesses -mllvm -enable-masked-interleaved-mem-accesses -fno-math-errno -fno-trapping-math -falign-functions=32 -funroll-loops -fomit-frame-pointer -mprefer-vector-width=256 -mllvm -adce-remove-loops -mllvm -enable-ext-tsp-block-placement -mllvm -enable-gvn-hoist -mllvm -enable-dfa-jump-thread -fcf-protection=none -mharden-sls=none -fgnuc-version=6.5.0"
export CXXFLAGS="${CFLAGS} -Wp,-U_GLIBCXX_ASSERTIONS"
export LDFLAGS="-Wl,-O3,--as-needed -Wl,-mllvm -Wl,-extra-vectorizer-passes -Wl,-mllvm -Wl,-enable-cond-stores-vec -Wl,-mllvm -Wl,-slp-vectorize-hor-store -Wl,-mllvm -Wl,-enable-loopinterchange -Wl,-mllvm -Wl,-enable-loop-distribute -Wl,-mllvm -Wl,-enable-unroll-and-jam -Wl,-mllvm -Wl,-enable-loop-flatten -Wl,-mllvm -Wl,-unroll-runtime-multi-exit -Wl,-mllvm -Wl,-aggressive-ext-opt -Wl,-mllvm -Wl,-enable-interleaved-mem-accesses -Wl,-mllvm -Wl,-enable-masked-interleaved-mem-accesses -march=native -maes -mbmi2 -mpclmul -fuse-ld=lld -Wl,-zmax-page-size=0x200000 -Wl,-mllvm -Wl,-adce-remove-loops -Wl,-mllvm -Wl,-enable-ext-tsp-block-placement -Wl,-mllvm -Wl,-enable-gvn-hoist -Wl,-mllvm -Wl,-enable-dfa-jump-thread -Wl,--undefined-version -fcf-protection=none -mharden-sls=none"
export CCLDFLAGS="$LDFLAGS"
export CXXLDFLAGS="$LDFLAGS"
export ASFLAGS="-D__AVX__=1 -D__AVX2__=1 -D__FMA__=1"

zatrazz commented 6 months ago

@zatrazz Here are some examples: -fno-semantic-interposition and -Wl,-Bsymbolic-functions lead to segfaults and runtime issues but that's also the case when using GCC.

Thanks, it is a really interesting testcase you have here.

There is no need to use -fno-semantic-interposition or -Wl,-Bsymbolic-functions: glibc takes care not to add intra PLT calls, using a set of internal tricks (the hidden_proto/hidden_def macros), and it also has regression tests to check for the unexpected cases. In fact, I think this would be wrong because it would require adding a dynamic symbol file to export some symbols that are expected to be called through the PLT (like malloc, matherr, and __tls_get_addr).

-fdata-sections -ffunction-sections and -Wl,--gc-sections on the linker side lead to an error during the configure stage: "configure: error: --enable-multi-arch support requires assembler and linker support". Here is a link to the full glibc configuration that I use (with some OpenMandriva and Clear Linux patches on top): https://github.com/ms178/archpkgbuilds/blob/main/toolchain-stable/glibc/PKGBUILD.clang

I think the problem is passing such options through $CC and not through $CFLAGS. Using them in CFLAGS/CXXFLAGS, I could build with both gcc and clang without any issue (you will need to pass an optimization level though, due to the loader bootstrap limitation).

Also, '--without-cvs', '--disable-dependency-tracking', '--disable-silent-rules', '--enable-omitfp', '--enable-nss-crypt', and '--disable-sanity-checks' are outdated/nonexistent options.
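
A minimal sketch of what this suggestion seems to amount to: keep the compiler driver variables bare and carry the section flags, together with an optimization level, in CFLAGS/CXXFLAGS. The exact flag set is an assumption taken from the flags quoted above, not a tested glibc configuration.

```shell
# Compiler drivers only; no code-generation options here.
CC="clang"
CXX="clang++"

# Section flags live in CFLAGS/CXXFLAGS, with an explicit optimization
# level as the comment above says is required.
CFLAGS="-O2 -ffunction-sections -fdata-sections"
CXXFLAGS="$CFLAGS"

export CC CXX CFLAGS CXXFLAGS
```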

The following flags...


export CC=clang
export CXX=clang++
export CC_LD=lld
export CXX_LD=lld
export AR=llvm-ar
export NM=llvm-nm
export STRIP=llvm-strip
export OBJCOPY=llvm-objcopy
export OBJDUMP=llvm-objdump
export READELF=llvm-readelf
export RANLIB=llvm-ranlib
export HOSTCC=clang
export HOSTCXX=clang++
export HOSTAR=llvm-ar
export CPPFLAGS="-D_FORTIFY_SOURCE=0"
export CFLAGS="-O3 -march=native -mtune=native -mllvm -inline-threshold=1500 -mllvm -extra-vectorizer-passes -mllvm -enable-cond-stores-vec -mllvm -slp-vectorize-hor-store -mllvm -enable-loopinterchange -mllvm -enable-loop-distribute -mllvm -enable-unroll-and-jam -mllvm -enable-loop-flatten -mllvm -unroll-runtime-multi-exit -mllvm -aggressive-ext-opt -mllvm -enable-interleaved-mem-accesses -mllvm -enable-masked-interleaved-mem-accesses -fno-math-errno -fno-trapping-math -falign-functions=32 -funroll-loops -fcf-protection=none -mharden-sls=none -fomit-frame-pointer -mprefer-vector-width=256 -mllvm -adce-remove-loops -mllvm -enable-ext-tsp-block-placement -mllvm -enable-gvn-hoist -mllvm -enable-dfa-jump-thread -Wno-error -ffp-contract=fast -fsplit-machine-functions -fgnuc-version=6.5.0 -w"
export CXXFLAGS="${CFLAGS} -Wp,-U_GLIBCXX_ASSERTIONS"
export LDFLAGS="-Wl,--lto-CGO3 -Wl,--icf=all -Wl,--lto-O3,-O3,--as-needed -fcf-protection=none -mharden-sls=none -Wl,-mllvm -Wl,-extra-vectorizer-passes -Wl,-mllvm -Wl,-enable-cond-stores-vec -Wl,-mllvm -Wl,-slp-vectorize-hor-store -Wl,-mllvm -Wl,-enable-loopinterchange -Wl,-mllvm -Wl,-enable-loop-distribute -Wl,-mllvm -Wl,-enable-unroll-and-jam -Wl,-mllvm -Wl,-enable-loop-flatten -Wl,-mllvm -Wl,-unroll-runtime-multi-exit -Wl,-mllvm -Wl,-aggressive-ext-opt -Wl,-mllvm -Wl,-enable-interleaved-mem-accesses -Wl,-mllvm -Wl,-enable-masked-interleaved-mem-accesses -march=native -maes -mbmi2 -mpclmul -fuse-ld=lld -Wl,-zmax-page-size=0x200000 -Wl,-mllvm -Wl,-adce-remove-loops -Wl,-mllvm -Wl,-enable-ext-tsp-block-placement -Wl,-mllvm -Wl,-enable-gvn-hoist -Wl,-mllvm -Wl,-enable-dfa-jump-thread -Wl,--push-state -Wl,-whole-archive -ljemalloc_pic -Wl,--pop-state -lpthread -lstdc++ -lm -ldl"

Some options are not really tested, but for most I wouldn't expect failures if the compiler does not change the ABI (such as -Wl,-slp-vectorize-hor-store). However, some of them I don't expect to be supported, not without a lot of hacks, such as LTO (https://sourceware.org/bugzilla/show_bug.cgi?id=15658); or adding a malloc implementation with a specific ABI (the jemalloc_pic) along with statically linking libc with libstdc++.

Also, the math library is built and tested with some specific math flags (-ffp-contract=fast/-fno-trapping-math are not supported and might lead to a lot of regressions in testing).

Could you check with a more restricted CFLAGS to narrow down the required support to enable BOLT? Trying to support such an extensive flag selection might require a lot of extra unrelated work.

export CCLDFLAGS="$LDFLAGS"
export CXXLDFLAGS="$LDFLAGS"
export ASFLAGS="-D__AVX__=1 -D__AVX2__=1 -D__FMA__=1"


... lead to a configure warning:

checking for assembler and linker STT_GNU_IFUNC support... llvm-readelf: warning: 'conftest': unable to parse DT_JMPREL: virtual address is not in any segment: 0x0
llvm-readelf: warning: 'conftest': unable to parse DT_JMPREL: virtual address is not in any segment: 0x0
yes


On the other hand, I am able to use fairly aggressive flags for the math part (see https://github.com/ms178/archpkgbuilds/blob/main/toolchain-stable/glibc/mathlto.patch.clang) and the following set of flags in `/etc/makepkg.conf`:

export CC=clang
export CXX=clang++
export CC_LD=lld
export CXX_LD=lld
export AR=llvm-ar
export NM=llvm-nm
export STRIP=llvm-strip
export OBJCOPY=llvm-objcopy
export OBJDUMP=llvm-objdump
export READELF=llvm-readelf
export RANLIB=llvm-ranlib
export HOSTCC=clang
export HOSTCXX=clang++
export HOSTAR=llvm-ar
export CPPFLAGS="-D_FORTIFY_SOURCE=0"
export CFLAGS="-O3 -march=native -mtune=native -maes -mbmi2 -mpclmul -mllvm -inline-threshold=1500 -mllvm -extra-vectorizer-passes -mllvm -enable-cond-stores-vec -mllvm -slp-vectorize-hor-store -mllvm -enable-loopinterchange -mllvm -enable-loop-distribute -mllvm -enable-unroll-and-jam -mllvm -enable-loop-flatten -mllvm -unroll-runtime-multi-exit -mllvm -aggressive-ext-opt -mllvm -enable-interleaved-mem-accesses -mllvm -enable-masked-interleaved-mem-accesses -fno-math-errno -fno-trapping-math -falign-functions=32 -funroll-loops -fomit-frame-pointer -mprefer-vector-width=256 -mllvm -adce-remove-loops -mllvm -enable-ext-tsp-block-placement -mllvm -enable-gvn-hoist -mllvm -enable-dfa-jump-thread -fcf-protection=none -mharden-sls=none -fgnuc-version=6.5.0"
export CXXFLAGS="${CFLAGS} -Wp,-U_GLIBCXX_ASSERTIONS"
export LDFLAGS="-Wl,-O3,--as-needed -Wl,-mllvm -Wl,-extra-vectorizer-passes -Wl,-mllvm -Wl,-enable-cond-stores-vec -Wl,-mllvm -Wl,-slp-vectorize-hor-store -Wl,-mllvm -Wl,-enable-loopinterchange -Wl,-mllvm -Wl,-enable-loop-distribute -Wl,-mllvm -Wl,-enable-unroll-and-jam -Wl,-mllvm -Wl,-enable-loop-flatten -Wl,-mllvm -Wl,-unroll-runtime-multi-exit -Wl,-mllvm -Wl,-aggressive-ext-opt -Wl,-mllvm -Wl,-enable-interleaved-mem-accesses -Wl,-mllvm -Wl,-enable-masked-interleaved-mem-accesses -march=native -maes -mbmi2 -mpclmul -fuse-ld=lld -Wl,-zmax-page-size=0x200000 -Wl,-mllvm -Wl,-adce-remove-loops -Wl,-mllvm -Wl,-enable-ext-tsp-block-placement -Wl,-mllvm -Wl,-enable-gvn-hoist -Wl,-mllvm -Wl,-enable-dfa-jump-thread -Wl,--undefined-version -fcf-protection=none -mharden-sls=none"
export CCLDFLAGS="$LDFLAGS"
export CXXLDFLAGS="$LDFLAGS"
export ASFLAGS="-D__AVX__=1 -D__AVX2__=1 -D__FMA__=1"

Yeah, the math library is more straightforward, since its code uses fewer glibc-internal tricks to support glibc-specific cases (such as internal aliases for PLT avoidance, bootstrap code for the loader, etc.).

ms178 commented 6 months ago

@zatrazz Thanks a lot for your insights!

I think the problem is passing such options through $CC rather than through $CFLAGS. With them in CFLAGS/CXXFLAGS I could build with both gcc and clang without any issue (you will need to pass an optimization level though, due to the loader bootstrap limitation).
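
A minimal sketch of that suggestion, for use at the top of a PKGBUILD `build()` or in a plain shell. The concrete flags are illustrative assumptions, not a set that was tested in this thread; the only points taken from the advice above are that `CC`/`CXX` stay bare and that `CFLAGS` carries an optimization level.

```shell
# Sketch: compilers stay bare, every option travels in the flag variables.
# The flag values below are illustrative assumptions.
export CC=clang
export CXX=clang++
export CFLAGS="-O2 -ffunction-sections -fdata-sections"   # an -O level is required
export CXXFLAGS="$CFLAGS"

# Print what configure will see; handy for spotting values that makepkg
# injected behind your back.
printf 'CC=%s\nCXX=%s\nCFLAGS=%s\nCXXFLAGS=%s\n' "$CC" "$CXX" "$CFLAGS" "$CXXFLAGS"
```

Printing the variables right before calling `configure` makes it easier to tell whether the environment really matches what the PKGBUILD intended.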

Could you please guide me on how to change that? makepkg might set some variables in the background that I haven't thought about yet. I've tried ignoring all of the flags in /etc/makepkg.conf and setting CFLAGS/CXXFLAGS via the PKGBUILD, but that doesn't change the outcome when using -ffunction-sections and related flags.

@romanovj Pardon me for hijacking the thread with some issues of my own. I'd be interested in replicating your issue with BOLT. Do you have a PKGBUILD or a glibc-specific step-by-step guide which I could follow? Do you gather profiles with specific workloads to ensure good profile quality?

romanovj commented 6 months ago

Do you have a PKGBUILD or a glibc-specific step-by-step guide which I could follow?

https://gitlab.archlinux.org/archlinux/packaging/packages/glibc

also minimal config:

 ../glibc/configure \
  --prefix=/root/workdir/install \
  --host=x86_64-linux-gnu \
  --build=x86_64-linux-gnu \
  CC="gcc -m64" \
  CXX="g++ -m64" \
  CFLAGS="-O2" \
  CXXFLAGS="-O2"

With gcc or clang (azanella/clang)

Plus -Wl,--emit-relocs and -fno-reorder-blocks-and-partition only for GCC
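
As a rough sketch, the flag selection could be wrapped like this. The helper names `bolt_cflags` and `bolt_ldflags` are hypothetical, the sketch reads the line above as "`-fno-reorder-blocks-and-partition` is the GCC-only part", and passing the linker option through `LDFLAGS` is an assumption to verify against your glibc version.

```shell
# Hypothetical helper: print the compiler flags for a BOLT-friendly build.
# $1 is the compiler command, e.g. "gcc -m64" or "clang".
bolt_cflags() {
    case "$1" in
        *gcc*) printf '%s\n' "-O2 -fno-reorder-blocks-and-partition" ;;
        *)     printf '%s\n' "-O2" ;;
    esac
}

# Hypothetical helper: the linker flag is the same for both compilers.
bolt_ldflags() {
    printf '%s\n' "-Wl,--emit-relocs"
}

# Usage (not executed here):
#   ../glibc/configure --prefix=/root/workdir/install \
#       CC="gcc -m64" CFLAGS="$(bolt_cflags "gcc -m64")" LDFLAGS="$(bolt_ldflags)"
```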

ms178 commented 6 months ago

@romanovj Thanks, but don't you need to instrument with BOLT and gather profiles first?

At least that's common practice, e.g. with LLVM/Clang: https://github.com/ms178/archpkgbuilds/blob/main/toolchain-experimental/llvm-bolt-scripts-master/build_stage3-bolt-without-sampling.bash

romanovj commented 6 months ago

@ms178 can't add instrumentation

llvm-bolt libc.so -o libc.so.bolt --instrument --instrumentation-file=/tmp/libc.so --instrumentation-file-append-pid 
......
BOLT-ERROR: Offset overflow for dynamic relocation
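
For the case where instrumentation does succeed: with --instrumentation-file-append-pid every process writes its own profile, so the files have to be merged before they can be handed to llvm-bolt via --data. A sketch, assuming the per-process files all start with the prefix given to --instrumentation-file and that `merge-fdata` (shipped with BOLT) is on PATH; `merge_profiles` is a hypothetical helper name.

```shell
# Hypothetical helper: merge every per-process profile whose name starts with
# the given prefix into a single .fdata file.
# $1 = profile prefix (the value passed to --instrumentation-file)
# $2 = merged output file
merge_profiles() {
    prefix=$1
    out=$2
    set --                       # reuse the positional parameters as a file list
    for f in "$prefix"*; do
        if [ -f "$f" ]; then
            set -- "$@" "$f"
        fi
    done
    if [ "$#" -eq 0 ]; then
        echo "no profiles found for prefix $prefix" >&2
        return 1
    fi
    merge-fdata "$@" > "$out"
}

# Usage (not executed here):
#   merge_profiles /tmp/libc.so /tmp/libc.merged.fdata
#   llvm-bolt libc.so -o libc.so.bolt --data /tmp/libc.merged.fdata
```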

ms178 commented 6 months ago

@romanovj I'm afraid it is a bit more complicated than that.

Here is my second non-working attempt for a PKGBUILD (you can delete the custom patches that I apply):

# Maintainer: Marcus Seyfarth <marcus85@gmx.de>

pkgbase=glibc
pkgname=(glibc lib32-glibc)
pkgver=2.39
pkgrel=16.1
pkgdesc='GNU C Library'
arch=('x86_64')
url='https://www.gnu.org/software/libc'
license=('GPL' 'LGPL')
depends=('linux-api-headers' 'tzdata')
makedepends=('git' 'gd' 'python' 'lib32-gcc-libs')
optdepends=('perl: for mtrace'
            'gd: graph image generation with memusage')
backup=(etc/gai.conf
        etc/locale.gen
        etc/nscd.conf)
options=('staticlibs' '!lto' 'buildflags')
install=glibc.install
source=("git+https://sourceware.org/git/glibc.git#branch=azanella/clang"
        locale-gen
        locale.gen.txt
        lib32-glibc.conf
        malloc_tune.patch
        #mathlto.patch
        tzselect-proper-zone-file.patch
        04-mandriva-va_args.patch
        05-mandriva-zstdcompressedlocals.patch
        06-mandriva-nss-crash.patch
        07-mandriva-nostrictaliasing.patch
        nptl.patch
        )
sha256sums=('SKIP'
            )

prepare() {
    mkdir -p glibc-build lib32-glibc-build

    [[ -d glibc-$pkgver ]] && ln -s glibc-$pkgver glibc

    local src
    for src in "${source[@]}"; do
        src="${src%%::*}"
        src="${src##*/}"
        [[ $src = *.patch ]] || continue
        echo "Applying patch $src..."
        patch --directory="glibc" --forward --strip=1 < "$src"
    done
}

build() {
    cd "$srcdir/glibc-build"

    echo "slibdir=/usr/lib" >> configparms
    echo "rtlddir=/usr/lib" >> configparms
    echo "sbindir=/usr/bin" >> configparms
    echo "rootsbindir=/usr/bin" >> configparms

    CFLAGS=${CFLAGS/-Wp,-D_FORTIFY_SOURCE=2/}

    "$srcdir/glibc/configure" \
        --prefix=/usr \
        --libdir=/usr/lib \
        --libexecdir=/usr/lib \
        --with-headers=/usr/include \
        --disable-bind-now \
        --without-selinux \
        --disable-fortify-source \
        --disable-systemtap \
        --disable-cet \
        --enable-kernel=6.8.1 \
        --enable-multi-arch \
        --disable-profile \
        --disable-crypt \
        --disable-werror

    echo "build-programs=no" >> configparms
    make -O

    sed -i "/build-programs=/s#no#yes#" configparms
    echo "CFLAGS += -Wp,-D_FORTIFY_SOURCE=0" >> configparms
    make -O

    # Instrument Glibc with BOLT
    echo "Instrumenting Glibc with BOLT"
    # The profile and the optimized binary are written here, so the directory must exist
    mkdir -p "$srcdir/glibc-build/bolt-output"
    llvm-bolt --lite=false \
        --instrument \
        --instrumentation-file-append-pid \
        --instrumentation-file="$srcdir/glibc-build/bolt-output/libc.so.fdata" \
        "$srcdir/glibc-build/libc.so" \
        -o "$srcdir/glibc-build/libc.so.inst"

    echo "Moving instrumented Glibc binary"
    mv "$srcdir/glibc-build/libc.so" "$srcdir/glibc-build/libc.so.org"
    mv "$srcdir/glibc-build/libc.so.inst" "$srcdir/glibc-build/libc.so"

    # Gather profiles with the Glibc test suite
    make check

    # Optimize Glibc with BOLT using the collected profile
    echo "Optimizing Glibc with BOLT"
    llvm-bolt -o "$srcdir/glibc-build/bolt-output/libc.so" \
        --data "$srcdir/glibc-build/bolt-output/libc.so.fdata" \
        "$srcdir/glibc-build/libc.so.org" \
        -reorder-blocks=ext-tsp \
        -reorder-functions=cdsort \
        -split-functions \
        -split-all-cold \
        -split-eh \
        -dyno-stats \
        -icf=1 \
        -lite=0

    echo "Replacing original Glibc binary with optimized one"
    mv "$srcdir/glibc-build/libc.so" "$srcdir/glibc-build/libc.so.orig"
    mv "$srcdir/glibc-build/bolt-output/libc.so" "$srcdir/glibc-build/libc.so"

    cd "$srcdir/lib32-glibc-build"
    export CC="clang -m32 -mfpmath=sse -mstackrealign"
    export CXX="clang++ -m32 -mfpmath=sse -mstackrealign"

    echo "slibdir=/usr/lib32" >> configparms
    echo "rtlddir=/usr/lib32" >> configparms
    echo "sbindir=/usr/bin" >> configparms
    echo "rootsbindir=/usr/bin" >> configparms

    "$srcdir/glibc/configure" \
        --host=i686-pc-linux-gnu \
        --prefix=/usr \
        --libdir=/usr/lib32 \
        --libexecdir=/usr/lib32 \
        --disable-cet \
        --enable-kernel=6.8.1 \
        --disable-bind-now \
        --without-selinux \
        --disable-fortify-source \
        --disable-systemtap \
        --disable-profile \
        --disable-crypt \
        --disable-sanity-checks \
        --disable-werror \
        "${_configure_flags[@]}"

    echo "build-programs=no" >> configparms
    make -O

    sed -i "/build-programs=/s#no#yes#" configparms
    echo "CFLAGS += -Wp,-D_FORTIFY_SOURCE=0" >> configparms
    make -O

    # Instrument 32-bit Glibc with BOLT
    echo "Instrumenting 32-bit Glibc with BOLT"
    # The profile and the optimized binary are written here, so the directory must exist
    mkdir -p "$srcdir/lib32-glibc-build/bolt-output"
    llvm-bolt --lite=false \
        --instrument \
        --instrumentation-file-append-pid \
        --instrumentation-file="$srcdir/lib32-glibc-build/bolt-output/libc.so.fdata" \
        "$srcdir/lib32-glibc-build/libc.so" \
        -o "$srcdir/lib32-glibc-build/libc.so.inst"

    # File names must match the input and output of the llvm-bolt call above
    echo "Moving instrumented 32-bit Glibc binary"
    mv "$srcdir/lib32-glibc-build/libc.so" "$srcdir/lib32-glibc-build/libc.so.org"
    mv "$srcdir/lib32-glibc-build/libc.so.inst" "$srcdir/lib32-glibc-build/libc.so"

    # Gather profiles with the Glibc test suite
    make check

    # Optimize 32-bit Glibc with BOLT using the collected profile
    echo "Optimizing 32-bit Glibc with BOLT"
    llvm-bolt -o "$srcdir/lib32-glibc-build/bolt-output/libc.so" \
         --data "$srcdir/lib32-glibc-build/bolt-output/libc.so.fdata" \
         "$srcdir/lib32-glibc-build/libc.so.org" \
         -reorder-blocks=ext-tsp \
         -reorder-functions=cdsort \
         -split-functions \
         -split-all-cold \
         -split-eh \
         -dyno-stats \
         -icf=1 \
         -lite=0

    # Same paths as in the 64-bit block; llvm-bolt wrote bolt-output/libc.so above
    echo "Replacing original 32-bit Glibc binary with optimized one"
    mv "$srcdir/lib32-glibc-build/libc.so" "$srcdir/lib32-glibc-build/libc.so.orig"
    mv "$srcdir/lib32-glibc-build/bolt-output/libc.so" "$srcdir/lib32-glibc-build/libc.so"

    elf/ld.so --library-path "$PWD" locale/localedef -c -f ../glibc/localedata/charmaps/UTF-8 -i ../glibc/localedata/locales/C ../C.UTF-8/
}

package_glibc() {
    pkgdesc='GNU C Library'
    depends=('linux-api-headers>=4.10' tzdata filesystem)
    optdepends=('gd: for memusagestat'
                'perl: for mtrace')
    install=glibc.install
    backup=(etc/gai.conf
            etc/locale.gen
            etc/nscd.conf)

    make -C glibc-build install_root="$pkgdir" install
    rm -f "$pkgdir"/etc/ld.so.cache

    # Shipped in tzdata
    rm -f "$pkgdir"/usr/bin/{tzselect,zdump,zic}

    cd glibc

    install -dm755 "$pkgdir"/usr/lib/{locale,systemd/system,tmpfiles.d}
    install -m644 nscd/nscd.conf "$pkgdir/etc/nscd.conf"
    install -m644 nscd/nscd.service "$pkgdir/usr/lib/systemd/system"
    install -m644 nscd/nscd.tmpfiles "$pkgdir/usr/lib/tmpfiles.d/nscd.conf"
    install -dm755 "$pkgdir/var/db/nscd"

    install -m644 posix/gai.conf "$pkgdir"/etc/gai.conf

    install -m755 "$srcdir/locale-gen" "$pkgdir/usr/bin"

    # Create /etc/locale.gen
    install -m644 "$srcdir/locale.gen.txt" "$pkgdir/etc/locale.gen"
    sed -e '1,3d' -e 's|/| |g' -e 's|\\| |g' -e 's|^|#|g' \
        "$srcdir/glibc/localedata/SUPPORTED" >> "$pkgdir/etc/locale.gen"

    # install C.UTF-8 so that it is always available
    install -dm755 "$pkgdir/usr/lib/locale"
    cp -r "$srcdir/C.UTF-8" -t "$pkgdir/usr/lib/locale"
    sed -i '/#C\.UTF-8 /d' "$pkgdir/etc/locale.gen"

    # Install the optimized libc.so.6
    install -m755 "$srcdir/glibc-build/elf/libc.so.6" "$pkgdir/usr/lib/libc.so.6"
}

package_lib32-glibc() {
    pkgdesc='GNU C Library (32-bit)'
    depends=("glibc=$pkgver")
    options+=('!emptydirs')

    cd lib32-glibc-build

    make install_root="$pkgdir" install
    rm -rf "$pkgdir"/{etc,sbin,usr/{bin,sbin,share},var}

    # We need to keep 32 bit specific header files
    find "$pkgdir/usr/include" -type f -not -name '*-32.h' -delete

    # Dynamic linker
    install -d "$pkgdir/usr/lib"
    ln -s ../lib32/ld-linux.so.2 "$pkgdir/usr/lib/"

    # Add lib32 paths to the default library search path
    install -Dm644 "$srcdir/lib32-glibc.conf" "$pkgdir/etc/ld.so.conf.d/lib32-glibc.conf"

    # Symlink /usr/lib32/locale to /usr/lib/locale
    ln -s ../lib/locale "$pkgdir/usr/lib32/locale"

    # Install the optimized 32-bit libc.so.6
    install -m755 "$srcdir/lib32-glibc-build/elf/libc.so.6" "$pkgdir/usr/lib32/libc.so.6"
}

I get this output:

Instrumenting Glibc with BOLT
BOLT-INFO: shared object or position-independent executable detected
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: fa4cc39255767bbaf63a6a3b445dc94b43ebd447
BOLT-INFO: first alloc address is 0x0
BOLT-INFO: creating new program header table at address 0xa00000, offset 0xa00000
BOLT-INFO: enabling relocation mode
BOLT-INFO: forcing -jump-tables=move for instrumentation
BOLT-INFO: enabling -align-macro-fusion=all since no profile was specified
BOLT-ERROR: bad input binary, global symbol "sys_nerr" is not unique
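
One way to see which names might trip this check is to list global symbols that occur more than once in the dynamic symbol table once the version suffix is stripped. This is only a diagnostic sketch: `duplicate_globals` is a hypothetical helper, and the idea that the duplicates come from symbol versioning (e.g. several `sys_nerr@GLIBC_*` entries) is an assumption, not something confirmed here.

```shell
# Hypothetical helper: read `readelf -W --dyn-syms` output on stdin and print
# every defined GLOBAL symbol name that appears more than once after the
# @VERSION / @@VERSION suffix is removed.
duplicate_globals() {
    awk '$5 == "GLOBAL" && $7 != "UND" {
             name = $8
             sub(/@.*/, "", name)   # drop the symbol version suffix
             seen[name]++
         }
         END { for (n in seen) if (seen[n] > 1) print n }' | sort
}

# Usage (not executed here):
#   readelf -W --dyn-syms libc.so | duplicate_globals
```

Only one symbol table is fed in at a time on purpose: mixing .dynsym and .symtab output would count most globals twice.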