llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
http://llvm.org
Other
27.97k stars 11.54k forks source link

[BOLT] GOT array pointer incorrectly rewritten #100096

Open peterwaller-arm opened 1 month ago

peterwaller-arm commented 1 month ago

Problem

The problem is present on recent (at time of writing) main commit 5f05d5ec8f9bb15c0ac29fce843a2c73165ac414.

Static binary glibc startup crash message

This manifests in glibc static binaries on aarch64-linux with the binary crashing on startup with the error message Unexpected reloc type in static binary. What’s happening is that the glibc startup code iterates over an array of reloc, but it runs off the end of the array because the array end pointer which lives in the GOT is no longer valid.

It is currently known to manifest on aarch64-linux with glibc, but underlying issue may not be unique to that target or scenario, it may be at risk of happening elsewhere.

Test case (copy paste whole block, produces ‘./testcase’)

clang -nostartfiles -nodefaultlibs -static \
  -Wl,--emit-relocs \
  -Wl,--section-start=.array=0x1000 \
  -Wl,--section-start=.text=0x1004 \
  -Wl,--section-start=.data=0x800000 \
  -x assembler -o testcase - <<EOF

.section .array, "a", @progbits
.globl array_start
.globl array_end
array_start:
  .word 0
array_end:

.section .text
.globl _start
_start:
  adrp x1, #:got:array_start
  ldr x1, [x1, #:gotpage_lo15:array_start]
  adrp x0, #:got:array_end
  ldr x0, [x0, #:gotpage_lo15:array_end]
  adrp x2, #:got:_start
  ldr x2, [x2, #:gotpage_lo15:_start]
  ret
EOF

objdump --disassemble-all --reloc testcase output:

0000000000001000 <array_start>:
    1000:       00000000        udf     #0

Disassembly of section .text:

0000000000001004 <_start>:
    1004:       d00000e1        adrp    x1, 1f000 <_start+0x1dffc>
                        1004: R_AARCH64_ADR_GOT_PAGE    array_start
    1008:       f947f421        ldr     x1, [x1, #4072]
                        1008: R_AARCH64_LD64_GOTPAGE_LO15       array_start
    100c:       d00000e0        adrp    x0, 1f000 <_start+0x1dffc>
                        100c: R_AARCH64_ADR_GOT_PAGE    array_end
    1010:       f947f800        ldr     x0, [x0, #4080]
                        1010: R_AARCH64_LD64_GOTPAGE_LO15       array_end
    1014:       d00000e2        adrp    x2, 1f000 <_start+0x1dffc>
                        1014: R_AARCH64_ADR_GOT_PAGE    _start
    1018:       f947fc42        ldr     x2, [x2, #4088]
                        1018: R_AARCH64_LD64_GOTPAGE_LO15       _start
    101c:       d65f03c0        ret

Disassembly of section .got:

000000000001ffc8 <.got>:
        ...

000000000001ffe0 <_GLOBAL_OFFSET_TABLE_>:
        ...
   1ffe8:       00001000        udf     #4096
   1ffec:       00000000        udf     #0
   1fff0:       00001004        udf     #4100
   1fff4:       00000000        udf     #0
   1fff8:       00001004        udf     #4100
   1fffc:       00000000        udf     #0

What breaks

Note above there are GOT entries 1ffe8 (array_start), 1fff0 (array_end) and 1fff8 (_start), and that the address of array_end shares its address with _start (equal to 00001004).

Running llvm-bolt -o testcase.bolt testcase, BOLT erroneously rewrites both these got entries, the new GOT contains:

000000000001ffe0 <_GLOBAL_OFFSET_TABLE_>:
        ...
   1ffe8:       00001000        udf     #4096
   1ffec:       00000000        udf     #0
   1fff0:       00400000        .inst   0x00400000 ; undefined
   1fff4:       00000000        udf     #0
   1fff8:       00400000        .inst   0x00400000 ; undefined
   1fffc:       00000000        udf     #0

This is incorrect because 1fff0 has been rewritten to point to the new start, but the array data has not moved.

Code which is doing for (void *i = array_start; i != array_end; i++) will now run off the end of the array.

Why does it break

The culprit is the code in patchELFGOT which currently assumes that all pointers in the GOT may be interpreted as function pointers:

https://github.com/llvm/llvm-project/blob/5f05d5ec8f9bb15c0ac29fce843a2c73165ac414/bolt/lib/Rewrite/RewriteInstance.cpp#L5271-L5277

Thinking about solutions

Can the symbol referenced by a GOT entry be reconstructed?

So far as I’m currently aware we don’t have a straightforward way to determine which symbol a GOT entry points to. If only the address contained within the GOT entry is considered, this does not distinguish _start and array_end as in the reproducer provided above.

Determining which relocations point to a given GOT entry could require determining register values for some programs, since we don’t get a fully-qualified pointer to the GOT from a GOT relocation for a specific symbol alone. Consider that you could have a register value holding a base pointer to a GOT page, and that register value gets reused for multiple different GOT references within the page.

The base pointer could be hoisted out of a loop and ~would not itself~ relocations initializing it would not contain information connecting it to the symbol of interest, so determining the GOT entry being referenced by a relocation would require some kind of symbolic execution to determine the base register value in the worst case.

cc @aaupov @maksfb for BOLT cc @MaskRay because I think you might be interested and knowledgable, particularly if there is some information BOLT can use to differentiate GOT entries containing the same address but referencing different symbols straightforwardly, or if there are alternative approaches to consider in light of additional linker knowledge.

llvmbot commented 1 month ago

@llvm/issue-subscribers-bolt

Author: Peter Waller (peterwaller-arm)

# Problem * An address alone is not enough to distinguish a pointer-to-data vs pointer-to-function (see test case). * The `patchELFGOT` function currently assumes values in the GOT are pointers-to-functions, and if the function moves, the address is updated. * In the case of an ‘end of array’ pointer address coinciding with an address of a function which is moved by bolt, bolt updates the ‘end of array’ pointer in the GOT when it should not. # Static binary glibc startup crash message This manifests in glibc static binaries on aarch64-linux with the binary crashing on startup with the error message `Unexpected reloc type in static binary`. What’s happening is that the glibc startup code iterates over an array of reloc, but it runs off the end of the array because the array end pointer which lives in the GOT is no longer valid. It is currently known to manifest on aarch64-linux with glibc, but underlying issue may not be unique to that target or scenario, it may be at risk of happening elsewhere. # Test case (copy paste whole block, produces ‘./testcase’) ``` clang -nostartfiles -nodefaultlibs -static \ -Wl,--emit-relocs \ -Wl,--section-start=.array=0x1000 \ -Wl,--section-start=.text=0x1004 \ -Wl,--section-start=.data=0x800000 \ -x assembler -o testcase - <<EOF .section .array, "a", @progbits .globl array_start .globl array_end array_start: .word 0 array_end: .section .text .globl _start _start: adrp x1, #:got:array_start ldr x1, [x1, #:gotpage_lo15:array_start] adrp x0, #:got:array_end ldr x0, [x0, #:gotpage_lo15:array_end] adrp x2, #:got:_start ldr x2, [x2, #:gotpage_lo15:_start] ret EOF ``` # `objdump --disassemble-all --reloc testcase` output: ``` 0000000000001000 <array_start>: 1000: 00000000 udf #0 Disassembly of section .text: 0000000000001004 <_start>: 1004: d00000e1 adrp x1, 1f000 <_start+0x1dffc> 1004: R_AARCH64_ADR_GOT_PAGE array_start 1008: f947f421 ldr x1, [x1, #4072] 1008: R_AARCH64_LD64_GOTPAGE_LO15 array_start 100c: d00000e0 adrp x0, 1f000 <_start+0x1dffc> 100c: R_AARCH64_ADR_GOT_PAGE array_end 1010: f947f800 ldr x0, [x0, #4080] 1010: R_AARCH64_LD64_GOTPAGE_LO15 array_end 1014: d00000e2 adrp x2, 1f000 <_start+0x1dffc> 1014: R_AARCH64_ADR_GOT_PAGE _start 1018: f947fc42 ldr x2, [x2, #4088] 1018: R_AARCH64_LD64_GOTPAGE_LO15 _start 101c: d65f03c0 ret Disassembly of section .got: 000000000001ffc8 <.got>: ... 000000000001ffe0 <_GLOBAL_OFFSET_TABLE_>: ... 1ffe8: 00001000 udf #4096 1ffec: 00000000 udf #0 1fff0: 00001004 udf #4100 1fff4: 00000000 udf #0 1fff8: 00001004 udf #4100 1fffc: 00000000 udf #0 ``` # What breaks Note above there are GOT entries `1ffe8` (`array_start`), `1fff0` (`array_end`) and `1fff8` (`_start`), and that the address of `array_end` shares its address with `_start` (equal to `00001004`). Running `llvm-bolt -o testcase.bolt testcase`, BOLT erroneously rewrites both these got entries, the new GOT contains: ``` 000000000001ffe0 <_GLOBAL_OFFSET_TABLE_>: ... 1ffe8: 00001000 udf #4096 1ffec: 00000000 udf #0 1fff0: 00400000 .inst 0x00400000 ; undefined 1fff4: 00000000 udf #0 1fff8: 00400000 .inst 0x00400000 ; undefined 1fffc: 00000000 udf #0 ``` This is incorrect because `1fff0` has been rewritten to point to the new start, but the array data has not moved. Code which is doing `for (void *i = array_start; array_start != array_end; i++)` will now run off the end of the array. # Why does it break The culprit is the code in `patchELFGOT` which currently assumes that all pointers in the GOT may be interpreted as function pointers: https://github.com/llvm/llvm-project/blob/5f05d5ec8f9bb15c0ac29fce843a2c73165ac414/bolt/lib/Rewrite/RewriteInstance.cpp#L5271-L5277 # Thinking about solutions * The case of data and code sharing an address is an awkward edge case. It happens here because it’s valid to have a pointer to ‘one past the end of an array’ (per C standard “Additive operators” regarding integer addition to a pointer), and that pointer can end up being referenced through the GOT. * This edge case is rare but is at last known to create an issue with glibc static binaries. There could be other binaries in the wild with problems that may be revealed at runtime. * There is an immediate problem of making glibc static binaries work. * @paschalis-mpeis attempted a fix in https://github.com/llvm/llvm-project/pull/97710, but the fix is incomplete in that it does not handle the case in general. I’m not sure about the rationale of the condition used or whether it is valid. * We may want to special case somehow this if it leads to a clean immediate solution, with an approach similar to that presented by @paschalis-mpeis above, though the exact conditions are subject for discussion. * If the type referred to by a GOT entry was known, `patchELFGOT` could query the type and use conditionally use the correct getNewFunctionAddress / getBinaryDataAtAddress in each case. * What does distinguish `array_end` and `_start` is that the symbols belong to different sections (even though they share an address), and the sections are of different types. If it were possible to determine the symbol referenced in the GOT then patchELFGOT could straightforwardly avoid patching entries which reference data symbols (or patch them if it's moving them). * I am not currently aware of any way to determine the type of the data referred to by a GOT entry, other than to look for relocations which point to the entry [discussion below]. # Can the symbol referenced by a GOT entry be reconstructed? So far as I’m currently aware we don’t have a straightforward way to determine which symbol a GOT entry points to. If only the address contained within the GOT entry is considered, this does not distinguish `_start` and `array_end` as in the reproducer provided above. Determining which relocations point to a given GOT entry could require determining register values for some programs, since we don’t get a fully-qualified pointer to the GOT from a GOT relocation for a specific symbol alone. Consider that you could have a register value holding a base pointer to a GOT page, and that register value gets reused for multiple different GOT references within the page. The base pointer could be hoisted out of a loop and would not itself contain information connecting it to the symbol of interest, so determining the GOT entry being referenced by a relocation would require some kind of symbolic execution to determine the base register value in the worst case. cc @aaupov @maksfb for BOLT cc @MaskRay because I think you might be interested and knowledgable, particularly if there is some information BOLT can use to differentiate symbols which share the same address in the GOT straightforwardly, or if there are alternative approaches to consider in light of additional linker knowledge.
yota9 commented 1 month ago

What's worse is that during we add new objects in discoverFileObjects() - registerName() we're connecting them to section using their address and thus connecting them to a wrong section in such cases. @maksfb @rafaelauler I think we need to use SymbolRef.getSection() on the first place and we need to extend api of creating BinaryData with passing its real section. The same probably should be applied on BinaryFunction created further, now it is created with getSectionForAddress. What do you think?

yota9 commented 1 month ago

Hello @peterwaller-arm ! I've implemented not fix, but check and abort in such situations in BOLT in #100801. Please take a look to a commit message that describes my thoughts about handling such situations in BOLT levels. TL'DR it would be good, but currently too expensive and also I think we need to blame linker that left GOT table in fully static binary, it seems that if you would use LLD the problem would be auto eliminated :)