Open dannf opened 3 years ago
console log of a linux reboot -> no crash -> linux reboot -> crash -> watchdog reboot -> crash -> watchdog reboot -> crash
I also experienced the crash of AArch64 build. Based on your finding, I looked into crt0-efi-aarch64.S which invokes _relocate(): https://github.com/rhboot/gnu-efi/blob/f0f98248649b4b219764bd46854697bcec858081/gnuefi/crt0-efi-aarch64.S#L149-L163
When checking _DYNAMIC, it's mentioned in the gnuefi document: https://github.com/rhboot/gnu-efi/blob/f0f98248649b4b219764bd46854697bcec858081/README.gnuefi#L300-L357
Per the document, _DYNAMIC is supposed to be the offset to .dynamic. Unlike x86_64, objcopy doesn't support AArch64 EFI, so those pe-coff section header are crafted manually in crt0-efi-aarch64.S, and .dynamic was merged into .data. To track _DYNAMIC, I first checked objdump -x shimaa64.so | grep _DYNAMIC
:
000000000007a910 l O *ABS* 0000000000000000 _DYNAMIC
and then dumped 0x07a910
of shimaa64.so:
0007a910 ff 00 00 00 0b 00 00 00 44 20 0a 00 00 00 00 00 |........D ......|
0007a920 00 00 00 00 00 00 00 00 58 48 0a 00 00 00 00 00 |........XH......|
0007a930 58 48 0a 00 00 00 00 00 00 01 00 00 0b 00 00 00 |XH..............|
0007a940 4f 20 0a 00 00 00 00 00 00 00 00 00 00 00 00 00 |O ..............|
0007a950 76 52 0a 00 00 00 00 00 79 48 0a 00 00 00 00 00 |vR......yH......|
0007a960 01 01 00 00 08 00 00 00 5a 20 0a 00 00 00 00 00 |........Z ......|
0007a970 00 00 00 00 00 00 00 00 7d 48 0a 00 00 00 00 00 |........}H......|
On the other hand, the same content in shimaa64.efi was in a different offset:
0006a910 ff 00 00 00 0b 00 00 00 44 20 0a 00 00 00 00 00 |........D ......|
0006a920 00 00 00 00 00 00 00 00 58 48 0a 00 00 00 00 00 |........XH......|
0006a930 58 48 0a 00 00 00 00 00 00 01 00 00 0b 00 00 00 |XH..............|
0006a940 4f 20 0a 00 00 00 00 00 00 00 00 00 00 00 00 00 |O ..............|
0006a950 76 52 0a 00 00 00 00 00 79 48 0a 00 00 00 00 00 |vR......yH......|
0006a960 01 01 00 00 08 00 00 00 5a 20 0a 00 00 00 00 00 |........Z ......|
0006a970 00 00 00 00 00 00 00 00 7d 48 0a 00 00 00 00 00 |........}H......|
But, there is no 10 a9 06 00 00 00 00 00
in shimaa64.efi but 10 a9 07 00 00 00 00 00
.
0007a200 00 00 00 00 00 00 00 00 10 a9 07 00 00 00 00 00 |................|
0007a210 e0 37 07 00 00 00 00 00 70 57 07 00 00 00 00 00 |.7......pW......|
0007a220 b8 22 07 00 00 00 00 00 58 89 07 00 00 00 00 00 |."......X.......|
0007a230 00 c0 07 00 00 00 00 00 b0 61 07 00 00 00 00 00 |.........a......|
0007a240 d8 21 07 00 00 00 00 00 48 b7 05 00 00 00 00 00 |.!......H.......|
0007a250 48 5f 07 00 00 00 00 00 28 60 07 00 00 00 00 00 |H_......(`......|
0007a260 60 70 06 00 00 00 00 00 08 56 07 00 00 00 00 00 |`p.......V......|
The same content in shimaa64.so:
0008a200 00 00 00 00 00 00 00 00 10 a9 07 00 00 00 00 00 |................|
0008a210 e0 37 07 00 00 00 00 00 70 57 07 00 00 00 00 00 |.7......pW......|
0008a220 b8 22 07 00 00 00 00 00 58 89 07 00 00 00 00 00 |."......X.......|
0008a230 00 c0 07 00 00 00 00 00 b0 61 07 00 00 00 00 00 |.........a......|
0008a240 d8 21 07 00 00 00 00 00 48 b7 05 00 00 00 00 00 |.!......H.......|
0008a250 48 5f 07 00 00 00 00 00 28 60 07 00 00 00 00 00 |H_......(`......|
0008a260 60 70 06 00 00 00 00 00 08 56 07 00 00 00 00 00 |`p.......V......|
So it seems there is a gap of 0x10000 in shimaa64.efi, and maybe this caused the crash.
The first 0x10000 data of shimaa64.so is the elf header and we use objcopy -O binary
to strip it. This explains the 0x10000 gap between shimaa64.so and shimaa64.efi.
Nice find @lcp . I had added some debug Print()
statements yesterday to _relocate()
and it thinks it is working through a relsz
of 0x4f000. If my assumption is correct that there should not /be/ any relocations, then it definitely looks like it is just processing garbage.
Disassembled _start in shimaa64.so:
0000000000001000 <_start>:
1000: a9be7bfd stp x29, x30, [sp, #-32]!
1004: 910003fd mov x29, sp
1008: a90107e0 stp x0, x1, [sp, #16]
100c: aa0003e2 mov x2, x0
1010: aa0103e3 mov x3, x1
1014: 10ff7f60 adr x0, 0 <ImageBase>
1018: b00003c1 adrp x1, 7a000 <ErrorCodeTable+0x60>
101c: 91244021 add x1, x1, #0x910
1020: 94019311 bl 65c64 <_relocate>
1024: b5000060 cbnz x0, 1030 <_start+0x30>
1028: a94107e0 ldp x0, x1, [sp, #16]
102c: 94000a38 bl 390c <efi_main>
1030: a8c27bfd ldp x29, x30, [sp], #32
1034: d65f03c0 ret
_DYNAMIC is the offset mentioned at '1018': 0x7a000. The content of shimaa64.so at 0x7a000+0x1018+0x10000:
0008a010 07 00 00 00 00 00 00 80 50 f8 0a 00 00 00 00 00 |........P.......|
0008a020 08 00 00 00 00 00 00 80 6a f8 0a 00 00 00 00 00 |........j.......|
0008a030 09 00 00 00 00 00 00 80 8a f8 0a 00 00 00 00 00 |................|
0008a040 0a 00 00 00 00 00 00 80 ac f8 0a 00 00 00 00 00 |................|
0008a050 0b 00 00 00 00 00 00 80 ca f8 0a 00 00 00 00 00 |................|
0008a060 0c 00 00 00 00 00 00 80 e2 f8 0a 00 00 00 00 00 |................|
0008a070 0d 00 00 00 00 00 00 80 f4 f8 0a 00 00 00 00 00 |................|
0008a080 0e 00 00 00 00 00 00 80 10 f9 0a 00 00 00 00 00 |................|
0008a090 0f 00 00 00 00 00 00 80 24 f9 0a 00 00 00 00 00 |........$.......|
0008a0a0 10 00 00 00 00 00 00 80 40 f9 0a 00 00 00 00 00 |........@.......|
Disassembled _start in shimaa64.efi:
0000000000001000 .text:
1000: fd 7b be a9 stp x29, x30, [sp, #-32]!
1004: fd 03 00 91 mov x29, sp
1008: e0 07 01 a9 stp x0, x1, [sp, #16]
100c: e2 03 00 aa mov x2, x0
1010: e3 03 01 aa mov x3, x1
1014: 60 7f ff 10 adr x0, #-4116
1018: c1 03 00 b0 adrp x1, #495616
101c: 21 40 24 91 add x1, x1, #2320
1020: 11 93 01 94 bl #412740 <.text+0x64c64>
1024: 60 00 00 b5 cbnz x0, #12 <.text+0x30>
1028: e0 07 41 a9 ldp x0, x1, [sp, #16]
102c: 38 0a 00 94 bl #10464 <.text+0x290c>
1030: fd 7b c2 a8 ldp x29, x30, [sp], #32
1034: c0 03 5f d6 ret
In this case, the offset to _DYNAMIC is 495616=0x79000. The content of shimaa64.so at 0x79000+0x1018:
0007a010 07 00 00 00 00 00 00 80 50 f8 0a 00 00 00 00 00 |........P.......|
0007a020 08 00 00 00 00 00 00 80 6a f8 0a 00 00 00 00 00 |........j.......|
0007a030 09 00 00 00 00 00 00 80 8a f8 0a 00 00 00 00 00 |................|
0007a040 0a 00 00 00 00 00 00 80 ac f8 0a 00 00 00 00 00 |................|
0007a050 0b 00 00 00 00 00 00 80 ca f8 0a 00 00 00 00 00 |................|
0007a060 0c 00 00 00 00 00 00 80 e2 f8 0a 00 00 00 00 00 |................|
0007a070 0d 00 00 00 00 00 00 80 f4 f8 0a 00 00 00 00 00 |................|
0007a080 0e 00 00 00 00 00 00 80 10 f9 0a 00 00 00 00 00 |................|
0007a090 0f 00 00 00 00 00 00 80 24 f9 0a 00 00 00 00 00 |........$.......|
0007a0a0 10 00 00 00 00 00 00 80 40 f9 0a 00 00 00 00 00 |........@.......|
So at least the offset to _DYNAMIC is correct.
Nice find @lcp . I had added some debug
Print()
statements yesterday to_relocate()
and it thinks it is working through arelsz
of 0x4f000. If my assumption is correct that there should not /be/ any relocations, then it definitely looks like it is just processing garbage.
Indeed, I also found that a huge relsz
(0x4e000) in my AArch64 shim build:
0007a970 07 00 00 00 00 00 00 00 00 d0 07 00 00 00 00 00 |................|
0007a980 08 00 00 00 00 00 00 00 00 e0 04 00 00 00 00 00 |................|
0007a990 09 00 00 00 00 00 00 00 18 00 00 00 00 00 00 00 |................|
DT_RELA: 0x7d000 DT_RELASZ: 0x4e000 DT_RELAENT: 0x18
In my case, rela
is 0x7d000 and the total image file size is 0xcb000. relsz
is 0x4e000 = 0xcb000-0x7d000, so that means that the relocation table covers the whole image below 0x7d000. Then I checked further:
00099790 03 04 00 00 00 00 00 00 54 88 0a 00 00 00 00 00 |........T.......|
000997a0 80 97 07 00 00 00 00 00 03 04 00 00 00 00 00 00 |................|
000997b0 40 71 07 00 00 00 00 00 53 00 48 00 49 00 4d 00 |@q......S.H.I.M.|
000997c0 5f 00 44 00 45 00 42 00 55 00 47 00 00 00 00 00 |_.D.E.B.U.G.....|
000997d0 61 00 64 00 64 00 2d 00 73 00 79 00 6d 00 62 00 |a.d.d.-.s.y.m.b.|
000997e0 6f 00 6c 00 2d 00 66 00 69 00 6c 00 65 00 20 00 |o.l.-.f.i.l.e. .|
R_AARCH64_RELATIVE is 1027(0x403), so the entries containing 03 04 00 00 00 00 00 00
are supposed to be struct Elf64_Rela
. However, there are strings in the image after 0x997b8, and those are clearly not struct Elf64_Rela
. So relsz
is way too large than expected...
Keeping in mind I've little experience w/ binary formats, I wonder what it means that readelf
claims there is no dynamic section:
$ readelf -a shimaa64.so 2> /dev/null| grep -i DYNAMIC
There is no dynamic section in this file.
1889: 00000000000798e8 0 OBJECT LOCAL DEFAULT ABS _DYNAMIC
Is it possible that there simply is no .dynamic array, and the value of _DYNAMIC
's is undefined? gnu-efi
seems to always assume that there will be a .dynamic
section - and that it will contain at least a terminating DT_NULL
entry, but maybe it shouldn't?
The dynamic section is merged into the data section: https://github.com/rhboot/shim/blob/main/elf_aarch64_efi.lds#L30
Ideally, those elf sections would be converted to pe-coff sections by objcopy, but it wasn't implemented for AArch64 yet, so shim currently merges those sections into text, data, or rodata sections and crafts the section headers with assembly.
Yeah, I tried moving the .dynamic
section out of .data
and into it's own section in the linker script to see if that would make a .dynamic
section appear in the elf .so, but it didn't. That's what made me think it may just not exist. Note that none of the .o files that are being linked together have a .dynamic
section, and I wondered if that would make *(.dynamic)
NULL.
EDIT: I was doing it wrong, I do see a .dynamic
section appear in the elf if I create one at the end.
I started to realize that the relocation is necessary. For example, a function pointer is stored in .data section and its content points to a function in .text section. When EFI image is loaded, those function pointers needs to be relocated from the original addresses (rel->r_addend) to the actual running addresses (ImageBase + rel->r_addend). It seems to me that objcopy tends to assign the size of the whole .rodata section to .rela, and this caused _relocate() to go through the whole .rodata section. We have to fix the size of .rela to avoid the potential crash.
Correction: It's ld not objcopy to write the section size.
I wrote a testing patch for elf_aarch64_efi.lds: https://github.com/lcp/shim/commit/686ff993cb8871e1c688277eca05fa01218d8733
In the .dynamic section of patched shimaa64.efi:
0007a970 07 00 00 00 00 00 00 00 00 b0 09 00 00 00 00 00 |................|
0007a980 08 00 00 00 00 00 00 00 b8 c7 01 00 00 00 00 00 |................|
0007a990 09 00 00 00 00 00 00 00 18 00 00 00 00 00 00 00 |................|
DT_RELA: 0x9b000 DT_RELASZ: 0x1c7b8 DT_RELAENT: 0x18
The size looks much reasonable now :)
Oh, nice @lcp. Yeah, with your patch I also see a more reasonable relsz: 0x1c7a0
, and it is working for me. I've put my guest in a reboot loop to see if it hits any other issues - at #71 so far w/o a problem.
Survived > 1600 reboots before I stopped it, so with this arm64 booting seems pretty robust :)
Yay, good news. Although I'm definitely feeling we should be focussing on getting proper toolchain support which would make this moot... :-)
@steve-mcintyre
Is there a concise technical description of the missing pieces in the toolchain? I know some of the Arm folks working in that area and would be happy to convey a detailed request for assistance.
@vielmetti Since objcopy doesn't support PE/COFF for AArch64, shim/gnu-efi has to craft the PE/COFF header like this:
To make the header work properly, we have to tweak some fields in the section table such as VirtualSize and the variables in the ld script. If objcopy could support pe-aarch64 target, the we can get rid of those workarounds and just convert the elf sections into the result EFI binary.
The binutils bug entry: https://sourceware.org/bugzilla/show_bug.cgi?id=26206
Cool, I hadn't found the bug report yet and was just thinking about adding one myself
@steve-mcintyre
Is there a concise technical description of the missing pieces in the toolchain? I know some of the Arm folks working in that area and would be happy to convey a detailed request for assistance.
Hi @vielmetti LTNS! I've just mailed the linaro-toolchain list about this as well, with a reference to the binutils bug that @lcp referenced. Waiting on that showing up in the archives...
Thanks @steve-mcintyre - relayed this to some Arm folks who may be able to get some additional eyes on task.
@dannf @steve-mcintyre et al - I'm coming back to this problem after about a month away - there has clearly been some progress in resolving this, but I'm not sure where the current discussion lives (or if it's been summarized recently).
My goal is to provide direction for our team that builds Ubuntu and Debian images to make sure they pick up the right versions of the right components so that GRUB works correctly on Ubuntu 20.04 / Debian 10 on arm64 server systems (specifically the Ampere systems at Equinix Metal).
@dannf @steve-mcintyre et al - I'm coming back to this problem after about a month away - there has clearly been some progress in resolving this, but I'm not sure where the current discussion lives (or if it's been summarized recently).
My goal is to provide direction for our team that builds Ubuntu and Debian images to make sure they pick up the right versions of the right components so that GRUB works correctly on Ubuntu 20.04 / Debian 10 on arm64 server systems (specifically the Ampere systems at Equinix Metal).
For Ubuntu this has been resolved in shim 15.4-0ubuntu7 - see https://launchpad.net/bugs/1928010.
I've been trying to track down an issue with Ubuntu's shim, that I've now found is reproducible with latest upstream shim (as of commit 4068fd42c891ea6ebdec056f461babc6e4048844) as well.
If I put a guest in a reboot loop, it will eventually crash:
After it has crashed once, it will enter a watchdog reset/shim crash infinite loop.
I attached gdb to the QEMU process, and I'm pretty confident that it is crashing in
_relocate()
- specifically here:Object file layouts are out of my wheelhouse, but I did try to see if I could understand what is going on. First off, it seems odd that
_relocate()
is finding any relocations to apply in the first place. If I understand this correctly, this is just agnu-efi
's generic loader for EFI binaries, and executables shouldn't have relocations. Indeed, objdump shows no relocations in the ELF shim binary (I don't know how to scan the PE/COFF one).The faulting address appears to be
0x00000000BB9752C8
, which should fall within the.text
section. If I understand this correctly, it's setting up the attributes for the text and data sections:The text section would be the 2nd range there, which has the READONLY attribute set. So, that explains why we're getting the exception. But if relocations need to occur, seems like this would need to be writeable though, again, it doesn't seem like an executable would need relocations...