mtk-openwrt / arm-trusted-firmware

Read-only mirror of Trusted Firmware-A
https://developer.trustedfirmware.org/dashboard/view/6/
Other
21 stars 20 forks source link

bl2 hang on mt7981 with BOOT_DEVICE=emmc (clang vs gcc) #8

Open arachsys opened 2 months ago

arachsys commented 2 months ago

I spotted something interesting while compiling a bootloader and trusted-firmware for an mt7981 device, but haven't been able to debug it yet. I'll have a go at digging into it over the weekend if I get a chance.

Like ARM upstream, mtk-openwrt atf's Makefile includes support for building with clang as well as gcc.

bl31 + u-boot work fine when built with clang, as does bl2 with BOOT_DEVICE=ram RAM_BOOT_UART_DL=1. However, built with clang and BOOT_DEVICE=emmc, bl2 hangs just before locating the fip partition:

NOTICE:  BL2: v2.10.0  (release):f9e4748a-dirty
NOTICE:  BL2: Built : 10:00:11, Jun  6 2024
INFO:    BL2: Doing platform setup
NOTICE:  WDT: Cold boot
NOTICE:  WDT: disabled
NOTICE:  EMI: Using DDR4 settings
NOTICE:  EMI: Detected DRAM size: 512MB
NOTICE:  EMI: complex R/W mem test passed
NOTICE:  CPU: MT7981 (1300MHz)
INFO:    MediaTek MMC/SD Card Controller ver 20201125, eco 0
INFO:    Patching MBR num of sectors: 0xffffffff -> 0xe8ffff
[hang]

Built with gcc instead, it continues normally:

NOTICE:  BL2: v2.10.0  (release):f9e4748a-dirty
NOTICE:  BL2: Built : 09:06:56, Jun  6 2024
INFO:    BL2: Doing platform setup
NOTICE:  WDT: [40000000] Software reset (reboot)
NOTICE:  EMI: Using DDR4 settings
NOTICE:  EMI: Detected DRAM size: 512MB
NOTICE:  EMI: complex R/W mem test passed
NOTICE:  CPU: MT7981 (1300MHz)
INFO:    MediaTek MMC/SD Card Controller ver 20201125, eco 0
INFO:    Patching MBR num of sectors: 0xffffffff -> 0xe8ffff
INFO:    Located partition 'fip' at 0x680000, size 0x200000
INFO:    BL2: Loading image id 3
INFO:    Loading image id=3 at address 0x40800000
INFO:    Image id=3 loaded: 0x40800000 - 0x40808148
INFO:    BL2: Loading image id 5
INFO:    Loading image id=5 at address 0x40800000
INFO:    Image id=5 loaded: 0x40800000 - 0x40867068
NOTICE:  BL2: Booting BL31
INFO:    Entry point address = 0x43001000
INFO:    SPSR = 0x3cd
NOTICE:  BL31: v2.10.0  (release):f9e4748a-dirty
[...]

My exact build command for reproduction purposes is

make CC=clang/gcc E=0 LD=ld.lld/ld STATIC=1 \
  PLAT=mt7981 BOARD_BGA=1 DRAM_USE_DDR4=1 BOOT_DEVICE=emmc \
  USE_MKIMAGE=1 MKIMAGE=../u-boot/tools/mkimage bl2

where CC is either clang 18.1.5 or gcc aarch64 14.1.0, and LD is set to ld.lld or ld to match.

arachsys commented 2 months ago

A bit of INFO() debugging implies it locks up in load_mbr_header() in drivers/partition/partition.c. The MBR boot signature check succeeds:

    plat_patch_mbr_header(mbr_sector);

    /* Check MBR boot signature. */
    if ((mbr_sector[LEGACY_PARTITION_BLOCK_SIZE - 2] != MBR_SIGNATURE_FIRST) ||
        (mbr_sector[LEGACY_PARTITION_BLOCK_SIZE - 1] != MBR_SIGNATURE_SECOND)) {
        VERBOSE("MBR boot signature failure\n");
        return -ENOENT;
    }

and then

    tmp = (mbr_entry_t *)(&mbr_sector[MBR_PRIMARY_ENTRY_OFFSET]);

gives tmp = 0x241cae (on my board), but the attempt to dereference tmp->first_lba in the next line locks up bl2. (An INFO("Dereference %u\n", tmp->first_lba) is enough to lockup.)

Built with gcc, tmp = 0x23d5de (bl2 is slightly smaller), dereferencing tmp->first_lba succeeds, and boot continues normally.

For completeness, the hang happens when bl2 is compiled with clang and linked with binutils ld, but doesn't happen when bl2 is compiled with gcc and linked with lld, so the choice of compiler is significant and the choice of linker is not.

arachsys commented 2 months ago

Ah, I'm an idiot - it's staring me in the face. tmp is 0x241cae so tmp->first_lba will be an unaligned access. Sure enough, the obvious

diff --git a/drivers/partition/partition.c b/drivers/partition/partition.c
index 1436ddd3..6913d8bd 100644
--- a/drivers/partition/partition.c
+++ b/drivers/partition/partition.c
@@ -56,7 +56,7 @@ static int load_mbr_header(uintptr_t image_handle, mbr_entry_t *mbr_entry)
 {
        size_t bytes_read;
        int result;
-       mbr_entry_t *tmp;
+       mbr_entry_t tmp;

        assert(mbr_entry != NULL);
        /* MBR partition table is in LBA0. */
@@ -81,19 +81,19 @@ static int load_mbr_header(uintptr_t image_handle, mbr_entry_t *mbr_entry)
                return -ENOENT;
        }

-       tmp = (mbr_entry_t *)(&mbr_sector[MBR_PRIMARY_ENTRY_OFFSET]);
+       memcpy(&tmp, mbr_sector + MBR_PRIMARY_ENTRY_OFFSET, sizeof tmp);

-       if (tmp->first_lba != 1) {
+       if (tmp.first_lba != 1) {
                VERBOSE("MBR header may have an invalid first LBA\n");
                return -EINVAL;
        }

-       if ((tmp->sector_nums == 0) || (tmp->sector_nums == UINT32_MAX)) {
+       if ((tmp.sector_nums == 0) || (tmp.sector_nums == UINT32_MAX)) {
                VERBOSE("MBR header entry has an invalid number of sectors\n");
                return -EINVAL;
        }

-       memcpy(mbr_entry, tmp, sizeof(mbr_entry_t));
+       memcpy(mbr_entry, &tmp, sizeof(mbr_entry_t));
        return 0;
 }

fixes it and it boots normally. Presumably gcc's -mstrict-align papers over this UB whereas clang's doesn't.

arachsys commented 1 month ago

The bug fix is now in upstream ARM trusted-firmware-a:

https://github.com/ARM-software/arm-trusted-firmware/commit/21a77e08921a13ac4adc523a136d829333a854f1