lategoodbye / rpi-zero


Boot hang (with Firmware transaction timeout error) on Raspberry Pi 4B #52

Open wizeman opened 3 years ago

wizeman commented 3 years ago

I have a Raspberry Pi 4B (8 GB model) which hangs at boot with the following errors (apologies for the photo, but I don't have a serial console at hand):

IMG_20210722_180151

This happens 100% of the time with both mainline kernels 5.10.52 and 5.13.4.

The only connected peripherals are an SSD (which is the boot disk) connected over a Startech SATA-USB 3.1 adapter cable, an ethernet cable, HDMI adapter and the official Raspberry Pi power brick. No SD card or other USB devices (such as a keyboard) are connected.

I've tested disconnecting either the network cable or the HDMI adapter but it still hangs, 100% of the time (as far as I can tell - although I'm not 100% certain without the HDMI output).

Interestingly, I have another Raspberry Pi 4B which also has 8 GB of RAM and is the exact same board revision (0xd03114) which boots perfectly fine, 100% of the time, with the exact same power brick and peripherals attached (including the exact same USB disk with the exact same contents).
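For reference, the revision code quoted above can be unpacked into its fields. This is a minimal sketch that assumes the new-style revision bit layout published in the Raspberry Pi documentation; the function name `decode_revision` is hypothetical and the board-type table is deliberately partial.

```python
# Decode a new-style Raspberry Pi revision code into its bit fields.
# Assumed layout (low to high): revision (4 bits), type (8), processor (4),
# manufacturer (4), memory size (3), new-style flag (1).
MEMORY = {0: "256MB", 1: "512MB", 2: "1GB", 3: "2GB", 4: "4GB", 5: "8GB"}
MANUFACTURER = {0: "Sony UK", 1: "Egoman", 2: "Embest",
                3: "Sony Japan", 4: "Embest", 5: "Stadium"}
PROCESSOR = {0: "BCM2835", 1: "BCM2836", 2: "BCM2837", 3: "BCM2711"}
BOARD_TYPE = {0x11: "4B"}  # partial table, only the board discussed here


def decode_revision(code):
    """Return the decoded fields of a new-style revision code as a dict."""
    if not (code >> 23) & 1:
        raise ValueError("old-style revision code, not handled by this sketch")
    return {
        "memory": MEMORY.get((code >> 20) & 0x7, "unknown"),
        "manufacturer": MANUFACTURER.get((code >> 16) & 0xF, "unknown"),
        "processor": PROCESSOR.get((code >> 12) & 0xF, "unknown"),
        "type": BOARD_TYPE.get((code >> 4) & 0xFF, "unknown"),
        "revision": "1.%d" % (code & 0xF),
    }


print(decode_revision(0xD03114))
```

Two boards reporting the same code share the same model, memory size, manufacturer and board revision, but the code says nothing about per-unit defects.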

This would indicate that there is a hardware problem. However (and quite surprisingly!), Raspberry Pi's kernel 5.10.52 does boot without any issues, 100% of the time. I have also booted an official image of Raspberry Pi OS (raspbian?), which I assume uses Raspberry Pi's kernel, and it also booted many times without any problems.

Do you know if there is a quicker way to find out what's going on without having to bisect kernels (which would take a long time given the way kernels are built on NixOS) and without using a serial console?

lategoodbye commented 3 years ago

@wizeman A firmware transaction timeout indicates an issue with the VideoCore firmware, so I don't believe bisecting the kernel will help to narrow down this issue. I suggest enabling debug symbols for stack traces (CONFIG_KALLSYMS).

Another idea is to build the mainline kernel and use it to replace the kernel on a Raspberry Pi OS image, just to see if the issue is reproducible there.

At least it would be helpful to know the kernel config.

wizeman commented 3 years ago

Here's a stack trace with CONFIG_KALLSYMS enabled:

IMG_20210726_024800

Here's the full kernel config corresponding to the kernel in the above screenshot:

config.txt

@lategoodbye Note that my main testing setup is the NixOS minimal installation aarch64 image, which is almost 100% reproducible. My latest tests were with Raspberry Pi firmware release 1.20210527 (the latest stable release) and with the latest stable EEPROM image. The wireless firmware is older, since I haven't updated it.

Here's how it went:

  1. Booting installation image with mainline kernel 5.10.52 fails as above.
  2. Booting installation image with mainline kernel 5.13.4 fails as above.
  3. Installation image with Raspberry Pi kernel 5.10.52 boots successfully.

Note that the change from 1 -> 2 and from 2 -> 3 is literally a 1-line code change (to specify which kernel to use). Everything else in the installation image is exactly the same, as well as the hardware. However, some kernel options do change between kernels because of different upstream defaults, no-longer existing kernel config options or new kernel options, depending on which kernel is being used. All kernels are built from scratch by the NixOS build system (to ensure reproducibility).
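Since the kernel options differ between these builds, comparing the resulting configs can narrow down what changed. The following is a minimal sketch, assuming the usual `.config` line formats (`CONFIG_X=value` and `# CONFIG_X is not set`); the helper names `parse_config` and `diff_configs` are hypothetical.

```python
import re

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map each option in a kernel .config to its value ('n' if not set)."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_configs(old_text, new_text):
    """Return {option: (old, new)} for options whose values differ.

    None means the option does not appear in that config at all.
    """
    old, new = parse_config(old_text), parse_config(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(old.keys() | new.keys())
        if old.get(name) != new.get(name)
    }
```

Usage would be along the lines of `diff_configs(open("a.config").read(), open("b.config").read())`, where the two file names stand for the configs of any two of the kernels above.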

Let me know if any more info would be helpful.

Thanks!

wizeman commented 3 years ago

I've corrected my comments to indicate that I'm actually using 5.x.x kernels, not 4.x.x. Sorry for any possible confusion...

lategoodbye commented 3 years ago

I never worked with NixOS before. According to the RPi 4 instructions there are two possible images (generic or new kernel), which one do you use?

Does the issue also occur on SD card boot?

wizeman commented 3 years ago

> I never worked with NixOS before. According to the RPi 4 instructions there are two possible images (generic or new kernel), which one do you use?

I was using a customized image, which shares most of the configuration with NixOS's generic image, except it boots directly to the Linux kernel rather than using u-boot or the ARM stub. This is very similar to the configuration I use on my other Raspberry Pi 4s. First I tried using kernel 5.10.52 (which was working fine for me) but then I switched to 5.13.4 for debugging purposes. My kernel has a few config changes, most related to kernel hardening, which I've been using for years on other machines (including the other Raspberry Pis).

I've also tried downloading and booting NixOS's official generic aarch64 SD image (both on a USB disk and on an SD card) but it gets stuck on the rainbow screen, even though the exact same disk media boots and works fine on my identical but known-to-be-good RPi, using the exact same peripherals (disk media, HDMI adapter, power brick and ethernet cable).

> Does the issue also occur on SD card boot?

Yes.

Ok, so I've been trying to debug this for days and this is what I found out:

Disabling CONFIG_GPIO_RASPBERRY_EXP allows the kernel to continue booting instead of hanging.

However, the Firmware transaction timeout warning and stack trace still appear, and then the boot process gets stuck waiting for the root partition to appear because:

  1. The USB stack starts getting -110 errors as well, and no USB devices are detected, so when booting from a USB disk it doesn't become visible to the kernel.
  2. No SD card is detected, because apparently (and to me, unintuitively), the sdhci-iproc code (i.e. the emmc / SD card controller driver or what have you) only detects SD cards when CONFIG_GPIO_RASPBERRY_EXP is enabled.
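For readers unfamiliar with the numeric code in item 1: kernel functions return negated errno values, and 110 is ETIMEDOUT in the generic Linux errno table used on arm64. A minimal sketch, with a hand-copied partial table so that it does not depend on the host platform's own errno numbering; `decode_kernel_error` is a hypothetical helper name.

```python
# Partial copy of the generic Linux errno numbering (as used on arm64).
# Some other architectures number these differently.
LINUX_ERRNO = {
    2: "ENOENT",
    5: "EIO",
    19: "ENODEV",
    71: "EPROTO",
    110: "ETIMEDOUT",
}


def decode_kernel_error(code):
    """Translate a negative kernel return code such as -110 to its name."""
    number = -code if code < 0 else code
    return LINUX_ERRNO.get(number, "unknown (%d)" % number)


print(decode_kernel_error(-110))
```

So the USB errors are timeouts too, which is consistent with the firmware timeout seen earlier in the boot.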

I've verified that no SD card is detected when CONFIG_GPIO_RASPBERRY_EXP is disabled even on my known-to-be-good RPi, so it doesn't seem to be a problem specific to the troublesome one.

Note that, as long as CONFIG_GPIO_RASPBERRY_EXP is left enabled, all of the images I've built and booted either on a USB disk or on an SD card work perfectly fine on my known-to-be-good RPi, but the only images that work on my identical but troublesome RPi are the ones which have the Raspberry Pi kernel, for some mysterious reason.

Since I'm spending way too much of my time on this, at this point I'm just about ready to give up and stop using this recently purchased Raspberry Pi, even though it works with the official Raspberry Pi kernel, because I specifically bought it assuming I would be able to use the mainline kernel just like my other Raspberry Pi 4s...

lategoodbye commented 3 years ago

FWIW here is a short analysis from my side. This firmware transaction timeout is a warning which should never happen (it means no response from the VideoCore mailbox after 1 second). In most cases it's caused by a crash of the VideoCore firmware. So the mainline kernel is not to blame; it only triggers this issue for unknown reasons.
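The pattern behind such a warning can be illustrated in user space. This is only a sketch of a request that waits up to one second for a reply and reports a timeout otherwise; it is not the kernel driver's code, and `FakeMailbox` and `firmware_transaction` are hypothetical names.

```python
import threading

ETIMEDOUT = 110  # generic Linux errno value


class FakeMailbox:
    """Hypothetical stand-in for the mailbox: replies after a delay, or never."""

    def __init__(self, responds=True, delay=0.0):
        self.responds = responds
        self.delay = delay

    def send(self, request, on_reply):
        if self.responds:
            # Deliver the reply from another thread, like an interrupt would.
            threading.Timer(self.delay, on_reply, args=(request,)).start()
        # A crashed firmware is modelled by simply never calling on_reply.


def firmware_transaction(mailbox, request, timeout=1.0):
    """Send a request and wait for the reply; return 0 or -ETIMEDOUT."""
    done = threading.Event()
    mailbox.send(request, lambda _reply: done.set())
    if done.wait(timeout):
        return 0
    return -ETIMEDOUT


print(firmware_transaction(FakeMailbox(responds=True), b"ping"))
print(firmware_transaction(FakeMailbox(responds=False), b"ping", timeout=0.1))
```

In this model the caller cannot tell why the reply never arrived; it only sees the timeout, which is why the warning by itself does not say what went wrong on the firmware side.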

wizeman commented 3 years ago

I ended up buying a new, identical RPi, which doesn't have this problem anymore (just like my known-to-be-good one).

Eventually, I would like to bisect the kernels and see exactly which commit in the Raspberry Pi kernel seems to work around this issue, but this is very low priority for me and I'm not sure when I'll be able to do that.

Feel free to close this issue if you'd like.

Thanks!