raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/

RPi5 6.9+ NVME boot regression #6321

Open leezu opened 4 weeks ago

leezu commented 4 weeks ago

Describe the bug

With Linux 6.9 (c64bb7e634c8aa4ae9ebb059d6a030e48ae78f89), 6.10 (084962e99477691e0abe232731e3a24b14b05147) and 6.11 (a local rebase of the rpi-6.11.y branch onto 6.11-rc4), the initramfs on an RPi5 with an X1001 and an NVMe SSD fails to boot with the attached stack trace. The issue does not occur with 6.7 (777eaeedd460453f8f2a66643e7a4dea62286dd4) or 6.8 (ea34d5aaf9ff20566d8b13e3c63a4b1d0a86a147), which I built using the same procedure.

[Image: attached stack trace]

Steps to reproduce the behaviour

Build / install procedure on RPi OS 64 bit.

git clean -ffxd
KERNEL=kernel8 make bcm2711_defconfig
KBUILD_BUILD_TIMESTAMP='' KERNEL=kernel8 make -j4 bindeb-pkg
cd ..
sudo dpkg -i linux-headers-$KVER*.deb linux-image-$KVER*.deb
sudo cp /boot/vmlinuz-$KVER-v8+ /boot/firmware/kernel8.img.test
sudo cp /boot/initrd.img-$KVER-v8+ /boot/firmware/initramfs8.test
sudo reboot "0 tryboot"

with tryboot setup as

$ cat /boot/firmware/autoboot.txt
[all]
tryboot_a_b=1
[tryboot]
tryboot_a_b=0
$ cat /boot/firmware/tryboot.txt
# For more options and information see
# http://rptl.io/configtxt
# Some settings may impact device functionality. See link above for details

# Uncomment some or all of these to enable the optional hardware interfaces
#dtparam=i2c_arm=on
#dtparam=i2s=on
#dtparam=spi=on

# Enable audio (loads snd_bcm2835)
dtparam=audio=on

# Additional overlays and parameters are documented
# /boot/firmware/overlays/README

# Automatically load overlays for detected cameras
camera_auto_detect=1

# Automatically load overlays for detected DSI displays
display_auto_detect=1

# Automatically load initramfs files, if found
auto_initramfs=1

# Enable DRM VC4 V3D driver
dtoverlay=vc4-kms-v3d
max_framebuffers=2

# Don't have the firmware create an initial video= setting in cmdline.txt.
# Use the kernel's default instead.
disable_fw_kms_setup=1

# Run in 64-bit mode
arm_64bit=1

# Disable compensation for displays with overscan
disable_overscan=1

# Run as fast as firmware / board allows
arm_boost=1

[cm4]
# Enable host mode on the 2711 built-in XHCI USB controller.
# This line should be removed if the legacy DWC2 controller is required
# (e.g. for USB device mode) or if USB support is not required.
otg_mode=1

[cm5]
dtoverlay=dwc2,dr_mode=host

[all]
kernel=kernel8.img.test
initramfs initramfs8.test followkernel

Device(s)

Raspberry Pi 5

System

$ cat /etc/rpi-issue
Raspberry Pi reference 2024-07-04
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 48efb5fc5485fafdc9de8ad481eb5c09e1182656, stage2

$ sudo vcgencmd version
2024/07/30 15:25:46
Copyright (c) 2012 Broadcom
version 790da7ef (release) (embedded)

$ uname -a
Linux raspberrypi 6.8.12-v8+ #1 SMP PREEMPT Wed Aug 21 20:43:18 UTC 2024 aarch64 GNU/Linux

Logs

No response

Additional context

No response

leezu commented 1 week ago

@pelwell, any ideas on the cause of this regression, which starts with your Linux 6.9+ branches? Or suggestions on how to debug it?

pelwell commented 1 week ago

Based on the call stack and the register content, it's failing ungracefully while initialising one of RP1's clocks because another clock of that name already exists. This should never happen.

The list of RP1 clock names is:

pll_sys_core
pll_audio_core
pll_video_core
pll_sys
pll_audio
pll_video
pll_sys_pri_ph
pll_audio_pri_ph
pll_video_pri_ph
pll_sys_sec
pll_audio_sec
pll_video_sec
pll_audio_tern
clk_sys
clk_slow_sys
clk_uart
clk_eth
clk_pwm0
clk_pwm1
clk_audio_in
clk_audio_out
clk_i2s
clk_mipi0_cfg
clk_mipi1_cfg
clk_eth_tsu
clk_adc
clk_sdio_timer
clk_sdio_alt_src
clk_gp0
clk_gp1
clk_gp2
clk_gp3
clk_gp4
clk_gp5
clk_vec
clk_dpi
clk_mipi0_dpi
clk_mipi1_dpi
clksrc_mipi0_dsi_byteclk
clksrc_mipi1_dsi_byteclk
  1. Do any of those match a clock that might be created by some other part of your system, perhaps under the control of an overlay?

  2. Have you tried using the same kernels on (say) an RPi OS image without the initramfs?

Note that the RP1 clock driver has not changed between rpi-6.6.y and rpi-6.11.y, so there is no immediately obvious reason why it has started failing for you.

By the way, I recommend getting some kind of serial cable (e.g. our Debug Probe) as a better way to debug boot failures.

leezu commented 5 days ago

@pelwell Thank you for taking a look.

> Do any of those match a clock that might be created by some other part of your system, perhaps under the control of an overlay?

There are no files beginning with pll or clk in /boot/overlays/. I haven't modified the overlays AFAIK. Please let me know how I can check for a "clock that might be created by some other part of your system".
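One possible way to look for such a clock, sketched here under assumptions rather than as an established procedure: the common clock framework exposes one directory per registered clock under debugfs. The snippet assumes debugfs is mounted at /sys/kernel/debug, that the kernel was built with debugfs support, and that it is run as root. The helper names and the shortened name list are illustrative only. It reports which names are currently registered, not which driver registered them, so on a normally booted system RP1's own clocks will show up too.

```python
import os

# Illustrative subset of the RP1 clock names listed earlier in this thread.
RP1_CLOCKS = [
    "clk_sys",
    "clk_uart",
    "clksrc_mipi0_dsi_byteclk",
    "clksrc_mipi1_dsi_byteclk",
]

# Assumed debugfs location of the common clock framework entries.
DEFAULT_CLK_DIR = "/sys/kernel/debug/clk"


def registered_clocks(debug_dir=DEFAULT_CLK_DIR):
    """Return the names of all clocks that have a directory under debug_dir."""
    return {entry.name for entry in os.scandir(debug_dir) if entry.is_dir()}


def already_registered(names, debug_dir=DEFAULT_CLK_DIR):
    """Return the subset of names (in input order) that is already registered."""
    present = registered_clocks(debug_dir)
    return [name for name in names if name in present]


if __name__ == "__main__":
    try:
        for name in already_registered(RP1_CLOCKS):
            print(name)
    except (FileNotFoundError, PermissionError) as exc:
        print(f"cannot read clock debugfs: {exc}")
```

The clk_summary file in the same directory shows the clock tree, which may hint at where an unexpected clock sits in the hierarchy.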

> Have you tried using the same kernels on (say) an RPi OS image without the initramfs?

Yes, in the past I was able to run kernels 6.9 and newer without issues, albeit while booting from the SD card and without an initramfs. Today I also validated that the initramfs boots successfully from the SD card: I placed the 6.11-rc7 kernel and initramfs on the SD card's boot partition, specified them in the SD card's config.txt via the kernel and initramfs options, and adjusted the SD card's cmdline.txt to reference the NVMe partition as root. With that setup the initramfs mounts the NVMe root and the system boots up.

To summarize, the same initramfs boots correctly when placed on the SD card, but fails with the above issue when placed on the NVMe drive. For Linux <6.9, the initramfs works fine even when placed on the NVMe drive.

Please let me know if I can help provide further information for debugging this.

pelwell commented 4 days ago

Pull request #6363 adds some kernel logging to the RP1 clock driver, and hopefully stops it crashing when registration fails. After about 40 minutes you should be able to run sudo rpi-update pulls/6363 to install a trial kernel - let me know what happens when you try running it.

leezu commented 3 days ago

Thank you, @pelwell. I've applied the patch from #6363 locally and rebuilt the kernel. Your PR targets rpi-6.6.y, which does not exhibit the problem in the first place, so I doubt that rpi-update would get the right kernel. Do you publish builds for all branches? Below is the debug output when booting from NVMe. The boot does succeed with your patch, though the fan speed controller seems broken (the fan remains at full speed after boot). Booting with the previously mentioned SD card workaround does not exhibit the fan speed issue.

[    2.566483] Registered RP1 clock 'pll_sys_core'
[    2.571443] Registered RP1 clock 'pll_audio_core'
[    2.576556] Registered RP1 clock 'pll_video_core'
[    2.581654] Registered RP1 clock 'pll_sys'
[    2.586128] Registered RP1 clock 'pll_audio'
[    2.590781] Registered RP1 clock 'pll_video'
[    2.595421] Registered RP1 clock 'pll_sys_pri_ph'
[    2.600503] Registered RP1 clock 'pll_audio_pri_ph'
[    2.605767] Registered RP1 clock 'pll_sys_sec'
[    2.610584] Registered RP1 clock 'pll_audio_sec'
[    2.615581] Registered RP1 clock 'pll_video_sec'
[    2.620569] Registered RP1 clock 'clk_sys'
[    2.625039] Registered RP1 clock 'clk_slow_sys'
[    2.629940] Registered RP1 clock 'clk_uart'
[    2.634504] Registered RP1 clock 'clk_eth'
[    2.638976] Registered RP1 clock 'clk_pwm0'
[    2.643538] Registered RP1 clock 'clk_pwm1'
[    2.648100] Registered RP1 clock 'clk_audio_in'
[    2.653003] Registered RP1 clock 'clk_audio_out'
[    2.657994] Registered RP1 clock 'clk_i2s'
[    2.662448] Registered RP1 clock 'clk_mipi0_cfg'
[    2.667425] Registered RP1 clock 'clk_mipi1_cfg'
[    2.672398] Registered RP1 clock 'clk_eth_tsu'
[    2.677205] Registered RP1 clock 'clk_adc'
[    2.681660] Registered RP1 clock 'clk_sdio_timer'
[    2.686713] Registered RP1 clock 'clk_sdio_alt_src'
[    2.691957] Registered RP1 clock 'clk_gp0'
[    2.696399] Registered RP1 clock 'clk_gp1'
[    2.700864] Registered RP1 clock 'clk_gp2'
[    2.705294] Registered RP1 clock 'clk_gp3'
[    2.709718] Registered RP1 clock 'clk_gp4'
[    2.714140] Registered RP1 clock 'clk_gp5'
[    2.718552] Registered RP1 clock 'clk_vec'
[    2.722964] Registered RP1 clock 'clk_dpi'
[    2.727365] Registered RP1 clock 'clk_mipi0_dpi'
[    2.732294] Registered RP1 clock 'clk_mipi1_dpi'
[    2.737204] Registered RP1 clock 'pll_video_pri_ph'
[    2.742640] Registered RP1 clock 'pll_audio_tern'
[    2.747605] Failed to register RP1 clock 'clksrc_mipi0_dsi_byteclk' - err -17
[    2.755021] Failed to register RP1 clock 'clksrc_mipi1_dsi_byteclk' - err -17
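For reference, error 17 is EEXIST ("File exists"), which matches the earlier diagnosis that a clock of the same name already exists. A minimal sketch for pulling the failed registrations out of a log like the one above and decoding the error number follows. The message format is taken from the log lines themselves, the function name is illustrative, and the sample text is only an excerpt of the full log.

```python
import errno
import re

# Excerpt copied from the log above; the remaining lines have the same shape.
LOG = """\
[    2.742640] Registered RP1 clock 'pll_audio_tern'
[    2.747605] Failed to register RP1 clock 'clksrc_mipi0_dsi_byteclk' - err -17
[    2.755021] Failed to register RP1 clock 'clksrc_mipi1_dsi_byteclk' - err -17
"""

FAILED = re.compile(r"Failed to register RP1 clock '([^']+)' - err (-?\d+)")


def failed_clocks(log_text):
    """Return (clock_name, errno_symbol) for every failed registration."""
    results = []
    for match in FAILED.finditer(log_text):
        name, code = match.group(1), int(match.group(2))
        # Kernel functions return negative errno values, so negate first.
        symbol = errno.errorcode.get(-code, "unknown")
        results.append((name, symbol))
    return results


if __name__ == "__main__":
    for name, symbol in failed_clocks(LOG):
        print(f"{name}: {symbol}")
```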

I'm not sure what the clksrc clocks are and whether they are expected to show up in the devicetree on the booted system, but dtc -I fs /sys/firmware/devicetree/base | grep clksrc | wc -l prints 0 when executed on the kernel booted with the sdcard workaround. Grepping for clk yields 44.