raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/

PCI Express regression on CM4 / CM4 IO Board on stable_20231004 (6.1.54, Bookworm) #5659

Open julienrobin28 opened 11 months ago

julienrobin28 commented 11 months ago

Describe the bug

Hi,

I switched from Raspberry Pi OS version 11 to version 12 on my Compute Module 4 IO Board, and noticed the SATA PCIe adapter I'm using now causes an almost consistent crash while booting the OS, when PCIe starts running.

The issue is encountered with the pcieport driver, on the 6.1.0-rpi4-rpi-v8 kernel (6.1.54).

After a lot of retries, it booted once with the SATA card present without panicking, so I captured the "dmesg" output. Here are the relevant lines:

[    1.406468] pcieport 0000:00:00.0: enabling device (0000 -> 0002)
[    1.406725] pcieport 0000:00:00.0: PME: Signaling with IRQ 31
[    1.407266] pcieport 0000:00:00.0: AER: enabled with IRQ 31
[    1.407983] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    1.408041] pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[    1.408071] pcieport 0000:00:00.0:   device [14e4:2711] error status/mask=00000080/00002000
[    1.408103] pcieport 0000:00:00.0:    [ 7] BadDLLP               
[    1.408239] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    1.408288] pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[    1.408316] pcieport 0000:00:00.0:   device [14e4:2711] error status/mask=00000080/00002000
[    1.408345] pcieport 0000:00:00.0:    [ 7] BadDLLP               
[    1.408720] simple-framebuffer 3e3cf000.framebuffer: framebuffer at 0x3e3cf000, 0x7f8000 bytes
[    1.408728] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    1.408754] simple-framebuffer 3e3cf000.framebuffer: format=a8r8g8b8, mode=1920x1080x32, linelength=7680
[    1.408799] pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[    1.408837] pcieport 0000:00:00.0:   device [14e4:2711] error status/mask=00000040/00002000
[    1.408866] pcieport 0000:00:00.0:    [ 6] BadTLP                
[    1.408904] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    1.408941] pcieport 0000:00:00.0: AER: can't find device of ID0000
[    1.409039] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[    1.409069] pcieport 0000:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[    1.409084] pcieport 0000:00:00.0:   device [14e4:2711] error status/mask=00000080/00002000
[    1.409099] pcieport 0000:00:00.0:    [ 7] BadDLLP               
[    1.409863] pcieport 0000:00:00.0: AER: Corrected error received: 0000:00:00.0
[...]
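For readers who want to decode the `error status/mask=` values in lines like these, here is a minimal Python sketch. The bit positions are those of the PCIe AER Correctable Error Status register and the short names are the ones the Linux AER driver prints; the helper names `decode_correctable` and `parse_status_line` are illustrative and not part of any tool mentioned in this thread.

```python
# Bit positions of the PCIe AER Correctable Error Status register,
# labelled with the short names the Linux AER driver prints in dmesg.
AER_CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}


def decode_correctable(value_hex: str) -> list[str]:
    """Return the names of the bits set in a status (or mask) value."""
    value = int(value_hex, 16)
    return [name for bit, name in sorted(AER_CORRECTABLE_BITS.items())
            if value & (1 << bit)]


def parse_status_line(line: str) -> tuple[list[str], list[str]]:
    """Decode an '... error status/mask=XXXXXXXX/YYYYYYYY' dmesg line."""
    fields = line.rsplit("=", 1)[1].strip()
    status, mask = fields.split("/")
    return decode_correctable(status), decode_correctable(mask)
```

Applied to the log above, a status of `00000080` decodes to BadDLLP and `00000040` to BadTLP, which matches the names printed on the following lines; the mask `00002000` corresponds to the advisory non-fatal bit.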

However, when the kernel panic occurs, the rootfs doesn't seem to be ready, so nothing is logged to it (so here is an old-fashioned picture of my screen, attached: IMG_20231018_014932).

When the SATA card is missing, no error occurs.

I tried putting the 6.1.21 kernel (and associated modules, overlays, dtb) from Raspberry Pi OS 11 back onto my new Raspberry Pi OS 12 installation, and everything went back to working as before, confirming the issue is only about the kernel.

Having the "initramfs" loaded or not has no effect.

Comparing the kernel config from previous kernel and new kernel shows that the new kernel config is now enabling PCIe related additional features, at least the following options:

When rebuilding the new 6.1.54 kernel with the old 6.1.21 kernel config, it works fine (no "pcieport" driver issue as it isn't enabled in the kernel config)
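A comparison like the one described above can be reproduced without a graphical diff tool. The following is a minimal Python sketch, assuming both kernel configs are available as text in the usual `.config` syntax (`CONFIG_FOO=y` and `# CONFIG_FOO is not set`). The function names and the `CONFIG_PCIE` prefix filter are illustrative choices, not something taken from the original report.

```python
import re

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text: str) -> dict[str, str]:
    """Map each option in a kernel .config to its value ('n' if 'is not set')."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_configs(old_text: str, new_text: str, prefix: str = "CONFIG_PCIE") -> dict:
    """Return {option: (old_value, new_value)} for options that differ."""
    old, new = parse_config(old_text), parse_config(new_text)
    changes = {}
    for name in sorted(set(old) | set(new)):
        if not name.startswith(prefix):
            continue
        # 'absent' means the option does not appear in that file at all.
        before = old.get(name, "absent")
        after = new.get(name, "absent")
        if before != after:
            changes[name] = (before, after)
    return changes
```

Reading the two files with `open(...).read()` and passing the contents to `diff_configs` lists only the PCIe-related options whose values changed between the two builds.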

However, I'm not sure why this "pcieport" issue occurs, and unfortunately I currently don't have any other PCIe card to try.

Steps to reproduce the behaviour

Boot the new Raspberry Pi OS 12 on a Compute Module 4 IO Board, with the following PCIe card:

SATA controller [0106]: ASMedia Technology Inc. ASM1166 Serial ATA Controller [1b21:1166] (rev 02)

(I don't know if the issue also occurs with other PCIe cards).

Device(s)

Raspberry Pi CM4

System

Raspberry Pi reference 2023-10-10 Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 962bf483c8f326405794827cce8c0313fd5880a8, stage2

Aug 10 2023 15:33:38 Copyright (c) 2012 Broadcom version 03dc77429335caee083e22ddc8eec09c07f12a7a (clean) (release) (start)

Linux crobe-server-coudray 6.1.0-rpi4-rpi-v8 #1 SMP PREEMPT Debian 1:6.1.54-1+rpt2 (2023-10-05) aarch64 GNU/Linux

Logs

PCI-Express-BUG-6.1.54-dmesg.txt PCI-Express-BUG-6.1.54-lspci-nn-vvv.txt PCI-Express-OK-6.1.21-lspci-nn-vvv.txt

Additional context

EDIT from 2023-10-18 in the evening:

I found a workaround that avoids the kernel panics without changing the kernel, by looking at the command-line parameters available for Linux kernel 6.1.

After having added pcie_aspm=off to /boot/firmware/cmdline.txt, I don't have kernel panics anymore. However dmesg messages about PCIe Bus Error and AER are still shown, unless pcie_ports=compat is added too.

Adding pcie_ports=compat alone, however, does not avoid kernel panics (it just removes the dmesg messages about PCIe Bus Error and AER).
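As a hedged illustration of this workaround, the sketch below appends the two parameters to an existing command line while keeping everything on a single, space-separated line. The function name `add_kernel_params` is hypothetical, and the sketch assumes none of the existing parameters contain quoted spaces.

```python
def add_kernel_params(cmdline: str, params: list[str]) -> str:
    """Return a one-line kernel command line with any missing params appended."""
    # split() with no argument also folds stray newlines into one line.
    tokens = cmdline.split()
    for param in params:
        if param not in tokens:
            tokens.append(param)
    return " ".join(tokens) + "\n"


def patch_file(path: str, params: list[str]) -> None:
    """Rewrite a cmdline.txt in place (keep a backup copy before using this)."""
    with open(path, encoding="utf-8") as handle:
        original = handle.read()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(add_kernel_params(original, params))
```

On the system described here, the call would be `patch_file("/boot/firmware/cmdline.txt", ["pcie_aspm=off", "pcie_ports=compat"])`, run as root, followed by a reboot.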

Hoping this report may help,

Best regards

julienrobin28 commented 11 months ago

Hello again there,

Small update:

kernel-panic-1.txt kernel-panic-2.txt kernel-panic-3.txt kernel-panic-4.txt kernel-panic-5.txt

I now seem to have a working configuration with the workarounds. I'll keep you informed if anything else comes up, and I remain available for any additional tests or information you may need.

kernel-working-log.txt

julienrobin28 commented 11 months ago

Hello, it's me again

I tried some other PCIe cards to check everything. There is no issue with a VL805 PCIe card; however, I found out that this regression also affects the following PCIe card:

01:00.0 USB controller [0c03]: Renesas Technology Corp. uPD720201 USB 3.0 Host Controller [1912:0014] (rev 03)

Even though the bootloader is able to use the card for booting, the kernel won't boot or print anything unless pcie_aspm=off is added.

Serial output without pcie_aspm=off:

RPi: BOOTLOADER release VERSION:4fd8f1f3 DATE: 2023/05/11 TIME: 07:26:03
BOOTMODE: 0x06 partition 0 build-ts BUILD_TIMESTAMP=1683786363 serial fb2c5f5d boardrev d03141 stc 476649
PM_RSTS: 0x00001000
part 00000000 reset_info 00000000
uSD voltage 3.3V
Initialising SDRAM 'Micron' 32Gb x2 total-size: 64 Gbit 3200
DDR 3200 1 0 64 152

Boot mode: USB-MSD (04) order f2156
XHCI-STOP
xHC ver: 256 HCS: 08000820 24000011 00000000 HCC: 014051cf
USBSTS 801
xHC ver: 256 HCS: 08000820 24000011 00000000 HCC: 014051cf
xHC ports 8 slots 32 intrs 8
USB2[8] 000202e1 connected
USB2[8] 00200e03 connected enabled
USB2 root HUB port 8 init
xHC-CMD err: 4 type: 11 [01:00] 0.00 000000:08
   EVT (33   1) 10 20 f7 3f 00 00 00 00 00 00 00 04 01 84 00 01
   CMD (11   1) 00 50 f7 3f 00 00 00 00 00 00 00 00 01 2c 00 01
SLOT IN
00 00 30 08 00 00 08 00
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
SLOT OUT
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
EP0 CTX
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00
DEV [01:00] 2.00 000000:08 class 0 VID 090c PID 1000
MSD device [01:00] 2.00 000000:08 conf 0 iface 0 ep 81#512 02#512
MSD [01:00] 2.00 000000:08 register MSD
MSD [01:00] 2.00 000000:08 LUN 0
MSD INQUIRY [01:00] 2.00 000000:08
MSD [01:00] 2.00 000000:08 lun 0 block-count 7913472 block-size 512
MBR: 0x00002000, 1048576 type: 0x0c
MBR: 0x00000000,       0 type: 0x00
MBR: 0x00000000,       0 type: 0x00
MBR: 0x00000000,       0 type: 0x00
Trying partition: 0
type: 32 lba: 8192 oem: 'mkfs.fat' volume: ' bootfs     '
rsc 32 fat-sectors 2040 c-count 261116 c-size 4
root dir cluster 2 sectors 0 entries 0
FAT32 clusters 261116
Trying partition: 0
type: 32 lba: 8192 oem: 'mkfs.fat' volume: ' bootfs     '
rsc 32 fat-sectors 2040 c-count 261116 c-size 4
root dir cluster 2 sectors 0 entries 0
FAT32 clusters 261116
Read config.txt bytes     1292 hnd 0x1157b
Read start4.elf bytes  2254208 hnd 0x4ca5
Read fixup4.dat bytes     5403 hnd 0x129
0x00d03141 0x00000000 0x00001fff
MEM GPU: 76 ARM: 947 TOTAL: 1023
Firmware: 03dc77429335caee083e22ddc8eec09c07f12a7a Aug 10 2023 15:33:38
Starting start4.elf @ 0xfeb00200 partition 0
+

MESS:00:00:03.523186:0: USB boot mode 2
MESS:00:00:03.534716:0: brfs: File read: /mfs/sd/config.txt
MESS:00:00:03.537909:0: brfs: File read: 1292 bytes
MESS:00:00:03.563059:0: HDMI0:EDID error reading EDID block 0 attempt 0
MESS:00:00:03.567566:0: HDMI0:EDID giving up on reading EDID block 0
MESS:00:00:03.584538:0: HDMI1:EDID error reading EDID block 0 attempt 0
MESS:00:00:03.589039:0: HDMI1:EDID giving up on reading EDID block 0
MESS:00:00:03.595586:0: brfs: File read: /mfs/sd/config.txt
MESS:00:00:03.600105:0: gpioman: gpioman_get_pin_num: pin DISPLAY_SDA not defined
MESS:00:00:03.606619:0: gpioman: gpioman_get_pin_num: pin LEDS_PWR_OK not defined
MESS:00:00:03.778199:0: gpioman: gpioman_get_pin_num: pin FLASH_0_ENABLE not defined
MESS:00:00:03.782831:0: gpioman: gpioman_get_pin_num: pin FLASH_0_INDICATOR not defined
MESS:00:00:03.790572:0: gpioman: gpioman_get_pin_num: pin FLASH_0_ENABLE not defined
MESS:00:00:03.798022:0: gpioman: gpioman_get_pin_num: pin FLASH_0_INDICATOR not defined
MESS:00:00:04.103724:0: gpioman: gpioman_get_pin_num: pin LEDS_PWR_OK not defined
MESS:00:00:04.109368:0: *** Restart logging
MESS:00:00:04.112008:0: brfs: File read: 1292 bytes
MESS:00:00:04.122072:0: hdmi: HDMI0:EDID error reading EDID block 0 attempt 0
MESS:00:00:04.127102:0: hdmi: HDMI0:EDID giving up on reading EDID block 0
MESS:00:00:04.137726:0: hdmi: HDMI0:EDID error reading EDID block 0 attempt 0
MESS:00:00:04.142748:0: hdmi: HDMI0:EDID giving up on reading EDID block 0
MESS:00:00:04.148346:0: hdmi: HDMI:hdmi_get_state is deprecated, use hdmi_get_display_state instead
MESS:00:00:04.162139:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 0
MESS:00:00:04.167164:0: hdmi: HDMI1:EDID giving up on reading EDID block 0
MESS:00:00:04.177786:0: hdmi: HDMI1:EDID error reading EDID block 0 attempt 0
MESS:00:00:04.182816:0: hdmi: HDMI1:EDID giving up on reading EDID block 0
MESS:00:00:04.188413:0: hdmi: HDMI:hdmi_get_state is deprecated, use hdmi_get_display_state instead
MESS:00:00:04.197178:0: HDMI0: hdmi_pixel_encoding: 300000000
MESS:00:00:04.202646:0: HDMI1: hdmi_pixel_encoding: 300000000
MESS:00:00:05.184105:0: brfs: File read: /mfs/sd/initramfs8
MESS:00:00:05.186570:0: Loaded 'initramfs8' to 0x0 size 0xa96880
MESS:00:00:05.201217:0: initramfs loaded to 0x2e569000 (size 0xa96880)
MESS:00:00:05.204651:0: gpioman: gpioman_get_pin_num: pin CAMERA_0_I2C_PORT not defined
MESS:00:00:05.216623:0: dtb_file 'bcm2711-rpi-cm4.dtb'
MESS:00:00:05.218655:0: brfs: File read: 11102336 bytes
MESS:00:00:05.229483:0: brfs: File read: /mfs/sd/bcm2711-rpi-cm4.dtb
MESS:00:00:05.232728:0: Loaded 'bcm2711-rpi-cm4.dtb' to 0x100 size 0xd764
MESS:00:00:05.252840:0: brfs: File read: 55140 bytes
MESS:00:00:05.269663:0: brfs: File read: /mfs/sd/overlays/overlay_map.dtb
MESS:00:00:05.304016:0: brfs: File read: 4743 bytes
MESS:00:00:05.308217:0: brfs: File read: /mfs/sd/config.txt
MESS:00:00:05.311209:0: dtparam: audio=on
MESS:00:00:05.320011:0: brfs: File read: 1292 bytes
MESS:00:00:05.342466:0: brfs: File read: /mfs/sd/overlays/vc4-kms-v3d-pi4.dtbo
MESS:00:00:05.408737:0: Loaded overlay 'vc4-kms-v3d'
MESS:00:00:05.585461:0: brfs: File read: 3913 bytes
MESS:00:00:05.589297:0: brfs: File read: /mfs/sd/cmdline.txt
MESS:00:00:05.592629:0: Read command line from file 'cmdline.txt':
MESS:00:00:05.598506:0: 'console=serial0,115200 console=tty1 root=PARTUUID=7788c428-02 rootfstype=ext4 fsck.repair=yes rootwait'
MESS:00:00:05.719223:0: brfs: File read: 103 bytes
MESS:00:00:06.442941:0: brfs: File read: /mfs/sd/kernel8.img
MESS:00:00:06.445498:0: Loaded 'kernel8.img' to 0x80000 size 0x852951
MESS:00:00:07.701642:0: Kernel relocated to 0x200000
MESS:00:00:07.703491:0: Device tree loaded to 0x2e55b300 (size 0xdc16)
MESS:00:00:07.711670:0: uart: Set PL011 baud rate to 103448.300000 Hz
MESS:00:00:07.718820:0: uart: Baud rate change done...
MESS:00:00:07.720838:0:

But as mentioned above, as soon as the option is added, the system boots fine (and the PCIe card works).

[...]
MESS:00:00:05.522353:0: 'console=serial0,115200 console=tty1 root=PARTUUID=9ce70122-02 rootfstype=ext4 fsck.repair=yes rootwait pcie_aspm=off'
MESS:00:00:05.644278:0: brfs: File read: 117 bytes
MESS:00:00:06.164898:0: brfs: File read: /mfs/sd/kernel8.img
MESS:00:00:06.167456:0: Loaded 'kernel8.img' to 0x80000 size 0x852951
MESS:00:00:07.423820:0: Kernel relocated to 0x200000
MESS:00:00:07.425671:0: Device tree loaded to 0x2e55b300 (size 0xdc22)
MESS:00:00:07.433834:0: uart: Set PL011 baud rate to 103448.300000 Hz
MESS:00:00:07.441001:0: uart: Baud rate change done...
MESS:00:00:07.443020:0:[    0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd083]
[    0.000000] Linux version 6.1.0-rpi4-rpi-v8 (debian-kernel@lists.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT Debian 1:6.1.54-1+rpt2 (2023-10-05)
[    0.000000] random: crng init done
[    0.000000] Machine model: Raspberry Pi Compute Module 4 Rev 1.1
[    0.000000] efi: UEFI not found.
[    0.000000] Reserved memory: created CMA memory pool at 0x000000000e400000, size 512 MiB
[    0.000000] OF: reserved mem: initialized node linux,cma, compatible id shared-dma-pool
[...]

Note about ASM1166 firmware: just in case, I upgraded the firmware of my ASM1166 SATA PCIe adapter (from 200529-000D-02 to 211108-0000-00), but this doesn't solve this regression. So I'll keep using the cmdline.txt workaround!

aimbotbob commented 10 months ago

Hi, I have done the same workaround(s) as you have here. I am still getting the bus errors; however, as of right now I have not had any kernel panics, which I should have had by now, and 95% of the errors are being corrected. I'll be honest, I am in way over my head at this point, but I am wondering: did you experience this? Can I just go with the ignorance-is-bliss approach here if it isn't crashing?

This problem has been driving me crazy all week now. At this point I'm ready to just ignore the bus errors if it appears to be writing data correctly.

It may be worth mentioning that I am using a 6-port card with an ASM1166 chip as well as a USB SATA card with a VL805? chip, running through a PCIe switch. I believe these are functioning perfectly, as I have no problems without the SATA card.

EDIT: I'm a fool, I didn't add the fix to the first line in /boot/firmware/cmdline.txt. I didn't understand it had to be on the first line. All seems good for now that I have added the fix properly.

julienrobin28 commented 10 months ago

Hi @aimbotbob,

Fortunately, with the workaround parameters in cmdline.txt, once the system is up I have absolutely no problem transferring data over PCIe with a 3-disk RAID array, even after 36 TB of cumulative reads. No problem while writing either, even if I didn't write as much (a few hundred GB). I also have the rootfs running through the SATA card on an SSD (possible since Bookworm thanks to the initramfs containing the dynamically loaded modules required to access the SATA card, so rebuilding the kernel with custom options isn't needed anymore to do so). I'm using it as a 24/7 NAS with a WebDAV server and Syncthing running on it.

However I still had a startup crash once (on a reboot). By chance, because I was paranoid enough I still had the UART reader attached and got the kernel log 2023-10-24-startup-crash-despite-workaround.txt.

So I'm still hoping for something that would solve these PCIe issues anyway; for now, I had to disable automatic reboots in my unattended-upgrades config file just in case, so that I only reboot the NAS while I'm physically able to do a power cycle in case the reboot goes wrong 😩 Considering releasing such a device is kinda unthinkable at this point... 😞

aimbotbob commented 10 months ago

@julienrobin28 If it makes you feel any better, I had mine lock up last night, and there is nothing in the logs to indicate what caused it either (apart from lsyncd starting several processes). I have been using lsyncd to back up my files, and I believe this may exacerbate the issue, so I have limited its processes to 1 at a low bandwidth.

I have got another sata card knocking about on another chipset, which one that is i couldn't tell you off the top of my head but i have had trouble to get it powered up by the pi. I will be ordering some powered pci risers today that should (fingers crossed) fit in the enclosure i am using, would you like me to keep you posted to see if things are more stable on another chipset?

AhnFire commented 5 months ago

Sorry to drag this up, I'm getting the same with the Waveshare CM4 PCIe SATA adapter I am using. Is the workaround simply to add the two entries to /boot/firmware/cmdline.txt? pcie_aspm=off pcie_port_pm=off

I can navigate around tech, but I have never done something like recompiling a kernel. Sorry to drag this old thread up from the dust!

julienrobin28 commented 5 months ago

@AhnFire No problem, you're welcome, this thread is still open (I'm even still hoping that something ends up by solving this issue!)

For the related workaround, yes you can apply it by adding these parameters to the cmdline.txt file, in addition to the already existing parameters. Those are passed to the kernel at next boot (or next reboot), by the bootloader (before the kernel is running), so that the kernel starts having already received those parameters. It is applied without having to recompile the kernel.

By the way, in case you need it some day: Recompiling a kernel is only necessary when you want to enable or change something that cannot be changed at startup (cmdline) or dynamically (once the system is up and running), or even when you want to do your own changes to a driver for example (be it as simple as adding debugging messages to dmesg for example, when you really want to track or understand something), or just rebuild the same kernel but with slight changes into the kernel build config (after browsing the interactive menuconfig for example, to add disabled drivers, turn a dynamically loadable driver to a static always loaded driver, change some default behavior, etc).

Some interesting and very complete information is available here in case you want to do it some day: https://www.raspberrypi.com/documentation/computers/linux_kernel.html. But on the day you make your first kernel build, ensure you have enough time to take a calm and deep look, to learn everything from it and overcome possible issues if you end up doing something unusual. A few years ago I felt unable to do it, but with time and work it can be learned fine.

pelwell commented 5 months ago

But on the day you make your first kernel build, ensure you have enough time to take a calm and deep look to learn everything from it and overcome possible issues if you end up doing something unusual.

I'd almost say the opposite - build your first kernel before you need to, using one of the Pi defconfig files and making no changes, just to get it out of the way and so you know that if it doesn't work that it is less likely to be your fault. Once you've got that working, feel free to change stuff.

AhnFire commented 5 months ago

Great, thank you both! :) Looking forward to trying out the workaround for now so I can get my little mini-NAS project going.

AhnFire commented 5 months ago

Update: At first I did not have a lot of success. With the Waveshare PCIE SATA card in, nearly every time I got a kernel panic or a blank screen (Raspberry Pi OS Bookworm, 64-bit lite).

After correcting a formatting error (I was using comma-separated entries after reading the existing entries incorrectly) and also noting there was a 3rd parameter, I had about 10 reboots that went smoothly, including 1 power down and cold boot. This is my cmdline.txt entry, if this helps anyone:

pcie_ports=compat pcie_aspm=off pcie_port_pm=off console=serial0,115200 console=tty1 root=PARTUUID=2173db26-02 rootfstype=ext4 fsck.repair=yes rootwait
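Two formatting mistakes came up in this thread: parameters placed on a second line, and parameters joined with commas instead of spaces. The sketch below is a rough lint for both. The comma check is only a heuristic (it flags a token when more than one comma-separated piece contains `=`, so `console=serial0,115200` passes), and the function name `lint_cmdline` is made up for this example.

```python
def lint_cmdline(text: str) -> list[str]:
    """Return human-readable warnings about a cmdline.txt's formatting."""
    problems = []
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) > 1:
        problems.append("parameters are spread over more than one line")
    for token in text.split():
        # Heuristic: several 'key=value' pieces glued together by commas.
        pieces = [piece for piece in token.split(",") if "=" in piece]
        if len(pieces) > 1:
            problems.append(f"'{token}' looks like comma-joined parameters")
    return problems
```

An empty list means neither problem was detected; it does not prove that the kernel accepted the parameters.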

With the first couple of cold boots, it still went into kernel panic, but then it seemed to stabilize, I don't understand why. I have a suspicion about doing sudo shutdown now, maybe for some reason, it does not like it. I don't really understand the problem, but maybe it has to do with how power gets to the card?

Question, are others finding things more stable with Bullseye over Bookworm? Before I saw this workaround, I tried Bullseye, but I was still getting kernel panics. I have not tried since re-flashing my emmc with Bookworm and using this workaround.

Does this workaround reduce the performance of the PCI port, does anyone know?

AhnFire commented 5 months ago

Reading a little about the parameters we had to set in order to get a stable boot, I think I will lose a lot of the features that this board is supposed to give.

Onboard SATA host controller (AHCI) with upstream PCIe Gen3 x1 and downstream four SATA Gen3 ports. It's a low latency, low cost and low power AHCI controller. With four SATA ports and cascaded port multipliers, it can enable users to build up various high-speed IO systems, including server, high-capacity system storage or surveillance platforms

- Supports 1-ch PCI Express
- Four power-saving modes of L0s/L1/L23/L3
- Support L1 sub-state deep power saving mode
- Supports SRIS, AER, LTR
- Supports SATA LED
- Supports NCQ & AHCI SPEC 1.4
- Four SATA3.0 (6Gbps) ports
- Supports the switching based on the port multiplier commands
- Supports SATA Partial/Dormancy power management

https://www.waveshare.com/pcie-to-sata-4p.htm https://www.waveshare.com/wiki/PCIe-TO-SATA-4P

I wonder if this is relevant to my card (using the same chipset). This points to a general Linux MR.

https://forums.unraid.net/bug-reports/stable-releases/6129-kernel-does-not-recognize-sata-ports-on-port-multipliers-r2940/

peterdey commented 5 months ago

I'm experiencing the same issue with a ASMedia ASM1064 SATA adapter: Kernel panic with "Asynchronous SError Interrupt".

I was unsure whether this is related to upstream kernel bug #217276. I tried applying Jim Quinlan's patchset (v9), but this did not reduce the probability of a kernel panic.

Chaos02 commented 4 months ago

Greetings, I'm trying to use a waveshare (Asrock) pcie switch with my cm4 but I seem to be unable to get it to work:

with the cmdline.txt options mentioned above, I only get varying success, instead of always the same message at the same point: sometimes it loads into plymouth with loaded fonts etc, sometimes it gets to check the eMMC, sometimes it fails before that.

here is the kernel config I created with make menuconfig.

peterdey commented 4 months ago

Confirming that the source of my kernel panics seems to be the config changes made in 9cfb379147f803b0362b0fe249e5b145d232bea3.

No issues at all running 406e7dc82be6ce1b81c88b418640daeef6c2be42; but with 9cfb379147f803b0362b0fe249e5b145d232bea3, I get a panic about 3 out of 4 boots.

timg236 commented 4 months ago

Does adding "pcie_aspm=off" in cmdline.txt resolve the issue?

peterdey commented 4 months ago

Adding pcie_aspm=off to cmdline.txt seems to fix it for me. 10 reboots, even with the latest 6.6.31, not a single panic yet.
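To put "10 reboots, not a single panic" in perspective, here is a small worked calculation. It assumes each boot panics independently with the same probability and takes the "about 3 out of 4 boots" figure reported above for the unpatched configuration as that rate; both are simplifying assumptions, not measurements.

```python
def prob_all_clean(panic_rate: float, boots: int) -> float:
    """Probability that `boots` independent boots all complete without a panic."""
    return (1.0 - panic_rate) ** boots


# With a 75% panic rate, ten clean boots in a row would be very unlikely.
p = prob_all_clean(0.75, 10)
print(f"{p:.2e}")  # prints 9.54e-07
```

Under those assumptions, ten clean boots would happen by chance roughly once in a million tries, so the result is strong evidence that the parameter changes the behaviour. It does not rule out a much lower residual panic rate, which would need many more boots to detect.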

CyberLeader3000 commented 1 month ago

Sorry for re-opening this old thread, however thanks for this it really helps. I had the same problem updating from PiOS 11 to PiOS 12 on my CM4 based NAS (https://www.hackster.io/cyberleader3000/nassie-raspberry-pi-home-network-attached-storage-hardware-38a258)

Using the workaround I could run PiOS 12 with the ASMedia SATA PCIe card installed. However, when I installed OMV7 (Open Media Vault https://www.openmediavault.org/) the PCIe card stopped working. The changes to cmdline.txt are still in the file but it seems like it is not being used.

I am not sure if OMV7 is booting the system differently. I have asked about it on the OMV forum but no response so far.

So I thought I might compile the kernel with the changes in it, so that I would not have to use cmdline.txt, because it does not seem to work for OMV7.

What did you do when you compiled the kernel? Did you use "menuconfig" or edit the ".config" file manually? Did you just comment out the following lines?

CONFIG_PCIEPORTBUS=y
CONFIG_PCIEAER=y
CONFIG_PCIEASPM_DEFAULT is not set (instead of CONFIG_PCIEASPM_DEFAULT=y)
CONFIG_PCIEASPM_POWERSAVE=y
CONFIG_PCIE_PME=y
CONFIG_PCIE_DPC=y

Thanks!

julienrobin28 commented 1 month ago

Hi @CyberLeader3000 and sorry for the delay,

I took a look at my files about this issue, and I collected some of the related information.

In the Raspberry Pi fork of the Linux kernel, the default PCI-related options changed between linux-1.20230405 (6.1.21) and linux-stable_20231004 (6.1.54). In the linux-stable_20231004/arch/arm64/configs/bcm2711_defconfig file, the related options have been added (they were missing from the previous version).

bcm2711_defconfig-from-rpi-6.1.21.txt bcm2711_defconfig-from-rpi-6.1.54.txt

Taking a look with "Meld" shows the following differences about them:

Screenshot_2024-08-09_11-54-40

However, after the build configuration step, those files are used as the basis to create a .config file, which is much bigger than the bcm2711_defconfig file.

Meld shows the following differences about associated .config files:

Screenshot_2024-08-09_12-01-46

Depending on the build options of the kernel you are using, check options like CONFIG_CMDLINE_FORCE (shouldn't be set), CONFIG_CMDLINE_FROM_BOOTLOADER (should be y) and CONFIG_CMDLINE (which may be set to some default options in case the bootloader doesn't provide a kernel command line for any reason).

If the cmdline.txt file is ignored by OpenMediaVault, it may be because it uses another bootloader (like GRUB or U-Boot) instead of (or on top of) the Pi's default bootloader. You may need to figure out how to access and change its kernel command line.
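One way to find out whether the parameters from cmdline.txt actually reached the running kernel is to look at `/proc/cmdline`, which holds the command line the kernel was booted with. The following is a minimal sketch; the helper names are illustrative, and the parameter list is the one used as a workaround in this thread.

```python
def missing_params(cmdline: str, wanted: list[str]) -> list[str]:
    """Return the wanted parameters that do not appear in the command line."""
    tokens = cmdline.split()
    return [param for param in wanted if param not in tokens]


def read_running_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was started with."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

On the affected machine, `missing_params(read_running_cmdline(), ["pcie_aspm=off", "pcie_ports=compat"])` returning a non-empty list would suggest that something other than cmdline.txt is supplying the kernel command line.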

Just in case: Beware y and m options for hardware support

Some of the kernel options (related to kernel modules for hardware support) may be set to m instead of y. The m option creates a dynamically loadable module (placed into the rootfs), while the y option builds the driver statically into the kernel (so it is available as soon as the kernel initializes).

The m option is generally fine for everything that isn't required to access the rootfs: once the root file system is mounted, those kernel modules are loaded dynamically if required, and the associated hardware starts working. The advantage is that if a kernel module isn't needed, you save some RAM and avoid your system loading and initializing tons of unnecessary drivers.

But I don't remember in which versions of the Raspberry Pi kernel the module(s) required by SATA cards weren't statically built, which of course is a problem if your root file system is behind the SATA card.
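To check whether a given driver is built into an installed kernel rather than shipped as a loadable module, one option is the `modules.builtin` file that a kernel build places under `/lib/modules/<release>/`. The sketch below assumes that file is present on the system and uses `ahci` (the AHCI SATA driver) as the example name; the helper names are hypothetical.

```python
import os


def is_builtin(module: str, builtin_listing: str) -> bool:
    """Tell whether `module` appears in the text of a modules.builtin file."""
    # Module names treat '-' and '_' as equivalent.
    target = module.replace("-", "_") + ".ko"
    for line in builtin_listing.splitlines():
        name = os.path.basename(line.strip()).replace("-", "_")
        if name == target:
            return True
    return False


def read_builtin_listing() -> str:
    """Read modules.builtin for the currently running kernel."""
    path = f"/lib/modules/{os.uname().release}/modules.builtin"
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```

If `is_builtin("ahci", read_builtin_listing())` is false, the driver has to come from the initramfs or the root file system, which matters when the rootfs itself sits behind the SATA card.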

Good luck for your investigations!

CyberLeader3000 commented 1 month ago

Hi julienrobin28, thanks for getting back to me. No problem with the delay, it has taken me a while to do some more debugging. Big thanks for the additional information.

Debugging is a little bit hard for me because I am running headless with the "lite" version of PiOS. I need to remove the HAT and connect a UART to USB to see if I can get boot information.

I tried building a couple kernels with different configurations, however the PCIe did not work in them. :-(

I decided to start again with a new PiOS Bookworm image and then updated it. It seemed to work with the SATA card installed so I installed OMV7 and it was still booting and working with the SATA card installed. I hot plugged an SSD and it worked as well. This was looking good. I then re-booted the system and it would not boot. So it looks like it only boots if there are no discs connected to the SATA connectors. :-( It is not really practical to always hot-plug the drives after a re-boot.

My root filesystem is on the SD card just to make it easier to change and backup.

I have been using ASM 1064 SATA cards so I bought an ASM1166 like you use to see if it makes a difference. I tried the ASM 1166 and it did not work for me.

So my plan:
- connect the UART for debug messages.
- check messages from standard firmware.
- try a couple more custom kernel builds.

I will also see if I can get any input from the Pi forums.

Thanks!

CyberLeader3000 commented 3 weeks ago

I have been doing a bit more debugging on this problem. I tried several different configurations and custom kernels but nothing worked (or at least not consistently).

It is hard to know what is happening because the NAS runs headless, so I connected a UART-to-USB adapter and enabled the port by adding enable_uart=1 and uart_2ndstage=1 to config.txt. When I added "pcie_aspm=off pcie_ports=compat" to cmdline.txt together with the change to config.txt, the system started working.

It seems to work consistently with both changes. If I comment out the debug port line in config.txt the system gets a kernel panic.

The system boots and works even when the debug hardware is not connected. There seem to be two problems. First, it looks like there is a change to the PCI clkreq# mode configuration, and the cmdline.txt parameters fix this problem. Second, there might be a timing/synchronization problem, and writing to the debug port seems to make this work. It could be that a synchronization step of some sort needs to be added to the boot procedure.

While my system appears to be working at the moment, I am not sure if this is a stable long term solution.

Interesting information:
- I bought an ASM1166 6-port SATA PCIe card and I could not get the kernel to even start.
- There seems to be a similar issue on Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=217276

aimbotbob commented 2 weeks ago

So I have just come back to this thread as frankly I got frustrated with the Pi / SATA issue and decided to put it into the fuck-it bucket for a little while, it is nice to discover that the problem has only gotten worse in my absence and my very limited Linux knowledge has only regressed in that time.

So today I have spent several hours with multiple different SATA cards (all ASM based besides one (VL805)), PCIE risers and Kernel configs and not a single bloody combination of any can even get my Pi to detect a SATA card at this point - looks like I have got similar problems to @CyberLeader3000 .

I'll spend some more time tinkering over the next few days but I am already close to my wits end with this thing already, if no one hears from me it is because I got pissed off and switched to some hardware that isn't going to make me jump through hoops just to get SATA running.

julienrobin28 commented 2 weeks ago

Hi @aimbotbob I understand and sometimes share your frustration.

I believe the PCIe bus of the BCM2711 just turned out to be "not good enough" for many things other than the VL805, which was embedded on the Pi 4 boards. Those newly enabled-by-default kernel features just made it more visible, instead of literally being a "regression". I don't know whose fault it is, but I don't believe it was either intentional or planned...

There is a guy (a Geerling guy, if I'm right) who tries to list every PCIe device that has been tested on Raspberry Pi devices, because a lot of them aren't working fine.

As for me, I switched to a Pi 5 board + 52Pi P02 PCIe X1 adapter board.

About the ASM1166, the Pi 5 and the 52Pi board: it is said to be able to run at PCIe Gen 3, but in reality it can't. It's unstable there and should be kept at Gen 2 (which is perfectly stable).

Remaining problems:

About how these problems are frustrating:

It makes me realize how much work goes into every device around us (TV boards, routers, connected gadgets, etc.), as every problem we hit while crafting our toys is inevitably encountered by many other people on many other projects and devices. While I want to see this as "the bare minimum", in reality, being uncompromising about everything that doesn't work well is a huge amount of work, even when targeting a single fixed usage for every customer with no future updates.

And I'm not even talking about feedback and unexpected issues appearing once the device is installed and serving in real life 😱 If you add the updates and the unavoidable changes between OS versions, firmware, drivers, libraries, etc., it's literally endless work.

Using other devices, I quickly realized that still having support in the latest OS and kernels, and even improvements (most of the time 😅), for boards that are more than 10 years old is almost impossible outside of the Raspberry Pi Foundation. I still have some Raspberry Pi 1 boards working fine with both IMX219 cameras and WiFi dongles, including a completely rewritten camera stack (that I sometimes hated of course, but in the end, with a lot of work, I eventually got everything working 100% fine).

I was even unable to get new WiFi cards (Intel BE200) to work on a PC AM4 platform because of an incompatibility between the Intel BE200 and some AGESA versions, and both AMD and Intel have been completely quiet about it for months, so I'm forced to note that things like this exist even in the x86_64 PC world.

My conclusion is:

Strangely enough, despite this endless, exhausting stream of problems encountered with Pi devices, in the end they remain really respectable devices, even if they don't manage to be as perfect as expected. Of course it's always interesting to test other products, but be careful if you expect better: it's not guaranteed (at all!)

And yes, we are still going to be angry about problems in the future 😁 👍 We should be prepared to keep up with this, as many of them may be solved and/or worked around. Maybe more work should be done globally on making existing technologies more reliable instead of creating so many new technologies on top of them (but that's probably out of the scope of this issue anyway, and maybe out of scope for a single company).

CyberLeader3000 commented 2 weeks ago

Hi julienrobin28,

Thanks for more work on this topic. I also posted this issue on the Raspberry Pi forums (https://forums.raspberrypi.com/viewtopic.php?t=375290&e=1&view=unread#unread) and got a reply from a forum moderator. He replied but has not looked into it yet.

I tried a Pi5 + ASM1166 + 52Pi PCI board and it did not work but I did not modify config.txt. Do you know if there is a similar dtoverlay for the CM4? I will look into dtoverlay.

Are you using more than 1 drive with Pi5 + ASM1166 + 52Pi?

With the original changes to cmdline.txt, I can get CM4 + ASM1064 working with 1 SSD drive. If I add more drives it stops working. :-(

I have now soldered a debug console connector to the HAT board so I can see what is happening when it boots. I have not had a chance to look at it in detail, but it looks like when it boots successfully it does not see the HDD at first and must scan for it later. When it fails to boot, it seems to find the HDD during early boot. I need to do more investigation.

I think it looks like Jeff Geerling's PCI page has space for Pi5 information but none has been added yet. I guess he really also needs a Pi 4 Bookworm column.

geerlingguy commented 1 week ago

@CyberLeader3000 - I have things split between CM4/Pi 4 and Pi 5 (and presumably CM5 at some point), just because the physical implementation differs between BCM2711 and BCM2712. Trying to add a matrix of all distros + versions would make it a bit heavy, so I'll keep it divided by hardware only.

The GitHub issues attached to particular devices have discussion about any quirks or problems that crop up with later OS revisions. There are already a few issues, like the ones for the Intel AX201/AX200 WiFi adapters, where people have noticed PCIe issues cropping up with later Pi OS releases that require workarounds (and that weren't a problem in Pi OS 11).

julienrobin28 commented 1 week ago

Hi @CyberLeader3000

I re-plugged my CM4 + CM4 IO Board and tested the ASM1166 PCIe SATA card again.

This allows me to confirm the issue still exists, and the observations previously reported about it are still valid as of today (2024/09/13).

Nothing changed in my case, but I'm detailing everything here in case it turns out to be useful for spotting differences with your configuration.

The output of my vcgencmd bootloader_config:

[all]
BOOT_UART=1
WAKE_ON_GPIO=1
POWER_OFF_ON_HALT=0

# Boot Order Codes, from https://www.raspberrypi.com/documentation/computers/raspberry-pi.html#BOOT_ORDER
# Try SD first (1), followed by USB PCIe (4), NVMe PCIe (6), USB SoC XHCI (5), network (2), the retry (f)
BOOT_ORDER=0xf25641

# Set to 0 to prevent bootloader updates from USB/Network boot
# For remote units EEPROM hardware write protection should be used.
ENABLE_SELF_UPDATE=1
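
For anyone comparing boot orders: the BOOT_ORDER value is read one hex digit at a time, starting from the rightmost digit. Below is a minimal Python sketch of that decoding. The labels are taken only from the comment in the config above; any other code is reported as unknown rather than guessed, and the function name `decode_boot_order` is just an illustrative choice.

```python
# Labels copied from the comment in the bootloader config above.
# Codes not mentioned there are deliberately left out (reported as unknown).
BOOT_MODES = {
    0x1: "SD card",
    0x2: "network",
    0x4: "USB PCIe",
    0x5: "USB SoC XHCI",
    0x6: "NVMe PCIe",
    0xF: "retry",
}


def decode_boot_order(value: int) -> list[str]:
    """Return the boot modes in the order they are tried (lowest hex digit first)."""
    modes = []
    while value:
        nibble = value & 0xF
        modes.append(BOOT_MODES.get(nibble, f"unknown (0x{nibble:x})"))
        value >>= 4
    return modes


if __name__ == "__main__":
    print(" -> ".join(decode_boot_order(0xF25641)))
```

With 0xf25641 this walks through the digits 1, 4, 6, 5, 2, f, which matches the order described in the config comment.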

By the way, the BOOT_UART=1 option above only provides details from the 1st stage bootloader. To get all the interesting boot logs over UART, including those from the 2nd stage bootloader (the one from the SD card boot partition) and from the kernel, I added the following lines to config.txt:

enable_uart=1
uart_2ndstage=1

The default cmdline.txt already contains console=serial0,115200. Then, from a second computer with a 3.3V TTL UART adapter, I ran apt install tio, then tio --baudrate 115200 --databits 8 --flow none --stopbits 1 --parity none /dev/ttyUSB0 to get the UART stream displayed in real time.

With the default command line I still get the kernel panics, and everything goes back to "fine" when adding pcie_aspm=off and pcie_ports=compat to /boot/firmware/cmdline.txt. When I say "fine", I mean fine most of the time and completely stable once booted, but there are still rare cases of kernel panics during (re)boots (maybe 1 chance out of 50?), see https://github.com/raspberrypi/linux/issues/5659#issuecomment-1800339837
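
Since cmdline.txt has to stay a single line, the two parameters are appended to the existing line rather than added on new lines. A minimal Python sketch of doing that without adding duplicates is shown below. The helper name `add_kernel_params` and the sample command line are hypothetical, and the sketch only builds the new line; writing it back to /boot/firmware/cmdline.txt is left to the reader.

```python
def add_kernel_params(cmdline: str, params: list[str]) -> str:
    """Append each parameter that is not already present, keeping a single line."""
    tokens = cmdline.split()
    for param in params:
        if param not in tokens:
            tokens.append(param)
    return " ".join(tokens)


if __name__ == "__main__":
    # Hypothetical, shortened command line used only for illustration.
    current = "console=serial0,115200 console=tty1 rootwait\n"
    print(add_kernel_params(current, ["pcie_aspm=off", "pcie_ports=compat"]))
```

Running the helper twice on its own output leaves the line unchanged, so it is safe to re-apply after an OS update rewrites the file.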

I can also confirm I can boot it with several SATA devices already connected to it (I was already using it with 4 devices in the past, 3 x SATA 12 TB HDD + 1 SATA SSD - this is still what I'm using today on the Pi 5). SATA Hot-plug also works.

Also, moving the rootfs off the SD card (onto a SATA SSD connected through the ASM1166 PCIe SATA adapter card) still works fine in my case. In order to achieve this, from another computer I did the following:

Since Raspberry Pi OS 12, if I'm right, the boot partition (which is still on the SD card in my case) contains an "initramfs" holding the kernel modules required to access SATA drives behind PCIe adapters. However, keep in mind that on previous Raspberry Pi OS versions which didn't use an initramfs (or on other OSes which still don't), you may be unable to access a rootfs behind a PCIe SATA adapter, as the boot partition and the modules statically built into the kernel don't include enough drivers to reach SATA drives through PCIe (in that case, however, it generally works fine over a USB SATA adapter).

Note about ASM1166 firmware: My ASM1166 SATA PCIe adapter card has an EEPROM on it. I previously managed to upgrade the firmware (from 200529-000D-02 to 211108-0000-00) using a Windows computer and some update stuff found online. In my case it didn't change anything, but if you have already tried everything else, it may be worth trying on your card.

Note about dtoverlays: It seems that on the current kernel version, there is pcie-32bit-dma for the BCM2711 only (according to the source code) and pcie-32bit-dma-pi5 for the Pi 5; however, most of the PCIe-related options seem to be for the Pi 5. Double-checking every available option (in /boot/firmware/overlays/README) and every file (in /boot/firmware/overlays/) with an interesting name is always worthwhile since, for example, the eee option is marked as "Pi3B+ only" while also working on the Pi 4.
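
To make that double check less tedious, the README can be filtered for overlay names containing a keyword. The following Python sketch assumes each overlay entry in the README starts with a line of the form `Name:   <overlay>`; that layout is an assumption to verify against your own copy, and the sample text below is a made-up, abbreviated stand-in rather than the real file.

```python
import re

# Abbreviated, hypothetical stand-in for /boot/firmware/overlays/README.
SAMPLE_README = """\
Name:   example-overlay
Info:   (description elided)

Name:   pcie-32bit-dma
Info:   (description elided)

Name:   pcie-32bit-dma-pi5
Info:   (description elided)
"""


def find_overlays(readme_text: str, keyword: str) -> list[str]:
    """List overlay names (from 'Name:' lines) that contain the keyword."""
    names = re.findall(r"^Name:\s+(\S+)", readme_text, flags=re.MULTILINE)
    return [name for name in names if keyword in name]


if __name__ == "__main__":
    for overlay in find_overlays(SAMPLE_README, "pcie"):
        print(overlay)
```

On a real system, the text would come from reading /boot/firmware/overlays/README instead of the sample string.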

Hoping this may help!

CyberLeader3000 commented 1 week ago

Hi @julienrobin28,

Thanks, this really helps. I have done some more investigation as well. I bought an ASM1166 card and 52Pi board for my RPi 5 so I could test your setup as well. For the CM4 I used a 3.3V UART to USB adapter and the debug changes to config.txt.

Fresh, updated Pi OS on an SD card, with the dtoverlay added to config.txt, running on a Raspberry Pi 5 + 52Pi PCI board. I connected 2 drives (2.5" Ironwolf SSD and 2.5" Samsung HDD).

- ASM1064: worked and ran OMV7
- ASM1166: worked and ran OMV7

It seemed to work consistently.

Fresh, updated Pi OS on an SD card, with the cmdline.txt options added, running on a Raspberry Pi CM4 + carrier board. I connected 2 drives (2.5" Ironwolf and 2.5" Samsung HDD).

- ASM1064: worked and ran OMV7 (most of the time?)
- ASM1166: worked some of the time.

When it failed to boot there were 2 failure modes. In the first, it just hangs:

MESS:00:00:06.995874:0: Loaded 'kernel8.img' to 0x200000 size 0x8d8bd7
MESS:00:00:08.311667:0: Device tree loaded to 0x2e1d6b00 (size 0xe414)
MESS:00:00:08.317083:0: uart: Set PL011 baud rate to 103448.300000 Hz
MESS:00:00:08.324159:0: uart: Baud rate change done...
MESS:00:00:08.326181:0:

This does not happen often but trying again seemed to fix the problem.

The second failure mode is:

[ 3.825459] pci 0000:00:00.0: enabling device (0000 -> 0002)
[ 3.834270] ahci 0000:01:00.0: enabling device (0000 -> 0002)
[ 3.899485] SError Interrupt on CPU0, code 0x00000000bf000002 -- SError
[ 3.899508] CPU: 0 PID: 121 Comm: (udev-worker) Not tainted 6.6.31+rpt-rpi-v8 #1 Debian 1:6.6.31-1+rpt1
[ 3.899522] Hardware name: Raspberry Pi Compute Module 4 Rev 1.1 (DT)
[ 3.899527] pstate: 800000c5 (Nzcv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 3.899538] pc : el1_interrupt+0x20/0x68
[ 3.899565] lr : el1h_64_irq_handler+0x18/0x28
[ 3.899580] sp : ffffffc080edb5b0
[ 3.899584] x29: ffffffc080edb5b0 x28: ffffff80442cbd80 x27: ffffff804026aa80
[ 3.899603] x26: ffffffc080edbc60 x25: ffffffd5ac25c070 x24: ffffff8040bfc0c0
[ 3.899617] x23: 0000000080000005 x22: ffffffd5ac22e6b8 x21: ffffffc080edb730
[ 3.899631] x20: ffffffd5d10100e0 x19: ffffffc080edb5e0 x18: ffffffc08066bd78
[ 3.899644] x17: 0000000000000000 x16: ffffffd5d1792ca0 x15: 0000007facf9cfff
[ 3.899657] x14: 018fc9985a55ae78 x13: ffffffd5d1d3cbb0 x12: 00000000fa83b2da
[ 3.899670] x11: 0000000000000086 x10: 0000000000001a30 x9 : ffffffd5ac2305f0
[ 3.899684] x8 : 0101010101010101 x7 : ffffff8044ddc5e8 x6 : 0000000000000000
[ 3.899697] x5 : 0000000000000001 x4 : 0000000000000000 x3 : 0000000000000000
[ 3.899709] x2 : 0000000000000000 x1 : 00000000000000c0 x0 : ffffffc080edb5e0
[ 3.899724] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 3.899729] CPU: 0 PID: 121 Comm: (udev-worker) Not tainted 6.6.31+rpt-rpi-v8 #1 Debian 1:6.6.31-1+rpt1
[ 3.899741] Hardware name: Raspberry Pi Compute Module 4 Rev 1.1 (DT)
[ 3.899746] Call trace:
[ 3.899750] dump_backtrace+0xa0/0x100
[ 3.899762] show_stack+0x20/0x38
[ 3.899770] dump_stack_lvl+0x48/0x60
[ 3.899783] dump_stack+0x18/0x28
[ 3.899794] panic+0x330/0x398
[ 3.899809] nmi_panic+0x94/0xa0
[ 3.899821] arm64_serror_panic+0x78/0x90
[ 3.899830] do_serror+0x44/0x88
[ 3.899839] el1h_64_error_handler+0x30/0x48
[ 3.899853] el1h_64_error+0x64/0x68
[ 3.899861] el1_interrupt+0x20/0x68
[ 3.899872] el1h_64_irq_handler+0x18/0x28
[ 3.899885] el1h_64_irq+0x64/0x68
[ 3.899892] ahci_enable_ahci+0x20/0xa8 [libahci]
[ 3.899931] ahci_save_initial_config+0x38/0x460 [libahci]
[ 3.899962] ahci_init_one+0x350/0xda0 [ahci]
[ 3.900003] pci_device_probe+0xa0/0x148
[ 3.900015] really_probe+0x150/0x2c0
[ 3.900025] __driver_probe_device+0x80/0x140
[ 3.900034] driver_probe_device+0xe0/0x170
[ 3.900042] __driver_attach+0x9c/0x1b0
[ 3.900050] bus_for_each_dev+0x80/0xe8
[ 3.900065] driver_attach+0x2c/0x40
[ 3.900072] bus_add_driver+0xec/0x1f8
[ 3.900087] driver_register+0x68/0x138
[ 3.900096] __pci_register_driver+0x54/0x68
[ 3.900105] ahci_pci_driver_init+0x30/0xff8 [ahci]
[ 3.900142] do_one_initcall+0x60/0x2c0
[ 3.900152] do_init_module+0x60/0x218
[ 3.900167] load_module+0x1dd0/0x2080
[ 3.900179] init_module_from_file+0x8c/0xd8
[ 3.900193] __arm64_sys_finit_module+0x14c/0x2f8
[ 3.900207] invoke_syscall+0x50/0x128
[ 3.900222] el0_svc_common.constprop.0+0x48/0xf0
[ 3.900237] do_el0_svc+0x24/0x38
[ 3.900251] el0_svc+0x40/0xe8
[ 3.900265] el0t_64_sync_handler+0x100/0x130
[ 3.900278] el0t_64_sync+0x190/0x198
[ 3.900288] SMP: stopping secondary CPUs
[ 3.900300] Kernel Offset: 0x1551000000 from 0xffffffc080000000
[ 3.900306] PHYS_OFFSET: 0x0
[ 3.900309] CPU features: 0x0,80000201,3c020000,0000421b
[ 3.900318] Memory Limit: none
[ 4.211322] ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]---

This error seems to happen repeatedly and it takes several tries to get it to boot again.
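
When comparing several of these panics, it can help to pull out just the call-trace frames that belong to loadable modules, since those show which driver was running when the error arrived. Here is a minimal Python sketch, assuming frames look like `function+0xOFF/0xSIZE [module]` as in the trace above; the helper name `module_frames` is hypothetical and the sample is a short excerpt of that trace.

```python
import re

# Short excerpt of the call trace above, used as sample input.
SAMPLE_TRACE = """\
[ 3.899861] el1_interrupt+0x20/0x68
[ 3.899872] el1h_64_irq_handler+0x18/0x28
[ 3.899885] el1h_64_irq+0x64/0x68
[ 3.899892] ahci_enable_ahci+0x20/0xa8 [libahci]
[ 3.899931] ahci_save_initial_config+0x38/0x460 [libahci]
[ 3.899962] ahci_init_one+0x350/0xda0 [ahci]
[ 3.900003] pci_device_probe+0xa0/0x148
"""

# A frame is "symbol+0xoffset/0xsize", optionally followed by "[module]".
# Only frames that carry a module tag are captured here.
FRAME_RE = re.compile(r"([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]")


def module_frames(log_text: str) -> list[tuple[str, str]]:
    """Return (function, module) pairs, innermost frame first."""
    return FRAME_RE.findall(log_text)


if __name__ == "__main__":
    for function, module in module_frames(SAMPLE_TRACE):
        print(f"{module}: {function}")
```

Because the pattern does not depend on line breaks, it also works on logs where the whole trace has been pasted as a single line.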

In general, my CM4 OMV7 with the ASM1064 is running ok, however, I am worried that after a power failure, it will not come back on. It is good when someone can cycle power but it may be a problem if no one is around.

I have not measured the time but I think one issue is that boot times appear to be longer, so sometimes I thought it did not boot but it was still booting. This is where the debug port really helps.

julienrobin28 commented 1 week ago

Hi @CyberLeader3000 and thanks for the feedback, which brings interesting details

About 1st failure mode: I can confirm I also had the "just hangs" failure mode once on the CM4 + ASM1166 + pcie_aspm=off + pcie_ports=compat, exactly the same way you did. As it happened only once, I considered it might be a fluke, something unrelated, but since you also encountered it, it sounds like it's related after all.

2024-01-25-cm4-pcie-asm1166-stuck-after-2nd-stage.txt

It was on 2024/01/25, at the end of the 2nd stage boot loader the green LED was stuck on and the UART log stopped at the exact same place.

About 2nd failure mode: And I can also confirm, as mentioned in https://github.com/raspberrypi/linux/issues/5659#issuecomment-1800339837 that I also had (in some rare cases) the second failure mode (Kernel panic ending with ---[ end Kernel panic - not syncing: Asynchronous SError Interrupt ]--- message) on the CM4 + ASM1166 + pcie_aspm=off + pcie_ports=compat.

About the probability of those failures: You seem to be more impacted despite using the same hardware, same software and same settings, which may be an interesting observation about this issue in general: each unit of the same product may be more or less affected, which probably means the issue is sensitive to electrical component tolerances and variations (in the BCM2711 SoC, CM4 board, CM4 IO board, ASM1166 board and ASM1166 chip). Having confirmed that the ASM1166 works fine on the Pi 5 is useful anyway.

As for the ASM1064 on the CM4 seeming to work fine: as you pointed out, unless it is used for a very long period (for something important enough that you quickly notice when it's offline), it's hard to be sure it's always going to boot fine. When a device risks freezing on each reboot (after updates or even voltage drops on the electrical grid, for example), most of the time, by the time you realize it's offline, you're screwed / not next to it 😅

Anyway nice to confirm too that both for me and you, the Pi 5 isn't affected by this issue 🥳 so it's only about CM4

CyberLeader3000 commented 1 week ago

Hi @julienrobin28,

Thanks again for confirming things.

I think failure mode 1 happens less often and is easy to recover, so I am less worried about it. I think the next message should be "Booting Linux" so the transition from the boot code to Linux fails.

The second failure mode I think is more complex and timing-related. The failure happens much more often when drives are connected. Without drives connected, it does not seem to happen.

Looking at the log message, it is interesting how some of the message order and timings change. I think this might happen because of variations like the power supply ramp, PLL lock, and drives (if attached).

I don't know much about Linux internals but I know the ARM cores a bit. It looks like a System Error (SError) is getting caught by the EL1 (Exception Level 1 - Supervisor mode) interrupt handler. This is normal and what it should do. System errors are normally caused by something that can not be traced back to a source easily, things like a writeback from a cache to inaccessible memory. It looks like the original cause might be an EL0 (Exception Level 0 - User mode) interrupt in the AHCI driver that was not handled.
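
The `code 0x00000000bf000002` value printed in the panic can be split into its fields as a sanity check. The sketch below is only a rough aid: it assumes that value is the exception syndrome register (ESR_EL1) and uses the ESR_ELx field layout from the Arm architecture (EC in bits 31:26, IL in bit 25, ISS in bits 24:0, and, for an SError, the IDS flag in ISS bit 24). The function name `decode_esr` is illustrative.

```python
# ESR_ELx field layout as defined by the Arm architecture:
#   EC = bits [31:26], IL = bit 25, ISS = bits [24:0]
EC_SERROR = 0x2F  # exception class value for an SError interrupt


def decode_esr(esr: int) -> dict:
    """Split a syndrome value into EC, IL and ISS; add IDS for SErrors."""
    ec = (esr >> 26) & 0x3F
    il = (esr >> 25) & 0x1
    iss = esr & 0x1FFFFFF
    fields = {"ec": ec, "il": il, "iss": iss, "is_serror": ec == EC_SERROR}
    if ec == EC_SERROR:
        # IDS = 1 means the rest of the ISS is implementation defined.
        fields["ids"] = (iss >> 24) & 0x1
    return fields


if __name__ == "__main__":
    for name, value in decode_esr(0x00000000BF000002).items():
        print(f"{name}: {value:#x}" if isinstance(value, int) and not isinstance(value, bool) else f"{name}: {value}")
```

For this panic the exception class comes out as 0x2f (SError) with the IDS flag set, so the remaining syndrome bits are implementation defined and cannot be interpreted further without documentation for the core or SoC.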

This is interesting but I have no idea how to investigate a potential issue with the AHCI driver.

I leave the rootfs on the SD so it is easy for me to swap OS versions. I think this is just a boot issue, but I am going to run my NAS for a longer time and see how it works.

Thanks!