raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/
Other
11.2k stars 5.02k forks source link

Raspberry Pi 5 NVMe using PCIe not working with kernel 6.6.x #5873

Open teoteo opened 10 months ago

teoteo commented 10 months ago

Describe the bug

Connecting an NVMe SSD using the Pimoroni "NVMe Base for Raspberry Pi 5" on Arch Linux ARM with kernel versions prior to 6.6.x not on versions 6.6.9 and higher.

Steps to reproduce the behaviour

Using Arch Linux (in my case AstroArch) with connected a PCIe extension board with NVMe SSD connected.

  1. Boot from release 1.7 with kernel 6.1.61-1-rpi-ARCH
  2. with command lsblk I can see the SSD
  3. update the system with update-astroarch with kernel 6.6.12-1-rpi and reboot
  4. with command lsblk I can’t see the SSD
  5. downgrade the kernel with sudo pacman -U https://alaa.ad24.cz/packages/l/linux-rpi/linux-rpi-6.1.70-1-aarch64.pkg.tar.xz and reboot
  6. with command lsblk I can see the SSD

Device (s)

Raspberry Pi 5

System

OS: Arch Linux ARM aarch64 Host: Raspberry Pi 5 Model B Rev 1.0 vcgencmd not available Kernel: Linux astroarch 6.6.12-1-rpi #1 SMP PREEMPT Wed Jan 17 19:05:39 MST 2024 aarch64 GNU/Linux

Logs

No response

Additional context

I’ve tried with different kernels and with kernels after 6.6.x (linux-rpi-16k too) can’t see the SSD. Asking on Arch Linux ARM forum don’t seems to be a modules issue.

JamesH65 commented 10 months ago

Is the drive you are using in this compatibility list? https://shop.pimoroni.com/products/nvme-base?variant=41219587178579

teoteo commented 10 months ago

It isn’t listed: it’s a Samsung 970 EVO 1TB. It woks on Raspberry Pi OS: 2024/01/05 15:57:40 Copyright (c) 2012 Broadcom version 30cc5f37 (release) (embedded) Raspberry Pi reference 2023-12-05

Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 70cd6f2a1e34d07f5cba7047aea5b92457372e05, stage4
Linux raspberrypi 6.1.0-rpi7-rpi-2712 #1 SMP PREEMPT Debian 1:6.1.63-1+rpt1 (2023-11-24) aarch64 GNU/Linux
pibenchmarks 970 EVO 2024-01-20 alle 10 29 35

but if it can help, I can try with a listed Crucial P3 4TB.

popcornmix commented 10 months ago

It isn’t listed: it’s a Samsung 970 EVO 1TB. It woks on Raspberry Pi OS:

On RPiOS can you run update to 6.6 kernel and confirm you see the same failure as you do on Arch?

teoteo commented 10 months ago

On RPiOS can you run update to 6.6 kernel and confirm you see the same failure as you do on Arch?

Updated and it works:

matteo@raspberrypi:~ $ vcgencmd version; cat /etc/rpi-issue; uname -a
2024/01/15 19:02:28 
Copyright (c) 2012 Broadcom
version f0aa0715 (release) (embedded)
Raspberry Pi reference 2023-12-05
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 70cd6f2a1e34d07f5cba7047aea5b92457372e05, stage4
Linux raspberrypi 6.6.12-v8-16k+ #1723 SMP PREEMPT Thu Jan 18 13:00:39 GMT 2024 aarch64 GNU/Linux
matteo@raspberrypi:~ $ lsblk 
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
loop0         7:0    0  93.3M  1 loop /snap/core/16204
sda           8:0    1     0B  0 disk 
mmcblk0     179:0    0  58.9G  0 disk 
├─mmcblk0p1 179:1    0   512M  0 part /boot/firmware
└─mmcblk0p2 179:2    0  58.4G  0 part /
nvme0n1     259:0    0 931.5G  0 disk 
├─nvme0n1p1 259:1    0   256M  0 part 
└─nvme0n1p2 259:2    0 117.5G  0 part 

Do you have suggestions for Arch Linux ARM guys to better investigate the issue?

graysky2 commented 10 months ago

Could the issue be that other package is not providing something needed? ArchARM ships RPiOS bcm2712_defconfig as our baseline config. We add a few modules and settings but the basis set is there. I do not have NVMe hardware for testing on my RPi5/ArchARM and have no experience using these devices.

  1. Let's rule out that something in the Arch ARM kernel config modifications is causing this.
    • I built a version of the linux-rpi-16k kernel package using the .config generated by make bcm2712_defconfig (for reference full .config) without any modifications which should be the same one as RPiOS is using. Download the kernel package from here.
    • Does booting into that kernel allow the device to work?
teoteo commented 10 months ago
  • Does booting into that kernel allow the device to work?

Yes, with that kernel the device works:

~ » uname -a; lsblk                                                                              astronaut@astroarch
Linux astroarch 6.6.12-2-rpi-16k #1 SMP PREEMPT Sun Jan 21 11:48:27 EST 2024 aarch64 GNU/Linux
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
mmcblk0     179:0    0  29.1G  0 disk 
├─mmcblk0p1 179:1    0   256M  0 part /boot
└─mmcblk0p2 179:2    0  28.9G  0 part /
nvme0n1     259:0    0 931.5G  0 disk 
├─nvme0n1p1 259:1    0   256M  0 part 
└─nvme0n1p2 259:2    0 117.5G  0 part 
seamusdemora commented 10 months ago

@teoteo : So glad you've opened this issue... my NVMe hardware (also fm Pimoroni) is to be delivered this week!

graysky2 commented 10 months ago

Yes, with that kernel the device works:

OK, so something in the options we add to bcm2712_defconfig is causing the problem. The added options are here. At first glance, I did not see anything obviously to blame. @popcornmix @pelwell - do you have any thoughts?

The modified .config is generated like this:

    make bcm2712_defconfig
    cat archarm.diffconfig >> .config
    make oldconfig

For reference, the following commit illustrates the fully expanded .config highlighting the difference between RPiOS's bcm2712_defconfig and the one that is apparently to blame for the reported issue: https://github.com/graysky2/PKGBUILDs/commit/4846daab5d21f5bfcd90bc54beea23f784a28cf7

Again, I am not seeing any smoking gun. Grateful for any thought you may have.

P33M commented 10 months ago

I've just hit this with rpi-update - compare pull/5882 with 6.1.74 -> bootloader probes NVMe, Linux fails to establish a link. On 6.1.74, Linux probes fine.

P33M commented 10 months ago

Reverting this hunk which changed between 6.1 and 6.6 fixes things for me.

@@ -393,42 +395,34 @@ static u32 brcm_pcie_mdio_form_pkt(int port, int regad, int cmd)
 /* negative return value indicates error */
 static int brcm_pcie_mdio_read(void __iomem *base, u8 port, u8 regad, u32 *val)
 {
-       int tries;
        u32 data;
+       int err;

        writel(brcm_pcie_mdio_form_pkt(port, regad, MDIO_CMD_READ),
                   base + PCIE_RC_DL_MDIO_ADDR);
        readl(base + PCIE_RC_DL_MDIO_ADDR);
-
-       data = readl(base + PCIE_RC_DL_MDIO_RD_DATA);
-       for (tries = 0; !MDIO_RD_DONE(data) && tries < 10; tries++) {
-               udelay(10);
-               data = readl(base + PCIE_RC_DL_MDIO_RD_DATA);
-       }
-
+       err = readl_poll_timeout_atomic(base + PCIE_RC_DL_MDIO_RD_DATA, data,
+                                       MDIO_RD_DONE(data), 10, 100);
        *val = FIELD_GET(MDIO_DATA_MASK, data);
-       return MDIO_RD_DONE(data) ? 0 : -EIO;
+
+       return err;
 }

 /* negative return value indicates error */
 static int brcm_pcie_mdio_write(void __iomem *base, u8 port,
                                u8 regad, u16 wrdata)
 {
-       int tries;
        u32 data;
+       int err;

        writel(brcm_pcie_mdio_form_pkt(port, regad, MDIO_CMD_WRITE),
                   base + PCIE_RC_DL_MDIO_ADDR);
        readl(base + PCIE_RC_DL_MDIO_ADDR);
        writel(MDIO_DATA_DONE_MASK | wrdata, base + PCIE_RC_DL_MDIO_WR_DATA);

-       data = readl(base + PCIE_RC_DL_MDIO_WR_DATA);
-       for (tries = 0; !MDIO_WT_DONE(data) && tries < 10; tries++) {
-               udelay(10);
-               data = readl(base + PCIE_RC_DL_MDIO_WR_DATA);
-       }
-
-       return MDIO_WT_DONE(data) ? 0 : -EIO;
+       err = readw_poll_timeout_atomic(base + PCIE_RC_DL_MDIO_WR_DATA, data,
+                                       MDIO_WT_DONE(data), 10, 100);
+       return err;
 }

So, MDIO prodding broke - but why does RP1 still work?

pelwell commented 10 months ago

i.e. this commit: ca5dcc76314d1fa6d7307fd3b95039b08d2f2b97

popcornmix commented 10 months ago

Does the switch from -EIO to -ETIMEDOUT matter? Or is it the change in timeout duration? (I feel the replacement will timeout a fraction earlier)

P33M commented 10 months ago

I think the bug is doing readw into a u32 then testing the top bit for "done", and exiting if it's 0. This means writes done in rapid succession will collide.

pelwell commented 10 months ago

Yes, the readw looks like a typo/thinko (w = word). So you just want:

-       err = readw_poll_timeout_atomic(base + PCIE_RC_DL_MDIO_WR_DATA, data,
+       err = readl_poll_timeout_atomic(base + PCIE_RC_DL_MDIO_WR_DATA, data,
                                        MDIO_WT_DONE(data), 10, 100);
P33M commented 10 months ago

Yep that fixes it.

pelwell commented 10 months ago

Just push your patch.

pelwell commented 10 months ago

Include:

Fixes: ca5dcc76314d ("PCI: brcmstb: Replace status loops with read_poll_timeout_atomic()")
graysky2 commented 10 months ago

Did I misread something? @teoteo - if you booted with the unmodified 6.6.12-1 (built from https://github.com/raspberrypi/linux/commit/7df1d3b44368209517fc7dc0b35ee2621a94722e with ArchARM's config) the bug was triggered yet if you booted with the kernel I prepared using that same commit yet RPiOS's config not ArchARM's config, you did not see the bug. Is this correct?

pelwell commented 10 months ago

That would have been on a 6.1 kernel, which hasn't received the offending patch.

pelwell commented 10 months ago

But then:

Updated and it works:

matteo@raspberrypi:~ $ vcgencmd version; cat /etc/rpi-issue; uname -a
2024/01/15 19:02:28 
Copyright (c) 2012 Broadcom
version f0aa0715 (release) (embedded)
Raspberry Pi reference 2023-12-05
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 70cd6f2a1e34d07f5cba7047aea5b92457372e05, stage4
Linux raspberrypi 6.6.12-v8-16k+ #1723 SMP PREEMPT Thu Jan 18 13:00:39 GMT 2024 aarch64 GNU/Linux

I can't explain that.

graysky2 commented 10 months ago

@pelwell - if you see here, @teoteo is using the 6.6.12 kernel built from that commit....

In any case, I am glad you guys got this fixed. I updated ArchARM's kernels/should build and hit a mirror soon.

MattBlack85 commented 10 months ago

I would like to thank anyone here. Really appreciate this has been resolved thanks to collective efforts!

teoteo commented 10 months ago

I did some testing and it works on my system. Thank you all.

~ » uname -a; lsblk                                                                                                                                        
Linux astroarch 6.6.13-2-rpi #1 SMP PREEMPT Tue Jan 23 13:37:27 MST 2024 aarch64 GNU/Linux
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
mmcblk0     179:0    0 116.2G  0 disk 
├─mmcblk0p1 179:1    0   256M  0 part 
└─mmcblk0p2 179:2    0 115.9G  0 part 
nvme0n1     259:0    0 931.5G  0 disk 
├─nvme0n1p1 259:1    0   256M  0 part /boot
└─nvme0n1p2 259:2    0 931.3G  0 part /
malteneuss commented 6 months ago

Is this fix something that can help with the other NVMe-not-working cases for the normal RPI Debian?