raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/

Asynchronous SError when booting with pci=pcie_scan_all #4914

Open qookei opened 2 years ago

qookei commented 2 years ago

Describe the bug

When booting Linux with pci=pcie_scan_all, an asynchronous SError occurs when the kernel reads the configuration space of non-existent devices (e.g. 01:01.0) while probing the PCI bus. The error appears to be caused by a timeout of the read transaction (it seems to be on the PCIe bus, as the SError indicates that the system bus slave rejected the transaction).
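For readers unfamiliar with the 01:01.0 notation, here is a minimal sketch of how a bus:device.function address maps to a configuration space offset, assuming the standard PCIe ECAM layout. The function names are hypothetical, and the BCM2711 controller may reach configuration space through a different mechanism (for example an index register), so treat this as an illustration of the addressing only, not of the driver.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Standard PCIe ECAM layout (an assumption here, not taken from the driver):
 * bus in bits 27:20, device in bits 19:15, function in bits 14:12,
 * register offset in bits 11:0.
 */
static inline uint32_t ecam_offset(uint8_t bus, uint8_t dev, uint8_t fn,
                                   uint16_t reg)
{
    /* Device is 5 bits, function is 3 bits, register offset is 12 bits. */
    assert(dev < 32 && fn < 8 && reg < 4096);

    return ((uint32_t)bus << 20) |
           ((uint32_t)dev << 15) |
           ((uint32_t)fn << 12) |
           (uint32_t)reg;
}

/*
 * On most platforms a config read of an absent function completes with
 * all ones, so a vendor ID of 0xffff means "nothing here".
 */
static inline int vendor_id_absent(uint16_t vendor_id)
{
    return vendor_id == 0xffff;
}
```

With this layout, the vendor ID register of 01:01.0 sits at `ecam_offset(1, 1, 0, 0)`, i.e. 0x108000. Without pci=pcie_scan_all, Linux normally probes only device 0 below a PCIe downstream port, which is why the read of 01:01.0 is not attempted on a default boot.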

Steps to reproduce the behaviour

  1. Add pci=pcie_scan_all to the kernel command line (and earlycon to actually see the panic).
  2. Reboot and observe the kernel panic on serial.

Device(s)

Raspberry Pi 4 Mod. B

System

OS: N/A
Firmware: N/A (I think?)
Kernel: Reproduced with both the Raspberry Pi kernel and mainline Linux 5.16.10 with a default config (apart from building the PCIe controller driver into the kernel rather than as a module)

Logs

Kernel log with panic when booting mainline Linux 5.16.10: https://pastebin.com/BFVGCMTZ

Additional context

I've talked to James Quinlan from Broadcom about this, and he was able to reproduce this on his RPi4, but not on a board with the STB version of the SoC. Presumably this also affects the RPi CM4 and RPi400, although I don't own either of those, so I can't verify that.

I am guessing this is an issue with the board itself rather than with the PCIe block in the SoC (James mentioned it's the same on both versions of the SoC) or with the Linux driver, but I was unsure where to report this (and I am not knowledgeable enough to be certain this is the case).

P33M commented 2 years ago

I don't think this is fixable. Error states for config reads (completion timeout, or unsupported request) from the PCIe core propagate back as an AXI error response. Bus errors are generally quite disruptive.

There are two bits I can see in the register map that should filter out error responses: PCIE_MISC_MISC_CTRL.CFG_READ_UR_MODE and PCIE_MISC_UBUS_CTRL.PCIE_REPLY_ERR_DIS. We set the first in the pcie-brcmstb driver but not the second.

If I set the second, it seems to have no effect, which tallies with our (BCM2711) documentation that says the register is unused.
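As an illustration of the experiment described above, here is a minimal sketch of a read-modify-write that sets a control bit and reads the register back. The offset, the bit position, and the helper names are hypothetical placeholders, not the real BCM2711 values, and the register block is modelled as a plain array rather than mapped hardware.

```c
#include <stdint.h>

/* HYPOTHETICAL placeholders: not the real register offset or bit position. */
#define EXAMPLE_UBUS_CTRL_OFFSET  0x10u
#define EXAMPLE_REPLY_ERR_DIS_BIT (1u << 3)

/* Read a 32-bit register at byte offset 'off' from 'base'. */
static inline uint32_t reg_read(const volatile uint32_t *base, uint32_t off)
{
    return base[off / 4];
}

/* Write a 32-bit register at byte offset 'off' from 'base'. */
static inline void reg_write(volatile uint32_t *base, uint32_t off,
                             uint32_t val)
{
    base[off / 4] = val;
}

/*
 * Set the bits in 'mask' without disturbing the other bits, then return
 * the value read back so the caller can check whether the bits latched.
 */
static inline uint32_t reg_set_bits(volatile uint32_t *base, uint32_t off,
                                    uint32_t mask)
{
    uint32_t val = reg_read(base, off);

    reg_write(base, off, val | mask);
    return reg_read(base, off);
}
```

A successful read-back only shows that the bit is writable. As the comment above notes, a bit in a register documented as unused may accept the write and still have no effect on the error response.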

P33M commented 2 years ago

I was a bit suspicious of the timeout interval (roughly 10.8 seconds), and the only counters that are large enough are the RGR bridge control registers (aka plumbing that connects the PCIe register interface to APB) and another unused UBUS register. The RGR bridge timeout value is sufficiently large that when multiplied by a reasonable bus clock period you get "several seconds".
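The arithmetic behind that observation can be sketched as follows. The 100 MHz clock used in the example is purely an assumed figure for illustration; the thread does not state the actual bus clock or the actual counter value, and the function names are hypothetical.

```c
#include <stdint.h>

/* Time taken for a down-counter loaded with 'ticks' to expire. */
static inline double timeout_seconds(uint64_t ticks, double clock_hz)
{
    return (double)ticks / clock_hz;
}

/* Counter value needed for a given timeout, rounded to the nearest tick. */
static inline uint64_t ticks_for_timeout(double seconds, double clock_hz)
{
    return (uint64_t)(seconds * clock_hz + 0.5);
}
```

Under the assumed 100 MHz clock, a 10.8 second timeout corresponds to `ticks_for_timeout(10.8, 100e6)`, about 1.08 billion ticks, which needs a counter at least 31 bits wide. That is why only a few registers are candidates for the source of the delay.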

There are two bits that control bridge responses to timeouts. One of these should disable responding with an error, but neither has any effect. Changing the timeout also doesn't change the length of time between the last console messages and the panic.