starfive-tech / linux


nvme QID timeouts on upstream branch #98

Open Blub opened 1 year ago

Blub commented 1 year ago

I've been testing the 6.3-rc based upstream branch on a VisionFive2 (1.2a), with root on NVMe. Disk access regularly stalls for a bit, then I see messages like this one:

[  101.417700] nvme nvme0: I/O 897 QID 2 timeout, completion polled

and then things continue normally for a while. The same happens with different NVMe drives (I tried an Intel Optane and a WD RED). This does not happen with the 5.15 kernel included in the Debian SD card image.
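
For anyone who wants to reproduce this, one rough way to provoke the stalls while watching for the messages (assuming fio is installed; the file path and job parameters are only examples, any sustained write load should do):

# Terminal 1: watch for the NVMe timeout messages
dmesg -wH | grep -i 'nvme.*timeout'

# Terminal 2: generate sustained random-write load on the NVMe rootfs
fio --name=nvme-stress --filename=/root/fio-test.bin --size=1G \
    --rw=randwrite --bs=4k --iodepth=32 --ioengine=libaio --direct=1 \
    --runtime=300 --time_based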

Blub commented 1 year ago

Just tested with the updated rc4 upstream branch - same issue.

Blub commented 1 year ago

Still happening with 6.4rc1. Again, this does not happen with 5.15, so I'm pretty sure it's a driver issue, and not a power supply issue.

dpeckett commented 1 year ago

I can confirm I've reproduced the same issue, but this time on the 5.15 Debian kernel. There shouldn't be any power issues here, but I am using a very inexpensive NVMe device.

pinkavaj commented 1 year ago

The same for me using the latest (as of 2023-07-16) SD image.

with: 0001:01:00.0 Non-Volatile memory controller: Intel Corporation SSD Pro 7600p/760p/E 6100p Series (rev 03)

some (random) excerpts from dmesg

pcie_plda 2b000000.pcie: Failed to get power-gpio, but maybe it's always on.
[    3.245160] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
[    3.252518] pci 0000:00:00.0: BAR 0: no space for [mem size 0x100000000 64bit pref]
[    3.394655] pcie_plda 2c000000.pcie: Failed to get power-gpio, but maybe it's always on.
[    3.884976] pci 0001:01:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x1 link at 0001:00:00.0 (capable of 31.504 Gb/s with 8.0 GT/s PCIe x4 link)
[    3.912991] pci 0001:00:00.0: BAR 0: no space for [mem size 0x100000000 64bit pref]
[    3.925816] pci 0001:00:00.0: BAR 0: failed to assign [mem size 0x100000000 64bit pref]
[    4.768043] nvme nvme0: 4/0/0 default/read/poll queues
[    4.773220] starfive_raxda_10inch 2-0020: dsi command return -61, mode 0

pinkavaj commented 1 year ago

Any way I can help debug this? (I'm a developer, but I haven't touched the kernel in a few years ...)
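
Not sure what would be most useful, but I could collect something like the following and attach it here (controller address taken from the lspci line above; the commands are just a guess at what's relevant):

uname -a
dmesg | grep -iE 'nvme|pcie'
lspci -nnvv -s 0001:01:00.0
cat /sys/module/pcie_aspm/parameters/policy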

pinkavaj commented 1 year ago

Also the following appears in dmesg output occasionally

[21078.151284] nvme nvme0: Abort status: 0x0

pinkavaj commented 1 year ago

Related thread here https://forum.rvspace.org/t/nvme-i-o-timeouts/1545

RaitoBezarius commented 1 year ago

I can confirm running into this issue; it makes NVMe effectively unusable in the long run, because the drives get disconnected under high load / heavy writes, e.g. while compiling and installing a fresh OS.

pinkavaj commented 10 months ago

As a workaround I have lowered the timeout to a more reasonable value with

echo 3000 > /sys/devices/platform/soc/9c0000000.pcie/pci0001:00/0001:00:00.0/0001:01:00.0/nvme/nvme0/nvme0n1/queue/io_timeout

(the default is 30000, i.e. 30 seconds). I'm not sure what the lowest usable value is; it helps a bit, but it's not really a solution.
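
If it helps anyone, one way to make that workaround persistent across reboots is a udev rule that sets the sysfs attribute whenever an NVMe namespace appears (a sketch only; the rule file name is arbitrary). The nvme_core.io_timeout module parameter (in seconds) should have a similar effect, but I haven't tried it on this board.

# Example: apply the lower timeout automatically via udev instead of echoing it by hand
cat <<'EOF' | sudo tee /etc/udev/rules.d/90-nvme-io-timeout.rules
# Set the block-layer I/O timeout to 3000 ms for every NVMe namespace
ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="nvme*n*", ATTR{queue/io_timeout}="3000"
EOF
sudo udevadm control --reload-rules
sudo udevadm trigger --subsystem-match=block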