starfive-tech / VisionFive2

Getting nvme QID 2 / QID 4 timeouts during heavy use #36

Open easytarget opened 1 year ago

easytarget commented 1 year ago

I'm seeing occasional errors like:

[60386.019927] nvme nvme0: I/O 210 QID 4 timeout, completion polled
[60416.259353] nvme nvme0: I/O 195 QID 4 timeout, completion polled
[60479.458282] nvme nvme0: I/O 220 QID 2 timeout, completion polled
[60509.487744] nvme nvme0: I/O 222 QID 2 timeout, completion polled

They occur during large RPM package installs, and happened once while rsyncing a big file tree to the VF2.

When it happens, the operation hangs for some time; then these messages appear in the log, the operation proceeds, and it eventually completes successfully.
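
For anyone comparing logs, here is a minimal sketch that tallies these messages per queue and measures the gap between consecutive timeouts. It assumes the kernel log lines keep exactly the format quoted above. The names `parse_nvme_timeouts` and `summarize` are hypothetical helpers, not part of any existing tool.

```python
import re
from collections import Counter

# Matches kernel log lines such as:
# [60386.019927] nvme nvme0: I/O 210 QID 4 timeout, completion polled
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+nvme (?P<dev>nvme\d+): "
    r"I/O (?P<tag>\d+) QID (?P<qid>\d+) timeout"
)


def parse_nvme_timeouts(log_text):
    """Return a list of (timestamp, device, tag, qid) tuples in log order."""
    events = []
    for line in log_text.splitlines():
        m = TIMEOUT_RE.search(line)
        if m:
            events.append(
                (float(m["ts"]), m["dev"], int(m["tag"]), int(m["qid"]))
            )
    return events


def summarize(events):
    """Count timeouts per queue ID and list the seconds between events."""
    per_queue = Counter(qid for _, _, _, qid in events)
    gaps = [round(b[0] - a[0], 2) for a, b in zip(events, events[1:])]
    return per_queue, gaps


# The four lines reported in this issue, used as sample input.
SAMPLE = """\
[60386.019927] nvme nvme0: I/O 210 QID 4 timeout, completion polled
[60416.259353] nvme nvme0: I/O 195 QID 4 timeout, completion polled
[60479.458282] nvme nvme0: I/O 220 QID 2 timeout, completion polled
[60509.487744] nvme nvme0: I/O 222 QID 2 timeout, completion polled
"""

if __name__ == "__main__":
    per_queue, gaps = summarize(parse_nvme_timeouts(SAMPLE))
    print("timeouts per QID:", dict(per_queue))
    print("seconds between consecutive timeouts:", gaps)
```

On the four lines above, the two timeouts on each queue are roughly 30 seconds apart, which matches the Linux NVMe driver's default I/O timeout (the `nvme_core.io_timeout` module parameter, 30 seconds by default). To run it on a live system, the output of `dmesg` can be passed in place of `SAMPLE`.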

This has occurred with the stock Debian-69 and v2.10.4 releases, and I just saw it with the latest v2.11.5 release. I do not see any additional info in the error logs, although there is a "missing or invalid SUBNQN field" message at startup. Otherwise the NVMe (Patriot P300) works well.

I have a 45W Samsung wall-wart USB PD supply, and have not seen any other issues with this card (indeed, it works well apart from this).

I saw these errors while re-installing the latest Debian packages using the provided install script. They all occurred during the final dpkg -i phase.

root@rose:~# dmesg | grep nvme
[    0.000000] Kernel command line: root=/dev/nvme0n1p4 rw console=tty0 console=ttyS0,115200 earlycon rootwait stmmaceth=chain_mode:1 selinux=0
[    4.440152] nvme nvme0: pci function 0001:01:00.0
[    4.451660] nvme 0001:01:00.0: enabling device (0000 -> 0002)
[    4.469585] nvme nvme0: missing or invalid SUBNQN field.
[    4.585406] nvme nvme0: allocated 64 MiB host memory buffer.
[    4.816532] nvme nvme0: 4/0/0 default/read/poll queues
[    4.854331]  nvme0n1: p1 p2 p3 p4
[    9.004199] EXT4-fs (nvme0n1p4): mounted filesystem with ordered data mode. Opts: (null). Quota mode: disabled.
[   10.105598] EXT4-fs (nvme0n1p4): re-mounted. Opts: (null). Quota mode: disabled.
[60386.019927] nvme nvme0: I/O 210 QID 4 timeout, completion polled
[60416.259353] nvme nvme0: I/O 195 QID 4 timeout, completion polled
[60479.458282] nvme nvme0: I/O 220 QID 2 timeout, completion polled
[60509.487744] nvme nvme0: I/O 222 QID 2 timeout, completion polled
root@rose:~# nvme list
Node                  Generic               SN                   Model                                    Namespace Usage                      Format           FW Rev
--------------------- --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1          /dev/ng0n1            P300ABBB22111821071  Patriot M.2 P300 256GB                   1       256.06  GB / 256.06  GB    512   B +  0 B   V0513A0
root@rose:~# uname -a
Linux rose.easytarget.org 5.15.310 #1 SMP Mon Mar 27 19:55:33 CEST 2023 riscv64 GNU/Linux

I build my own kernel to add USB serial and additional Bluetooth adaptors:

easytarget commented 1 year ago

See also here: https://forum.rvspace.org/t/nvme-i-o-timeouts/1545/ I'm going to try the workarounds suggested there and report back here.

theotheroracle commented 1 year ago

Using kernel 6.5.0-rc1, I'm getting many I/O timeouts, and reads are also very slow: 85mb/s.
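
To make read-speed figures like this comparable between kernels, here is a rough sketch of a sequential read timer. The function name `measure_read_throughput` is hypothetical, and the result is reported in MB/s with 1 MB taken as 10^6 bytes. The number only reflects the drive if the file is not already in the page cache, so treat it as a rough indication rather than a benchmark.

```python
import sys
import time


def measure_read_throughput(path, chunk_size=1024 * 1024):
    """Read `path` sequentially; return (bytes_read, seconds, MB_per_s)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 avoids Python's own buffer; the OS page cache still applies.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    mb_per_s = (total / 1e6) / elapsed if elapsed > 0 else float("inf")
    return total, elapsed, mb_per_s


if __name__ == "__main__" and len(sys.argv) > 1:
    size, secs, rate = measure_read_throughput(sys.argv[1])
    print(f"read {size} bytes in {secs:.2f} s ({rate:.1f} MB/s)")
```

Run it against a large file on the NVMe, ideally one bigger than RAM or read for the first time after boot, and compare the figure across kernel versions.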

ThomasKorimort commented 1 year ago

What modifications does the StarFive upstream kernel 5.15.0 have that make it faster than that? Maybe there is some problem with a native GNU/Linux 6.5 kernel running in Supervisor mode on the JH7110? (I have read that the JH7110 has a secure boot feature: https://forum.rvspace.org/t/crosscompile-or-not/3310/13.) Are some stealth processes running and slowing down resource access? Or is it simply an unoptimized 6.5 driver?