inindev / nanopi-r5

stock debian arm64 linux for the nanopi r5c & r5s

NVMe is not functional #23

Open · svpcom opened this issue 11 months ago

svpcom commented 11 months ago

After boot, no NVMe devices are found. If I trigger a PCI rescan via echo 1 > /sys/bus/pci/rescan, the NVMe device is found but non-functional:

[   46.034631] pcieport 0002:00:00.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[   46.035741] pci 0002:01:00.0: [15b7:5006] type 00 class 0x010802
[   46.036483] pci 0002:01:00.0: reg 0x10: [mem 0x00000000-0x00003fff 64bit]
[   46.037312] pci 0002:01:00.0: reg 0x20: [mem 0x00000000-0x000000ff 64bit]
[   46.039350] pci 0002:01:00.0: 2.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x1 link at 0002:00:00.0 (capable of 31.504 Gb/s with 8.0 GT/s PCIe x4 link)
[   46.052298] pci_bus 0002:01: busn_res: [bus 01-ff] end is updated to 01
[   46.053043] pcieport 0002:00:00.0: BAR 14: assigned [mem 0xf0300000-0xf03fffff]
[   46.053764] pci 0002:01:00.0: BAR 0: assigned [mem 0xf0300000-0xf0303fff 64bit]
[   46.054522] pci 0002:01:00.0: BAR 4: assigned [mem 0xf0304000-0xf03040ff 64bit]
[   46.097927] nvme nvme0: pci function 0002:01:00.0
[   46.098444] nvme 0002:01:00.0: enabling device (0000 -> 0002)
[  107.508575] nvme nvme0: I/O 12 QID 0 timeout, completion polled
[  168.958145] nvme nvme0: I/O 8 QID 0 timeout, completion polled
[  230.391719] nvme nvme0: I/O 13 QID 0 timeout, completion polled
[  242.664114] INFO: task kworker/u8:4:167 blocked for more than 120 seconds.
[  242.664761]       Not tainted 6.1.0-12-arm64 #1 Debian 6.1.52-1
[  242.665290] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.665983] task:kworker/u8:4    state:D stack:0     pid:167   ppid:2      flags:0x00000008
[  242.666733] Workqueue: events_unbound async_run_entry_fn
[  242.667225] Call trace:
[  242.667449]  __switch_to+0xf0/0x170
[  242.667771]  __schedule+0x340/0x940
[  242.668129]  schedule+0x58/0xf0
[  242.668421]  schedule_timeout+0x14c/0x180
[  242.668785]  __wait_for_common+0xd4/0x254
[  242.669148]  wait_for_completion+0x28/0x3c
[  242.669519]  __flush_work.isra.0+0x180/0x2dc
[  242.669907]  flush_work+0x18/0x2c
[  242.670209]  nvme_async_probe+0x24/0x44 [nvme]
[  242.670636]  async_run_entry_fn+0x40/0x1e0
[  242.671010]  process_one_work+0x1f4/0x460
[  242.671375]  worker_thread+0x188/0x4e0
[  242.671715]  kthread+0xe0/0xe4
[  242.672025]  ret_from_fork+0x10/0x20
[  242.672407] INFO: task (udev-worker):706 blocked for more than 120 seconds.
[  242.673027]       Not tainted 6.1.0-12-arm64 #1 Debian 6.1.52-1
[  242.673554] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.674245] task:(udev-worker)   state:D stack:0     pid:706   ppid:277    flags:0x0000080d
[  242.674992] Call trace:
[  242.675214]  __switch_to+0xf0/0x170
[  242.675535]  __schedule+0x340/0x940
[  242.675851]  schedule+0x58/0xf0
[  242.676169]  async_synchronize_cookie_domain+0xe8/0x150
[  242.676645]  async_synchronize_full+0x20/0x30
[  242.677041]  do_init_module+0x160/0x1f4
[  242.677390]  load_module+0x1cb8/0x2220
[  242.677730]  __do_sys_finit_module+0xac/0x130
[  242.678123]  __arm64_sys_finit_module+0x28/0x34
[  242.678530]  invoke_syscall+0x78/0x100
[  242.678874]  el0_svc_common.constprop.0+0xd4/0xf4
[  242.679299]  do_el0_svc+0x34/0xd0
[  242.679601]  el0_svc+0x34/0xd4
[  242.679905]  el0t_64_sync_handler+0xf4/0x120
[  242.680298]  el0t_64_sync+0x18c/0x190
[  263.763473] systemd-journald[243]: Oldest entry in /var/log/journal/69fd1ff5339ee7c1931e562d64ac7bcc/system.journal is older than the configured file retention duration (1month), suggesting rotation.
[  263.765169] systemd-journald[243]: /var/log/journal/69fd1ff5339ee7c1931e562d64ac7bcc/system.journal: Journal header limits reached or header out-of-date, rotating.
[  291.833349] nvme nvme0: I/O 9 QID 0 timeout, completion polled
[  353.274913] nvme nvme0: I/O 14 QID 0 timeout, completion polled
[  353.276365] hwmon hwmon2: temp1_input not attached to any thermal zone
[  363.499229] INFO: task kworker/u8:4:167 blocked for more than 241 seconds.
[  363.499892]       Not tainted 6.1.0-12-arm64 #1 Debian 6.1.52-1
[  363.500427] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.501129] task:kworker/u8:4    state:D stack:0     pid:167   ppid:2      flags:0x00000008
[  363.501890] Workqueue: events_unbound async_run_entry_fn
[  363.502394] Call trace:
[  363.502623]  __switch_to+0xf0/0x170
[  363.502956]  __schedule+0x340/0x940
[  363.503329]  schedule+0x58/0xf0
[  363.503631]  schedule_timeout+0x14c/0x180
[  363.504004]  __wait_for_common+0xd4/0x254
[  363.504377]  wait_for_completion+0x28/0x3c
[  363.504756]  __flush_work.isra.0+0x180/0x2dc
[  363.505154]  flush_work+0x18/0x2c
[  363.505465]  nvme_async_probe+0x24/0x44 [nvme]
[  363.505910]  async_run_entry_fn+0x40/0x1e0
[  363.506294]  process_one_work+0x1f4/0x460
[  363.506669]  worker_thread+0x188/0x4e0
[  363.507059]  kthread+0xe0/0xe4
[  363.507353]  ret_from_fork+0x10/0x20
[  363.507742] INFO: task (udev-worker):706 blocked for more than 241 seconds.
[  363.508370]       Not tainted 6.1.0-12-arm64 #1 Debian 6.1.52-1
[  363.508903] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.509602] task:(udev-worker)   state:D stack:0     pid:706   ppid:277    flags:0x0000080d
[  363.510359] Call trace:
[  363.510588]  __switch_to+0xf0/0x170
[  363.510916]  __schedule+0x340/0x940
[  363.511289]  schedule+0x58/0xf0
[  363.511590]  async_synchronize_cookie_domain+0xe8/0x150
[  363.512073]  async_synchronize_full+0x20/0x30
[  363.512478]  do_init_module+0x160/0x1f4
[  363.512834]  load_module+0x1cb8/0x2220
[  363.513182]  __do_sys_finit_module+0xac/0x130
[  363.513584]  __arm64_sys_finit_module+0x28/0x34
[  363.514002]  invoke_syscall+0x78/0x100
[  363.514355]  el0_svc_common.constprop.0+0xd4/0xf4
[  363.514791]  do_el0_svc+0x34/0xd0
[  363.515144]  el0_svc+0x34/0xd4
[  363.515442]  el0t_64_sync_handler+0xf4/0x120
[  363.515841]  el0t_64_sync+0x18c/0x190
[  414.724449] nvme nvme0: I/O 10 QID 0 timeout, completion polled
[  476.157964] nvme nvme0: I/O 15 QID 0 timeout, completion polled
[  484.334241] INFO: task kworker/u8:4:167 blocked for more than 362 seconds.
[  484.334908]       Not tainted 6.1.0-12-arm64 #1 Debian 6.1.52-1
[  484.335443] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  484.336141] task:kworker/u8:4    state:D stack:0     pid:167   ppid:2      flags:0x00000008
[  484.336903] Workqueue: events_unbound async_run_entry_fn
[  484.337409] Call trace:
[  484.337638]  __switch_to+0xf0/0x170
[  484.337968]  __schedule+0x340/0x940
[  484.338345]  schedule+0x58/0xf0
[  484.338647]  schedule_timeout+0x14c/0x180
[  484.339020]  __wait_for_common+0xd4/0x254
[  484.339394]  wait_for_completion+0x28/0x3c
[  484.339773]  __flush_work.isra.0+0x180/0x2dc
[  484.340170]  flush_work+0x18/0x2c
[  484.340481]  nvme_async_probe+0x24/0x44 [nvme]
[  484.340926]  async_run_entry_fn+0x40/0x1e0
[  484.341311]  process_one_work+0x1f4/0x460
[  484.341684]  worker_thread+0x188/0x4e0
[  484.342078]  kthread+0xe0/0xe4
[  484.342373]  ret_from_fork+0x10/0x20
[  484.342766] INFO: task (udev-worker):706 blocked for more than 362 seconds.
[  484.343397]       Not tainted 6.1.0-12-arm64 #1 Debian 6.1.52-1
[  484.343930] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  484.344629] task:(udev-worker)   state:D stack:0     pid:706   ppid:277    flags:0x0000080d
[  484.345387] Call trace:
[  484.345616]  __switch_to+0xf0/0x170
[  484.345944]  __schedule+0x340/0x940
[  484.346321]  schedule+0x58/0xf0
[  484.346624]  async_synchronize_cookie_domain+0xe8/0x150
[  484.347109]  async_synchronize_full+0x20/0x30
[  484.347514]  do_init_module+0x160/0x1f4
[  484.347870]  load_module+0x1cb8/0x2220
[  484.348217]  __do_sys_finit_module+0xac/0x130
[  484.348616]  __arm64_sys_finit_module+0x28/0x34
[  484.349033]  invoke_syscall+0x78/0x100
[  484.349386]  el0_svc_common.constprop.0+0xd4/0xf4
[  484.349820]  do_el0_svc+0x34/0xd0
[  484.350175]  el0_svc+0x34/0xd4
[  484.350473]  el0t_64_sync_handler+0xf4/0x120
[  484.350874]  el0t_64_sync+0x18c/0x190
[  537.599512] nvme nvme0: I/O 11 QID 0 timeout, completion polled

The NVMe device is a WD Red SN700.

inindev commented 1 month ago

How are Ubuntu and the WRTs working? Different DTBs?
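
(One way to compare what each OS is actually booting with, if you can get a shell on each system, is to read the device tree the running kernel was handed; these commands are only a suggestion for gathering that data:)

$ cat /proc/device-tree/model
$ cat /proc/device-tree/compatible | tr '\0' '\n'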

I have the r5s booting with both Samsung and Crucial NVMe drives using the upstream Debian kernel, version 6.1.0-23 in this case:

stock Debian:

$ uname -a
Linux deb-nanopi-r5s 6.1.0-23-arm64 #1 SMP Debian 6.1.99-1 (2024-07-15) aarch64 GNU/Linux

OpenWRT is not available as a release, only nightly builds, which are based on the 6.6.43 kernel:

nightly OpenWRT:

# uname -a
Linux owrt-nanopi-r5s 6.6.43 #0 SMP PREEMPT Sun Jul 28 17:51:47 2024 aarch64 GNU/Linux

My guess is that your Sandisk NVMe works better with a more recent kernel?

To test this theory, you could try a later Debian kernel: https://packages.debian.org/sid/kernel/linux-image-arm64

wget http://ftp.us.debian.org/debian/pool/main/l/linux-signed-arm64/linux-image-6.9.12-arm64_6.9.12-1_arm64.deb
sudo dpkg -i linux-image-6.9.12-arm64_6.9.12-1_arm64.deb
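
(After rebooting, a quick sanity check that the new kernel was actually picked up; whether the SD image's boot configuration switches to it automatically depends on how the image is set up, so treat this as verification rather than a guarantee:)

$ dpkg -l 'linux-image-6.9*'   # confirm the package installed cleanly
$ sudo reboot
$ uname -r                     # should report 6.9.12-arm64 if the new kernel booted
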
welovewebs commented 1 month ago

So just to clarify: boot with your latest image from SD and wget that sid kernel? After dpkg installs it, will the SD card boot that kernel automatically?

(I booted from the MMC to Ubuntu just to see, but there is nothing in /boot)

welovewebs commented 1 month ago

> My guess is that your Sandisk NVMe works better with a more recent kernel?

Surely FriendlyWrt and Ubuntu Focal aren't using such a recent kernel though?

inindev commented 1 month ago

Yes, my assumption is that you are booting from MMC and trying to access your installed Sandisk NVMe.

wget https://github.com/inindev/debian-image/releases/download/v12.6/nanopi-r5s_bookworm-12.6.img.xz
sudo su
xzcat nanopi-r5s_bookworm-12.6.img.xz > /dev/sdX

Then insert this MMC card into the device and boot from it.

ll /dev/nvme*
crw------- 1 root root 235, 0 Aug  3 03:53 /dev/nvme0
brw-rw---- 1 root disk 259, 0 Aug  3 03:53 /dev/nvme0n1
brw-rw---- 1 root disk 259, 1 Aug  3 03:53 /dev/nvme0n1p1

I am unsure why you are getting this:

pci 0002:01:00.0: calc_l12_pwron: Invalid T_PwrOn scale: 3
pci 0002:01:00.0: BAR0: error updating (0xf0200004 != 0xffffffff)
pci 0002:01:00.0: BAR0: error updating (high 0x00000000 != 0xffffffff)
pci 0002:01:00.0: BAR4: error updating (0xf0204004 != 0xffffffff)
pci 0002:01:00.0: BAR4: error updating (high 0x00000000 != 0xffffffff)

welovewebs commented 1 month ago

I am booting your Debian image from an SD card, but have Ubuntu (FriendlyCore) on the internal eMMC.

I just installed that 6.9 kernel and, again, the NVMe was initialised; but when I removed pcie_aspm=off from the boot flags, it was no longer initialised, even with the 6.9 kernel.
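
(For anyone else landing here: the workaround in question is the pcie_aspm=off kernel parameter. Below is a quick way to confirm it is active on the running kernel, plus a hypothetical example of where it could be added; the /boot/extlinux/extlinux.conf path is an assumption and depends on how your image is set up to boot:)

$ grep -o 'pcie_aspm=[a-z]*' /proc/cmdline
$ sudo sed -i '/^[[:space:]]*append/s/$/ pcie_aspm=off/' /boot/extlinux/extlinux.conf   # hypothetical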

Yeah, Google doesn't have much regarding that T_PwrOn error. Again, it's strange how those other two OSes can initialise it OK.

EDIT: just FYI, my NVMe is actually a WD Blue SN570 500GB with firmware 234110WD (shown by cat /sys/class/nvme/nvme0/model and cat /sys/class/nvme/nvme0/firmware_rev), which I see is the same as the SSD you ordered in September :)

inindev commented 1 month ago

I see my post: inindev commented on Sep 24, 2023

I can test the WD Blue again, but I won't get a chance to do so for a few days.

welovewebs commented 1 month ago

Fair enough. I'm curious whether you have the same dmesg output with your WD Blue. I ran a SMART check with nvme-cli which said the drive was OK. The T_PwrOn scale error makes sense given that disabling power management results in a working configuration, but the BAR errors I'm not sure about. Can they be related to power management?
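
(For reference, a SMART check like the one mentioned can be run with the nvme-cli package roughly as follows, assuming the controller shows up as /dev/nvme0:)

$ sudo apt install nvme-cli
$ sudo nvme smart-log /dev/nvme0                       # health and error counters
$ sudo nvme id-ctrl /dev/nvme0 | grep -E '^(mn|fr) '   # model and firmware revision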

In the meantime, it works, albeit without power management on the PCIe bus.

Thanks for this project too, by the way. Great to have vanilla Debian on this device!

inindev commented 1 month ago

I found my WD Blue NVMe and am now seeing the same thing you are with the stock Debian 6.1.0-22-arm64 kernel.

[    6.192192] pci_bus 0002:00: root bus resource [bus 00-0f]
[    6.192700] pci_bus 0002:00: root bus resource [io  0x200000-0x2fffff] (bus address [0xf0100000-0xf01fffff])
[    6.193582] pci_bus 0002:00: root bus resource [mem 0xf0200000-0xf1ffffff]
[    6.194205] pci_bus 0002:00: root bus resource [mem 0x380000000-0x3bfffffff] (bus address [0x40000000-0x7fffffff])
[    6.195223] pci 0002:00:00.0: [1d87:3566] type 01 class 0x060400
[    6.195790] pci 0002:00:00.0: reg 0x10: [mem 0x00000000-0x3fffffff]
[    6.196366] pci 0002:00:00.0: reg 0x14: [mem 0x00000000-0x3fffffff]
[    6.196939] pci 0002:00:00.0: reg 0x38: [mem 0x00000000-0x0000ffff pref]
[    6.197620] pci 0002:00:00.0: supports D1 D2
[    6.198015] pci 0002:00:00.0: PME# supported from D0 D1 D3hot
[    6.203456] pci_bus 0002:01: busn_res: can not insert [bus 01-ff] under [bus 00-0f] (conflicts with (null) [bus 00-0f])
[    6.204719] pci 0002:00:00.0: BAR 0: assigned [mem 0x380000000-0x3bfffffff]
[    6.205379] pci 0002:00:00.0: BAR 1: no space for [mem size 0x40000000]
[    6.205982] pci 0002:00:00.0: BAR 1: failed to assign [mem size 0x40000000]
[    6.206666] pci 0002:00:00.0: BAR 6: assigned [mem 0xf0200000-0xf020ffff pref]
[    6.207334] pci 0002:00:00.0: PCI bridge to [bus 01-ff]
[    6.210255] pcieport 0002:00:00.0: PME: Signaling with IRQ 46
[    6.211405] pcieport 0002:00:00.0: AER: enabled with IRQ 46

I also see that u-boot is initializing this device just fine:

=> nvme info
Device 0: Vendor: 0x15b7 Rev: 234110WD Prod: 23185D807137
            Type: Hard Disk
            Capacity: 476940.0 MB = 465.7 GB (976773168 x 512)

It looks timing-related, because a rescan is able to detect it:

# echo 1 > /sys/bus/pci/rescan
[  561.031428] nvme nvme0: pci function 0002:01:00.0
[  561.031946] nvme 0002:01:00.0: enabling device (0000 -> 0002)

# ls -al /dev/nvme0
crw------- 1 root root 510, 0 Jun 16 10:18 /dev/nvme0

followed by a failure a short time later:

[  621.553113] nvme nvme0: I/O 8 QID 0 timeout, disable controller
[  621.569577] nvme nvme0: Identify Controller failed (-4)
[  621.570116] nvme nvme0: Removing after probe failure status: -5

inindev commented 1 month ago

I did find that this procedure successfully initializes the WD Blue NVMe with the 6.1.0-22-arm64 kernel:

# echo 1 > /sys/bus/pci/devices/0002\:00\:00.0/remove
# echo 1 > /sys/bus/pci/rescan

Though it is not a solution, it does point to an initialization issue during boot.
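
(If someone needs the drive to come up on every boot before a proper fix lands, the remove/rescan sequence above could be wrapped in a one-shot systemd unit. This is only an illustrative sketch, not part of the image; the unit name is made up, and the bridge address 0002:00:00.0 is the one from the logs above and may differ on other setups:)

# /etc/systemd/system/nvme-pci-rescan.service (illustrative)
[Unit]
Description=Re-enumerate the PCIe bridge so the NVMe controller probes

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/sh -c 'echo 1 > /sys/bus/pci/devices/0002:00:00.0/remove; sleep 1; echo 1 > /sys/bus/pci/rescan'

[Install]
WantedBy=multi-user.target

Enable it with systemctl enable nvme-pci-rescan.service; any fstab entries or mount units for the drive would need to tolerate it appearing a little late.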