linux-nvme / nvme-cli

NVMe management command line interface.
https://nvmexpress.org
GNU General Public License v2.0

Fixed the SR-IOV fault of PM1733/PM1735. #1126

Closed. daiaji closed this issue 2 years ago.

daiaji commented 3 years ago
nvme list-ctrl /dev/nvme0 -n2
num of ctrls present: 1
[   0]:0x41

nvme virt-mgmt /dev/nvme0n2 -c 0x0001 -r0 -n2 -a8
success, Number of Controller Resources Modified (NRM):0x2

nvme virt-mgmt /dev/nvme0n2 -c 0x0001 -r1 -n2 -a8
success, Number of Controller Resources Modified (NRM):0x2

nvme virt-mgmt /dev/nvme0n2 -c 0x0001 -a9
NVMe status: INVALID_CTRL_ID: An invalid Controller Identifier was specified.(0x11f)

nvme list-secondary /dev/nvme0n2
Identify Secondary Controller List:
   NUMID       : Number of Identifiers           : 32
   SCEntry[0  ]:
................
     SCID      : Secondary Controller Identifier : 0x0001
     PCID      : Primary Controller Identifier   : 0x0041
     SCS       : Secondary Controller State      : 0x0000 (Offline)
     VFN       : Virtual Function Number         : 0x0001
     NVQ       : Num VQ Flex Resources Assigned  : 0x0002
     NVI       : Num VI Flex Resources Assigned  : 0x0002
   SCEntry[1  ]:
keithbusch commented 3 years ago

Does the primary controller require SR-IOV to be enabled prior to onlining its virtual controllers?

daiaji commented 3 years ago

@keithbusch You are right. After setting the number of VFs, the auxiliary controller can be brought online, but it seems that no matter which VF I pass through, no block device can be found in the VM.

I created and deleted the default namespace, then created two new namespaces and attached them to the 0x41 controller. (The Samsung PM1733 marks the 0x0041 controller as the active primary controller.)

nvme create-ns /dev/nvme0 -s 5358197520 -c 5358197520 -f 0 -d 0 -m 0
nvme create-ns /dev/nvme0 -s 2143279008 -c 2143279008 -f 0 -d 0 -m 0
nvme attach-ns /dev/nvme0 -n 1 -c 0x41
nvme attach-ns /dev/nvme0 -n 2 -c 0x41
nvme reset /dev/nvme0

After setting up the VQ/VI resources for the 0x0002 controller, I passed the 01:00.2 PCI device through to the VM, but the lsblk output in the guest OS showed no block device. I then tried assigning the VQ/VI resources to the 0x0001 controller and passing 01:00.2 through again; lsblk in the guest OS still showed no block device.

echo 4 > /sys/class/nvme/nvme0/device/sriov_numvfs
nvme virt-mgmt /dev/nvme0n2 -c 0x0002 -r0 -n2 -a8
nvme virt-mgmt /dev/nvme0n2 -c 0x0002 -r1 -n2 -a8
nvme virt-mgmt /dev/nvme0n2 -c 0x0002 -a9

nvme list-secondary /dev/nvme0
      SCID      : Secondary Controller Identifier : 0x0002
      PCID      : Primary Controller Identifier   : 0x0041
      SCS       : Secondary Controller State      : 0x0001 (Online)
      VFN       : Virtual Function Number         : 0x0002
      NVQ       : Num VQ Flex Resources Assigned  : 0x0002
      NVI       : Num VI Flex Resources Assigned  : 0x0002
   SCEntry[2  ]:

lspci | grep PM173X
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X
01:00.2 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X
01:00.3 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X
01:00.4 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X
01:00.5 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X

Did I do something wrong? Does each SSD that supports SR-IOV have a different method of enabling it? Do I need to use proprietary software outside the NVMe specification?

keithbusch commented 3 years ago

Your platform probably did not provide enough PCI buses through the root port. Can you see the PCI functions in lspci? If not, you will need to ask the kernel to re-enumerate the PCI bus; you can set those parameters at boot time. The kernel parameters should be "pci=realloc,assign-busses,nocrs", if I recall correctly. In some cases, though, the kernel may not be able to successfully renumber the bus, so even that might fail.
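For reference, a minimal sketch of making those boot parameters persistent on a GRUB-based distro (assumption: a Debian-style /etc/default/grub; demonstrated here on a temporary copy so nothing on the real system is modified):

```shell
PARAMS="pci=realloc,assign-busses,nocrs"
GRUB_FILE=$(mktemp)                                   # stand-in for /etc/default/grub
echo 'GRUB_CMDLINE_LINUX="quiet splash"' > "$GRUB_FILE"
# Append the parameters to GRUB_CMDLINE_LINUX if not already present
grep -q 'pci=realloc' "$GRUB_FILE" || \
  sed -i "s/^GRUB_CMDLINE_LINUX=\"\([^\"]*\)\"/GRUB_CMDLINE_LINUX=\"\1 $PARAMS\"/" "$GRUB_FILE"
cat "$GRUB_FILE"
# afterwards, on the real file: run update-grub (or grub2-mkconfig) and reboot
```
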

daiaji commented 3 years ago
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X (prog-if 02 [NVM Express])
        Subsystem: Samsung Electronics Co Ltd Device a801
        Flags: bus master, fast devsel, latency 0, IRQ 43, NUMA node 0, IOMMU group 15
        Memory at fcd10000 (64-bit, non-prefetchable) [size=32K]
        Expansion ROM at fcc00000 [disabled] [size=64K]
        Capabilities: [40] Power Management version 3
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [b0] MSI-X: Enable+ Count=64 Masked-
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [148] Device Serial Number 93-08-50-11-94-38-25-00
        Capabilities: [168] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [178] Secondary PCI Express
        Capabilities: [198] Physical Layer 16.0 GT/s <?>
        Capabilities: [1c0] Lane Margining at the Receiver <?>
        Capabilities: [1e8] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [3a4] Data Link Feature <?>
        Kernel driver in use: nvme

01:00.2 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X (prog-if 02 [NVM Express])
        Subsystem: Samsung Electronics Co Ltd Device a801
        Flags: fast devsel, NUMA node 0, IOMMU group 34
        Memory at fcc10000 (64-bit, non-prefetchable) [virtual] [size=32K]
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [b0] MSI-X: Enable- Count=580 Masked-
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [148] Alternative Routing-ID Interpretation (ARI)

01:00.3 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X (prog-if 02 [NVM Express])
        Subsystem: Samsung Electronics Co Ltd Device a801
        Flags: fast devsel, NUMA node 0, IOMMU group 35
        Memory at fcc18000 (64-bit, non-prefetchable) [virtual] [size=32K]
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [b0] MSI-X: Enable- Count=580 Masked-
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [148] Alternative Routing-ID Interpretation (ARI)

It seems I can.

keithbusch commented 3 years ago

Oh, gotcha. I had a platform that enumerated the VFs on a different bus number from the PF, so I didn't put together that those BDFs were your secondary controllers.

In your commands, it says you are attaching two namespaces to CNTID x41, but you are assigning CNTID x1 to the guest. Is that correct? If so, it doesn't sound like the guest would be able to see those namespaces.

daiaji commented 3 years ago

Do you mean the 0x0001 auxiliary controller reported by list-secondary? If so, then yes. Sorry, I don't really know how to map namespaces to VFs.

keithbusch commented 3 years ago

Yeah, I think you wanted to do something like nvme attach-ns /dev/nvme0 -n 1 -c 0x1 instead. If you can do multi-controller namespaces, you can attach to multiple controllers at the same time like nvme attach-ns /dev/nvme0 -n 1 -c 1,2,3,4

daiaji commented 3 years ago
nvme attach-ns /dev/nvme0 -n 2 -c 1,2,3,4
NVMe status: NS_IS_PRIVATE: The namespace is private and is already attached to one controller.(0x2119)

It does not seem possible to attach to multiple controllers.

keithbusch commented 3 years ago

If you want to do multiple controllers, and if the controller supports it (check primary's id-ctrl cmic value), then you should be able to enable that with the --nmic=1 option on the create-ns command.
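As an illustration of that check, the CMIC bits can be decoded from the id-ctrl output. The sample line below is invented (0x3), not read from this drive, and the bit meanings follow my reading of the Identify Controller CMIC definition (bit 0 multi-port subsystem, bit 1 multi-controller subsystem, bit 2 SR-IOV VF association):

```shell
# Sample id-ctrl line (hypothetical); on real hardware:
#   nvme id-ctrl /dev/nvme0 | grep cmic
cmic_line='cmic      : 0x3'
cmic=$(( $(echo "$cmic_line" | awk '{print $3}') ))   # parse the hex value
echo "multi-port subsystem : $(( cmic & 1 ))"
echo "multi-controller     : $(( (cmic >> 1) & 1 ))"
echo "sr-iov vf controller : $(( (cmic >> 2) & 1 ))"
```

If the multi-controller bit is set, create-ns with --nmic=1 should allow the namespace to be attached to more than one controller.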

daiaji commented 3 years ago

When I attach namespace 2 to main controller 1, the output of the id-ns command seems a little strange.

nvme id-ns /dev/nvme0 -n2
NVME Identify Namespace 2:
nsze    : 0
ncap    : 0
nuse    : 0
nsfeat  : 0
nlbaf   : 0
flbas   : 0
mc      : 0
dpc     : 0
dps     : 0
nmic    : 0
rescap  : 0
fpi     : 0
dlfeat  : 0
nawun   : 0
nawupf  : 0
nacwu   : 0
nabsn   : 0
nabo    : 0
nabspf  : 0
noiob   : 0
nvmcap  : 0
mssrl   : 0
mcl     : 0
msrc    : 0
anagrpid: 0
nsattr  : 0
nvmsetid: 0
endgid  : 0
nguid   : 00000000000000000000000000000000
eui64   : 0000000000000000
lbaf  0 : ms:0   lbads:0  rp:0 (in use)

This namespace doesn't seem to be working?

keithbusch commented 3 years ago

Try adding the -f parameter to id-ns.

daiaji commented 3 years ago

It seems that there are 66 main controllers. Do I have to attach the namespace to these controllers one by one and pass each through to the VM for testing? Is there no command to report the mapping between namespaces, controllers, and VFs?๐Ÿ˜ญ

keithbusch commented 3 years ago

I'm a little confused by your terminology. The spec uses "primary" and "secondary" controller terms. What is a "main" controller?

If you want to see which controller IDs of a subsystem are attached to a particular namespace ID, you can run, for example, nvme list-ctrl /dev/nvme0 -n 1 for namespace ID 1.
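For example, the attached controller IDs can be pulled out of that output with a little awk. The sample text below is hypothetical; on real hardware you would pipe `nvme list-ctrl /dev/nvme0 -n 1` in directly:

```shell
# Hypothetical list-ctrl output for one namespace
sample='num of ctrls present: 2
[   0]:0x1
[   1]:0x41'
# Keep only the "[ n]:0xID" lines and print the ID after the colon
ids=$(echo "$sample" | awk -F: '/^\[/ {print $2}')
echo "$ids"
```
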

daiaji commented 3 years ago

Is the mapping between controllers and VFs known only to the device manufacturer?

keithbusch commented 3 years ago

Correct, there is no spec guidance on how controller IDs are assigned to any particular controller within an NVM subsystem, which includes VFs.

daiaji commented 3 years ago

I attached the namespace to all the controllers one by one and passed them through to the VM, but I didn't seem to find any block devices in the VM.

keithbusch commented 3 years ago

Can you see the controllers and block device if you let the VFs bind to the host driver instead of a guest instance?

daiaji commented 3 years ago

It seems that only when I attach the namespace to the 0x41 controller can I find the block device on the host; if I attach the namespace to any other controller, the block device cannot be found on the host. This seems very strange. When I use a NIC's SR-IOV, the host can also use the VF devices.

keithbusch commented 3 years ago

While the VF is bound to the host driver, could you run nvme list -v?

daiaji commented 3 years ago
lspci
01:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X
01:00.2 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X
01:00.3 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X
01:00.4 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X
01:00.5 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X

nvme list -v
NVM Express Subsystems

Subsystem        Subsystem-NQN                                                                                    Controllers
---------------- ------------------------------------------------------------------------------------------------ ----------------
nvme-subsys0     nqn.1994-11.com.samsung:nvme:PM1733:2.5-inch:S4YPNG0R400619                                      nvme0
nvme-subsys1     nqn.2014.08.org.nvmexpress:80868086PHM274900219280AGN  INTEL SSDPE21D280GA                       nvme1

NVM Express Controllers

Device   SN                   MN                                       FR       TxPort Address        Subsystem    Namespaces      
-------- -------------------- ---------------------------------------- -------- ------ -------------- ------------ ----------------
nvme0    S4YPNG0R400619       SAMSUNG MZWLJ3T8HBLS-00007               EPK9AB5Q pcie   0000:01:00.0   nvme-subsys0 nvme0c0n1, nvme0c0n10, nvme0c0n11, nvme0c0n12, nvme0c0n13, nvme0c0n14, nvme0c0n15, nvme0c0n16, nvme0c0n17, nvme0c0n18, nvme0c0n19, nvme0c0n2, nvme0c0n20, nvme0c0n21, nvme0c0n22, nvme0c0n23, nvme0c0n24, nvme0c0n25, nvme0c0n26, nvme0c0n27, nvme0c0n28, nvme0c0n29, nvme0c0n3, nvme0c0n30, nvme0c0n31, nvme0c0n32, nvme0c0n4, nvme0c0n5, nvme0c0n6, nvme0c0n7, nvme0c0n8, nvme0c0n9
nvme1    PHM274900219280AGN   INTEL SSDPE21D280GA                      E2010325 pcie   0000:23:00.0   nvme-subsys1 nvme1n1

NVM Express Namespaces

Device       NSID     Usage                      Format           Controllers     
------------ -------- -------------------------- ---------------- ----------------
nvme0n1      1         54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n10     10        54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n11     11        54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n12     12        54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n13     13        54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n14     14        54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n15     15        54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n16     16        54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n17     17        54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n18     18        54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n19     19        54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n2      2         54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n20     20        54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n21     21        54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n22     22        54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n23     23        54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n24     24        54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n25     25        54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n26     26        54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n27     27        54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n28     28        54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n29     29        54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n3      3         54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n30     30        54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n31     31        54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n32     32        54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n4      4         54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n5      5         54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n6      6         54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n7      7         54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n8      8         54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme0n9      9         54.87  GB /  54.87  GB    512   B +  0 B   nvme0
nvme1n1      1        280.07  GB / 280.07  GB    512   B +  0 B   nvme1
keithbusch commented 3 years ago

It doesn't look like any VF controllers are bound to the host driver in this output. Are there any nvme errors indicated in the 'dmesg' for those functions?

daiaji commented 3 years ago
[   59.170930] pci 0000:01:00.2: [144d:a824] type 00 class 0x010802
[   59.171148] pci 0000:01:00.2: Adding to iommu group 34
[   59.171290] nvme nvme2: pci function 0000:01:00.2
[   59.171318] nvme 0000:01:00.2: enabling device (0000 -> 0002)
[   59.171324] pci 0000:01:00.3: [144d:a824] type 00 class 0x010802
[   59.171615] pci 0000:01:00.3: Adding to iommu group 35
[   59.171690] nvme nvme3: pci function 0000:01:00.3
[   59.171715] pci 0000:01:00.4: [144d:a824] type 00 class 0x010802
[   59.171740] nvme 0000:01:00.3: enabling device (0000 -> 0002)
[   59.171922] pci 0000:01:00.4: Adding to iommu group 36
[   59.171986] nvme nvme4: pci function 0000:01:00.4
[   59.172009] pci 0000:01:00.5: [144d:a824] type 00 class 0x010802
[   59.172034] nvme 0000:01:00.4: enabling device (0000 -> 0002)
[   59.172237] pci 0000:01:00.5: Adding to iommu group 37
[   59.172301] nvme nvme5: pci function 0000:01:00.5
[   59.172315] nvme 0000:01:00.5: enabling device (0000 -> 0002)
[   89.674264] nvme nvme5: Device not ready; aborting initialisation, CSTS=0x2
[   89.674270] nvme nvme5: Removing after probe failure status: -19
[   89.675266] nvme nvme2: Device not ready; aborting initialisation, CSTS=0x2
[   89.675272] nvme nvme2: Removing after probe failure status: -19
[   89.676219] nvme nvme4: Device not ready; aborting initialisation, CSTS=0x2
[   89.676219] nvme nvme3: Device not ready; aborting initialisation, CSTS=0x2
[   89.676223] nvme nvme3: Removing after probe failure status: -19
[   89.676223] nvme nvme4: Removing after probe failure status: -19
keithbusch commented 3 years ago

Okay, looks broken. I think you have to take this to the vendor at this point.

daiaji commented 3 years ago

https://stackoverflow.com/questions/65350988/how-to-setup-sr-iov-with-samsung-pm1733-1735-nvme-ssd It doesn't seem to be an isolated case; it may be a firmware fault.

igaw commented 2 years ago

I understand this has been 'resolved'. Closing the issue.

daiaji commented 2 years ago

It's actually really painful. Even now, it's unknown whether the fault is in the SSD's firmware or the driver. Since this is a second-hand SSD I bought, I can't find the corresponding customer support. But thank you for your answers; without them I might have wasted much more time.๐Ÿ˜€

0xabu commented 2 years ago

@daiaji FWIW I did get a PM1735 to work with SR-IOV. I found that you need to enable all 32 VFs (basically, cat sriov_totalvfs > sriov_numvfs). If you enable fewer, then the nvme virt-mgmt ... -a 9 command always fails to bring the secondary controller online. In the process of futzing around, I also got the controller into an unhappy state that was only resolved after a whole-system reboot, so maybe try that too if you haven't already.
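A sketch of that step, run here against a mock sysfs directory so it can execute anywhere. On the real system DEV would be /sys/class/nvme/nvme0/device, sriov_totalvfs is read-only, and sriov_numvfs must already be 0 before it can be changed:

```shell
DEV=$(mktemp -d)                      # stand-in for /sys/class/nvme/nvme0/device
echo 32 > "$DEV/sriov_totalvfs"       # mock value; the real file is provided by the kernel
echo 0  > "$DEV/sriov_numvfs"         # real hardware: VFs must be disabled before changing
# Enable every VF the device advertises
cat "$DEV/sriov_totalvfs" > "$DEV/sriov_numvfs"
cat "$DEV/sriov_numvfs"
```
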

daiaji commented 2 years ago

Thanks for your reply, I will try it.๐Ÿ˜˜

daiaji commented 2 years ago

@0xabu

nvme create-ns /dev/nvme0 -s 5358197520 -c 5358197520 -f 0 -d 0 -m 0
nvme create-ns /dev/nvme0 -s 2143279008 -c 2143279008 -f 0 -d 0 -m 0
nvme attach-ns /dev/nvme0 -n 1 -c 0x41
nvme attach-ns /dev/nvme0 -n 2 -c 0x41
nvme reset /dev/nvme0
cat /sys/class/nvme/nvme0/device/sriov_totalvfs > /sys/class/nvme/nvme0/device/sriov_numvfs
nvme virt-mgmt /dev/nvme0n2 -c 0x0001 -r0 -n2 -a8
nvme virt-mgmt /dev/nvme0n2 -c 0x0001 -r1 -n2 -a8
nvme virt-mgmt /dev/nvme0n2 -c 0x0001 -a9

Then I passed all the PCI devices from 01:00.2 through 01:00.7 to the QEMU guest, but I didn't find a block device in the guest.

piotrekz79 commented 2 years ago

All, thanks for the updates, I am fighting with a similar problem.

My nvme is SAMSUNG MZWLJ7T6HALA-00007, host Ubuntu 18.04 (5.4.0) and nvme version 2.0 (compiled from source), guest Ubuntu 20.04

I tried to follow [1]

https://lore.kernel.org/all/20211027164930.GC3331@lmaniak-dev.igk.intel.com/

so I had some extra steps for the primary controller 0x41

nvme virt-mgmt /dev/nvme0 -c 0x41 -r 1 -a 1 -n 0
nvme virt-mgmt /dev/nvme0 -c 0x41 -r 0 -a 1 -n 0
nvme reset /dev/nvme0
echo 1 > /sys/bus/pci/rescan

then, following @0xabu's suggestion (previously I had 4)

echo 32 > /sys/class/nvme/nvme0/device/sriov_numvfs

which results in lspci reporting new devices

1a:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a824
1a:00.1 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a824
...
1a:04.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a824

Note: following [1], I am not assigning VQ/VI to a namespace but to the nvmeX device, and this step works

nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -a 8 -n 1
nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 8 -n 2
nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 9 -n 0

nvme list-secondary /dev/nvme0 | head

Identify Secondary Controller List:
   NUMID       : Number of Identifiers           : 32
   SCEntry[0  ]:
................
     SCID      : Secondary Controller Identifier : 0x0001
     PCID      : Primary Controller Identifier   : 0x0041
     SCS       : Secondary Controller State      : 0x0001 (Online)
     VFN       : Virtual Function Number         : 0x0001
     NVQ       : Num VQ Flex Resources Assigned  : 0x0002
     NVI       : Num VI Flex Resources Assigned  : 0x0001

So far so good. Then I add the PCI device (1a:00.1) to the VM, and on the guest I see

lspci
00:08.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X

but no /dev/nvmeX in lsblk on the guest, and

[  687.514627] pci 0000:00:08.0: [144d:a824] type 00 class 0x010802
[  687.515493] pci 0000:00:08.0: reg 0x10: [mem 0x00000000-0x00007fff 64bit]
[  687.516197] pci 0000:00:08.0: enabling Extended Tags
[  687.517180] pci 0000:00:08.0: 0.000 Gb/s available PCIe bandwidth, limited by Unknown speed x0 link at 0000:00:08.0 (capable of 63.012 Gb/s with 16 GT/s x4 link)
[  687.519176] pci 0000:00:08.0: BAR 0: assigned [mem 0x440000000-0x440007fff 64bit]
[  687.716227] nvme nvme0: pci function 0000:00:08.0
[  687.716340] nvme 0000:00:08.0: enabling device (0000 -> 0002)
[  718.290794] nvme nvme0: Device not ready; aborting initialisation
[  718.294800] nvme nvme0: Removing after probe failure status: -19

in the meantime, on the host

May 13 16:47:48 pc-comp07 kernel: [ 5262.894466] vfio-pci 0000:1a:00.1: enabling device (0000 -> 0002)
May 13 16:48:13 pc-comp07 kernel: [ 5287.751826] nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
May 13 16:48:13 pc-comp07 kernel: [ 5288.036439] nvme nvme0: Shutdown timeout set to 10 seconds
May 13 16:48:13 pc-comp07 kernel: [ 5288.048857] nvme nvme0: 63/0/0 default/read/poll queues

which is confirmed by the secondary controller going offline

Identify Secondary Controller List:
   NUMID       : Number of Identifiers           : 32
   SCEntry[0  ]:
................
     SCID      : Secondary Controller Identifier : 0x0001
     PCID      : Primary Controller Identifier   : 0x0041
     SCS       : Secondary Controller State      : 0x0000 (Offline)
     VFN       : Virtual Function Number         : 0x0001
     NVQ       : Num VQ Flex Resources Assigned  : 0x0000
     NVI       : Num VI Flex Resources Assigned  : 0x0000

For the record, on the host

1a:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd Device [144d:a824] (prog-if 02 [NVM Express])
        Subsystem: Samsung Electronics Co Ltd Device [144d:a801]
        Physical Slot: 0-6
        Flags: bus master, fast devsel, latency 0, IRQ 41, NUMA node 0
        Memory at aae10000 (64-bit, non-prefetchable) [size=32K]
        Expansion ROM at aad00000 [disabled] [size=64K]
        Capabilities: [40] Power Management version 3
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [b0] MSI-X: Enable+ Count=64 Masked-
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [148] Device Serial Number 58-72-01-01-96-38-25-00
        Capabilities: [168] Alternative Routing-ID Interpretation (ARI)
        Capabilities: [178] #19
        Capabilities: [198] #26
        Capabilities: [1c0] #27
        Capabilities: [1e8] Single Root I/O Virtualization (SR-IOV)
        Capabilities: [3a4] #25
        Kernel driver in use: nvme
        Kernel modules: nvme

1a:00.1 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd Device [144d:a824] (prog-if 02 [NVM Express])
        Subsystem: Samsung Electronics Co Ltd Device [144d:a801]
        Physical Slot: 0-6
        Flags: fast devsel, NUMA node 0
        [virtual] Memory at aad10000 (64-bit, non-prefetchable) [size=32K]
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [b0] MSI-X: Enable- Count=580 Masked-
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [148] Alternative Routing-ID Interpretation (ARI)
        Kernel driver in use: vfio-pci
        Kernel modules: nvme

I have a different version of OS/kernel/qemu than in [1]; this week I hope to move the NVMe to a system where I can install Ubuntu 20.04/22.04.

Regards

0xabu commented 2 years ago
nvme create-ns /dev/nvme0 -s 5358197520 -c 5358197520 -f 0 -d 0 -m 0
nvme create-ns /dev/nvme0 -s 2143279008 -c 2143279008 -f 0 -d 0 -m 0
nvme attach-ns /dev/nvme0 -n 1 -c 0x41
nvme attach-ns /dev/nvme0 -n 2 -c 0x41

@daiaji here you attached the new namespaces to the primary controller (0x41)

nvme virt-mgmt /dev/nvme0n2 -c 0x0001 -r0 -n2 -a8
nvme virt-mgmt /dev/nvme0n2 -c 0x0001 -r1 -n2 -a8
nvme virt-mgmt /dev/nvme0n2 -c 0x0001 -a9

... and here you enabled secondary controller (virtual function) #1. You need to detach the namespaces from controller 0x41 and attach them to controller 1.
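A hypothetical sketch of that fix, echoed as a dry run (remove the echo in the run wrapper to execute against real hardware; the namespace and controller IDs are taken from the commands quoted above):

```shell
run() { echo "$@"; }                  # dry-run wrapper: drop `echo` to execute for real
# Move namespaces 1 and 2 from the primary (0x41) to secondary controller 0x1
for ns in 1 2; do
  run nvme detach-ns /dev/nvme0 -n "$ns" -c 0x41
  run nvme attach-ns /dev/nvme0 -n "$ns" -c 0x1
done
```
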

keithbusch commented 2 years ago
lspci
00:08.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X

but no /dev/nvmeX in lsblk on the guest, and

[  687.514627] pci 0000:00:08.0: [144d:a824] type 00 class 0x010802
[  687.515493] pci 0000:00:08.0: reg 0x10: [mem 0x00000000-0x00007fff 64bit]
[  687.516197] pci 0000:00:08.0: enabling Extended Tags
[  687.517180] pci 0000:00:08.0: 0.000 Gb/s available PCIe bandwidth, limited by Unknown speed x0 link at 0000:00:08.0 (capable of 63.012 Gb/s with 16 GT/s x4 link)
[  687.519176] pci 0000:00:08.0: BAR 0: assigned [mem 0x440000000-0x440007fff 64bit]
[  687.716227] nvme nvme0: pci function 0000:00:08.0
[  687.716340] nvme 0000:00:08.0: enabling device (0000 -> 0002)
[  718.290794] nvme nvme0: Device not ready; aborting initialisation
[  718.294800] nvme nvme0: Removing after probe failure status: -19

My first thought was the primary controller must be in some bad state.

in the meantime, on the host

May 13 16:47:48 pc-comp07 kernel: [ 5262.894466] vfio-pci 0000:1a:00.1: enabling device (0000 -> 0002)
May 13 16:48:13 pc-comp07 kernel: [ 5287.751826] nvme nvme0: controller is down; will reset: CSTS=0x3, PCI_STATUS=0x10
May 13 16:48:13 pc-comp07 kernel: [ 5288.036439] nvme nvme0: Shutdown timeout set to 10 seconds
May 13 16:48:13 pc-comp07 kernel: [ 5288.048857] nvme nvme0: 63/0/0 default/read/poll queues

And that appears to confirm it! If your controller reports CSTS.CFS as 1 (0x3 in your above output), a reset is the required operation to proceed. Spec says "If the primary controller associated with a secondary controller is disabled or undergoes a Controller Level Reset, then the secondary controller shall implicitly transition to the Offline state."

So, it looks like you'd need to manually re-online each secondary controller. That seems a bit fragile if the guest requires host assistance when it hasn't done anything wrong...
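A sketch of what that manual re-online step might look like, as a dry run. The secondary controller IDs and resource counts below are illustrative; the -r/-n/-a values mirror the virt-mgmt usage earlier in this thread:

```shell
run() { echo "$@"; }                                        # dry-run wrapper: drop `echo` to execute
for scid in 1 2 3 4; do                                     # secondary controller IDs (illustrative)
  run nvme virt-mgmt /dev/nvme0 -c "$scid" -r 0 -n 2 -a 8   # re-assign VQ flex resources
  run nvme virt-mgmt /dev/nvme0 -c "$scid" -r 1 -n 2 -a 8   # re-assign VI flex resources
  run nvme virt-mgmt /dev/nvme0 -c "$scid" -a 9             # set the secondary controller online
done
```
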

daiaji commented 2 years ago

@0xabu

nvme virt-mgmt /dev/nvme0 -c 0x41 -r 1 -a 1 -n 0
nvme virt-mgmt /dev/nvme0 -c 0x41 -r 0 -a 1 -n 0

nvme reset /dev/nvme0

echo 1 > /sys/bus/pci/rescan

echo 32 > /sys/class/nvme/nvme0/device/sriov_numvfs

nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -a 8 -n 1
nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 8 -n 2
nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 9 -n 0

nvme list-secondary /dev/nvme0 | head        
Identify Secondary Controller List:
   NUMID       : Number of Identifiers           : 32
   SCEntry[0  ]:
................
     SCID      : Secondary Controller Identifier : 0x0001
     PCID      : Primary Controller Identifier   : 0x0041
     SCS       : Secondary Controller State      : 0x0001 (Online)
     VFN       : Virtual Function Number         : 0x0001
     NVQ       : Num VQ Flex Resources Assigned  : 0x0002
     NVI       : Num VI Flex Resources Assigned  : 0x0001
   SCEntry[1  ]:

nvme attach-ns /dev/nvme0 -n 2 -c 0x0001

nvme list -v 
Subsystem        Subsystem-NQN                                                                                    Controllers
---------------- ------------------------------------------------------------------------------------------------ ----------------
nvme-subsys2     nqn.2014.08.org.nvmexpress:144d144dS4GCNE0R404170      SAMSUNG MZVLB2T0HALB-000L7                nvme2
nvme-subsys1     nqn.2014.08.org.nvmexpress:1e491cc1ZTA22T0KA220440DW3  ZHITAI TiPlus5000 2TB                     nvme1
nvme-subsys0     nqn.1994-11.com.samsung:nvme:PM1733:2.5-inch:S4YPNG0R400619                                      nvme0

Device   SN                   MN                                       FR       TxPort Address        Subsystem    Namespaces      
-------- -------------------- ---------------------------------------- -------- ------ -------------- ------------ ----------------
nvme2    S4GCNE0R404170       SAMSUNG MZVLB2T0HALB-000L7               4M2QEXG7 pcie   0000:23:00.0   nvme-subsys2 nvme2n1
nvme1    ZTA22T0KA220440DW3   ZHITAI TiPlus5000 2TB                    ZTA08322 pcie   0000:22:00.0   nvme-subsys1 nvme1n1
nvme0    S4YPNG0R400619       SAMSUNG MZWLJ3T8HBLS-00007               EPK9AB5Q pcie   0000:01:00.0   nvme-subsys0 nvme0n1

Device       Generic      NSID     Usage                      Format           Controllers     
------------ ------------ -------- -------------------------- ---------------- ----------------
/dev/nvme2n1 /dev/ng2n1   1        736.85  GB /   2.05  TB    512   B +  0 B   nvme2
/dev/nvme1n1 /dev/ng1n1   1          2.05  TB /   2.05  TB    512   B +  0 B   nvme1
/dev/nvme0n1 /dev/ng0n1   1          2.74  TB /   2.74  TB    512   B +  0 B   nvme0

lspci.log https://gist.github.com/daiaji/3cb114264a536b9aeb8ccf91c4ada887

After attaching namespace 2 to controller 1, I don't see /dev/nvme0n2 in the host. ๐Ÿ˜ญ

0xabu commented 2 years ago

After attaching namespace 2 to controller 1, I don't see /dev/nvme0n2 in the host.

I think that's expected. You can't attach the namespace to both host and guest controllers at the same time. I also noticed that 'nvme id-ns' shows all zeros unless the namespace is attached to the primary, but I can access it just fine in the guest.

daiaji commented 2 years ago

@0xabu @piotrekz79 @keithbusch I also noticed that my lspci output doesn't seem to include 10:00.1; it seems that 10:00.1 is skipped. I actually passed the rest of the VFs through to the guest, but it doesn't seem to work.

sudo nvme id-ctrl /dev/nvme0 | grep fr
fr        : EPK9AB5Q
frmw      : 0x17

Is my device firmware out of date? dmesg.log

0xabu commented 2 years ago

@daiaji I have fr EPK9CB5Q, but that's just what came on the card. I don't know of a public source for firmware updates.

daiaji commented 2 years ago

@daiaji I have fr EPK9CB5Q, but that's just what came on the card. I don't know of a public source for firmware updates.

It looks like the PM1733 and PM1735 just use different firmware; I checked some web pages and it seems I'm already on the newest firmware. So, what motherboard and CPU do you use? I guess it shouldn't have much to do with the Linux kernel version.

piotrekz79 commented 2 years ago

All, an update, sadly without any real success

The firmware is the same as @daiaji reported

root@power01:/home/ubuntu# sudo nvme id-ctrl /dev/$NVME | grep fr
fr        : EPK98B5Q
frmw      : 0x17

I installed the drive in a Supermicro AS-1114CS-TNR (motherboard H12SSW-AN6) because I was able to install Ubuntu 22.04 there. That, compared to the Intel system I tested previously, led to problems with the IOMMU. Namely, after creating the VFs, all of them landed in IOMMU group 0 (despite other devices having proper separation). I ended up installing a kernel with the ACS override patch:

https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF#Bypassing_the_IOMMU_groups_(ACS_override_patch)

https://liquorix.net/

uname -a
Linux power01 5.17.0-9.1-liquorix-amd64 #1 ZEN SMP PREEMPT liquorix 5.17-13ubuntu1~jammy (2022-05-18) x86_64 x86_64 x86_64 GNU/Linux

root@power01:/home/ubuntu# cat /proc/cmdline
audit=0 intel_pstate=disable hpet=disable  BOOT_IMAGE=/boot/vmlinuz-5.17.0-9.1-liquorix-amd64 root=UUID=72aa7786-199c-476e-a5fd-3cf6149cac62 ro amd_iommu=on iommu=pt pcie_acs_override=downstream,multifunction,id:144d:a824

As a result, I got proper separation and the ability to pass a device to a VM (see below for the IOMMU group).
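A small sketch of how that separation can be verified by listing the IOMMU group of each PCI function. It runs here against a mock directory tree; on a real host the glob is /sys/kernel/iommu_groups/*/devices/*, and the BDF/group numbers below are illustrative:

```shell
ROOT=$(mktemp -d)                                  # stand-in for /sys/kernel/iommu_groups
mkdir -p "$ROOT/0/devices/0000:c4:00.0" "$ROOT/97/devices/0000:c4:00.1"
# Print "BDF -> group N" for every function; VFs should each get their own group
for d in "$ROOT"/*/devices/*; do
  g=${d%/devices/*}
  printf '%s -> group %s\n' "${d##*/}" "${g##*/}"
done
```
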

I repeated the previous steps, running the VM guest on Ubuntu 22.04 as well, trying both (as @keithbusch suggested):

  1. bringing the secondary controller online on the host first, then adding PCI device c4:00.1 to the guest
  2. adding the PCI device to the guest first, then bringing the secondary controller online on the host
NVME=nvme3
nvme virt-mgmt /dev/$NVME -c 1 -r 1 -a 8 -n 1
nvme list-secondary /dev/$NVME | head
nvme virt-mgmt /dev/$NVME -c 1 -r 0 -a 8 -n 2
nvme list-secondary /dev/$NVME | head
nvme virt-mgmt /dev/$NVME -c 1 -r 0 -a 9 -n 0
nvme list-secondary /dev/$NVME | head
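
For reference, the -r (resource type) and -a (action) codes in the virt-mgmt calls above correspond to fields of the NVMe Virtualization Management command. A small decoder capturing the mapping as I understand it from the spec and the nvme-cli docs (the names are paraphrased, so treat them as an assumption and check your spec revision):

```shell
# Decode nvme virt-mgmt -a (action) values.
virt_mgmt_action() {
    case $1 in
        1) echo "primary controller flexible allocation" ;;
        7) echo "secondary controller offline" ;;
        8) echo "secondary controller assign" ;;
        9) echo "secondary controller online" ;;
        *) echo "unknown" ;;
    esac
}

# Decode nvme virt-mgmt -r (resource type) values.
virt_mgmt_rsrc() {
    case $1 in
        0) echo "VQ resources" ;;
        1) echo "VI resources" ;;
        *) echo "unknown" ;;
    esac
}
```

So the sequence above assigns VQ and VI flex resources to controller 1 and then tries to bring it online.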

In both cases the result was the same: I can see the device on the guest in lspci but not in lsblk, and I get an nvme nvme0: Device not ready; aborting initialisation, CSTS=0x2 error on the guest.

host

May 20 11:12:12 power01 kernel: vfio-pci 0000:c4:00.1: enabling device (0000 -> 0002)
c4:00.1 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd NVMe SSD Controller PM173X [144d:a824] (prog-if 02 [NVM Express])
        Subsystem: Samsung Electronics Co Ltd NVMe SSD Controller PM173X [144d:a801]
        Physical Slot: 7
        Flags: fast devsel, NUMA node 0, IOMMU group 97
        Memory at b7210000 (64-bit, non-prefetchable) [virtual] [size=32K]
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [b0] MSI-X: Enable- Count=580 Masked-
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [148] Alternative Routing-ID Interpretation (ARI)
        Kernel driver in use: vfio-pci

guest

[Fri May 20 11:12:12 2022] pci 0000:07:00.0: [144d:a824] type 00 class 0x010802
[Fri May 20 11:12:12 2022] pci 0000:07:00.0: reg 0x10: [mem 0x00000000-0x00007fff 64bit]
[Fri May 20 11:12:12 2022] pci 0000:07:00.0: Max Payload Size set to 128 (was 512, max 512)
[Fri May 20 11:12:12 2022] pci 0000:07:00.0: 2.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x1 link at 0000:00:02.6 (capable of 63.012 Gb/s with 16.0 GT/s PCIe x4 link)
[Fri May 20 11:12:12 2022] pci 0000:07:00.0: BAR 0: assigned [mem 0xfdc00000-0xfdc07fff 64bit]
[Fri May 20 11:12:12 2022] nvme nvme0: pci function 0000:07:00.0
[Fri May 20 11:12:12 2022] nvme 0000:07:00.0: enabling device (0000 -> 0002)
[Fri May 20 11:12:42 2022] nvme nvme0: Device not ready; aborting initialisation, CSTS=0x2
[Fri May 20 11:12:42 2022] nvme nvme0: Removing after probe failure status: -19

As a quick test, I also removed all VFs and added the whole 7.2TB drive to a guest; it showed up immediately, without rebooting the guest, etc.

ubuntu@test01:~$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
loop0     7:0    0 61.9M  1 loop /snap/core20/1434
loop1     7:1    0 44.7M  1 loop /snap/snapd/15534
loop2     7:2    0 79.9M  1 loop /snap/lxd/22923
sr0      11:0    1  368K  0 rom
vda     252:0    0   10G  0 disk
โ”œโ”€vda1  252:1    0  9.9G  0 part /
โ”œโ”€vda14 252:14   0    4M  0 part
โ””โ”€vda15 252:15   0  106M  0 part /boot/efi
nvme0n1 259:1    0    7T  0 disk

I am running out of ideas; I can try to ask my supplier to contact Samsung.

piotrekz79 commented 2 years ago

@0xabu I also tried attaching the namespace to the secondary controller only (which I first brought online):

nvme delete-ns /dev/nvme3n1
nvme create-ns /dev/nvme3 -b 4096 -s 1073741824 -c 1073741824

then, as you mentioned, we do not see it on the host:

root@power01:/home/ubuntu# nvme attach-ns -n1 -c0x0001 /dev/nvme3
attach-ns: Success, nsid:1

root@power01:/home/ubuntu# nvme list -v
Subsystem        Subsystem-NQN                                                                                    Controllers
---------------- ------------------------------------------------------------------------------------------------ ----------------
nvme-subsys3     nqn.1994-11.com.samsung:nvme:PM1733:2.5-inch:S546NE0N602916                                      nvme3
nvme-subsys2     nqn.2016-08.com.micron:nvme:nvm-subsystem-sn-21162E8B4C8B                                        nvme2
nvme-subsys1     nqn.2016-08.com.micron:nvme:nvm-subsystem-sn-2135312AD01B                                        nvme1
nvme-subsys0     nqn.2016-08.com.micron:nvme:nvm-subsystem-sn-2135312ACF69                                        nvme0

Device   SN                   MN                                       FR       TxPort Address        Subsystem    Namespaces
-------- -------------------- ---------------------------------------- -------- ------ -------------- ------------ ----------------
nvme3    S546NE0N602916       SAMSUNG MZWLJ7T6HALA-00007               EPK98B5Q pcie   0000:c4:00.0   nvme-subsys3
nvme2    21162E8B4C8B         Micron_9300_MTFDHAL3T2TDR                11300DU0 pcie   0000:c3:00.0   nvme-subsys2 nvme2n1
nvme1    2135312AD01B         Micron_9300_MTFDHAL3T2TDR                11300DU0 pcie   0000:c2:00.0   nvme-subsys1 nvme1n1
nvme0    2135312ACF69         Micron_9300_MTFDHAL3T2TDR                11300DU0 pcie   0000:c1:00.0   nvme-subsys0 nvme0n1

but when adding the PCI device to the guest I get the same "Device not ready" error:

May 20 13:29:43 test01 kernel: [ 7824.098540] pci 0000:07:00.0: [144d:a824] type 00 class 0x010802
May 20 13:29:43 test01 kernel: [ 7824.098650] pci 0000:07:00.0: reg 0x10: [mem 0x00000000-0x00007fff 64bit]
May 20 13:29:43 test01 kernel: [ 7824.098905] pci 0000:07:00.0: Max Payload Size set to 128 (was 512, max 512)
May 20 13:29:43 test01 kernel: [ 7824.100214] pci 0000:07:00.0: 2.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x1 link at 0000:00:02.6 (capable of 63.012 Gb/s with 16.0 GT/s PCIe x4 link)
May 20 13:29:43 test01 kernel: [ 7824.104907] pci 0000:07:00.0: BAR 0: assigned [mem 0xfdc00000-0xfdc07fff 64bit]
May 20 13:29:43 test01 kernel: [ 7824.106971] nvme nvme0: pci function 0000:07:00.0
May 20 13:29:43 test01 kernel: [ 7824.107000] nvme 0000:07:00.0: enabling device (0000 -> 0002)
May 20 13:30:14 test01 kernel: [ 7854.744864] nvme nvme0: Device not ready; aborting initialisation, CSTS=0x2
May 20 13:30:14 test01 kernel: [ 7854.746425] nvme nvme0: Removing after probe failure status: -19

regards

daiaji commented 2 years ago

@piotrekz79 It seems that the server motherboard has the same fault, and I can only hope that Samsung will reply. 😭

iaGuoZhi commented 1 year ago

Hi, I have also come across the same fault. Has Samsung replied?

@piotrekz79 It seems that the server motherboard has the same fault, and I can only hope that Samsung will reply. 😭

daiaji commented 1 year ago

@iaGuoZhi No 😭

Yiyuan-Dong commented 1 year ago

@0xabu Hi, recently I got a PM1735 to play with. It seems that I have brought a secondary controller online, but I failed to expose it to a VM. Could you please share how you pass the VF to the VM?

First, I tried to enable a secondary controller (controller 0x1), and the commands seemed to work fine.

cd /sys/class/nvme/nvme0/device
sudo bash -c "echo 32 > sriov_numvfs"
sudo nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -n 2 -a 8
sudo nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -n 2 -a 8
sudo nvme virt-mgmt /dev/nvme0 -c 1 -a 9
lspci | grep Non

5e:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X
5e:00.1 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X
...
5e:04.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173X
60:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [Optane]
sudo nvme list-secondary /dev/nvme0 | head
Identify Secondary Controller List:
   NUMID       : Number of Identifiers           : 32
   SCEntry[0  ]:
................
     SCID      : Secondary Controller Identifier : 0x0001
     PCID      : Primary Controller Identifier   : 0x0041
     SCS       : Secondary Controller State      : 0x0001 (Online)
     VFN       : Virtual Function Number         : 0x0001
     NVQ       : Num VQ Flex Resources Assigned  : 0x0002
     NVI       : Num VI Flex Resources Assigned  : 0x0002

Then I tried to use VFIO to pass the VF to the VM:

sudo modprobe vfio-pci

sudo bash -c 'echo 144d a824 > /sys/bus/pci/drivers/vfio-pci/new_id'

And I did get some vfio-pci devices:

dyy@r742:/sys/bus/pci/drivers/vfio-pci$ ls
0000:5e:00.1  0000:5e:00.6  0000:5e:01.3  0000:5e:02.0  0000:5e:02.5  0000:5e:03.2  0000:5e:03.7  new_id
0000:5e:00.2  0000:5e:00.7  0000:5e:01.4  0000:5e:02.1  0000:5e:02.6  0000:5e:03.3  0000:5e:04.0  remove_id
0000:5e:00.3  0000:5e:01.0  0000:5e:01.5  0000:5e:02.2  0000:5e:02.7  0000:5e:03.4  0000:60:00.0  uevent
0000:5e:00.4  0000:5e:01.1  0000:5e:01.6  0000:5e:02.3  0000:5e:03.0  0000:5e:03.5  bind          unbind
0000:5e:00.5  0000:5e:01.2  0000:5e:01.7  0000:5e:02.4  0000:5e:03.1  0000:5e:03.6  module

Then I passed the VFIO device to QEMU, using a command like:

qemu/build/qemu-system-x86_64 \
        -kernel ../guest/linux-5.15/arch/x86_64/boot/bzImage \
        -cpu qemu64 -smp 2 \
        -m 9G \
        -initrd ../files/initramfs.cpio.gz \
        -nographic \
        -append "console=ttyS0 root=/dev/vda nokaslr" \
        -enable-kvm \
        -netdev user,id=net0 -device virtio-net-pci,netdev=net0 \
        -device vfio-pci,host=0000:5e:00.1

But after I execute the command above, the host crashes during VM boot and I have to cold-reboot the server to make the PM1735 available again. I think maybe the secondary controller is different from the primary controller and should be exposed by a different method, but I have no idea what I should do.

I have also tried to expose the VF to the host, that is, to bind the secondary controller as a normal PCI NVMe device:

sudo bash -c "echo -n 0000:5e:00.1 > /sys/bus/pci/drivers/nvme/bind"

But again the host server crashed and I had to power-cycle it. Did I miss anything in how SR-IOV should be used?

0xabu commented 1 year ago

@Yiyuan-Dong I no longer have access to the hardware, but note that I had a PM1735 (not 1733). If it helps, here are some excerpts from scripts I had written:

To create/populate a namespace:

NVME_DEV=/dev/nvme0
SIZE_GB=512
BLOCK_SIZE=4096
VFNID=0

# Get the ID of the primary (host) controller
HOST_CNTLID=$(nvme id-ctrl $NVME_DEV -o json | jq .cntlid)

# Get the ID of the secondary (virtual function) controller
VIRT_CNTLID=$(nvme list-secondary $NVME_DEV -o json | jq '."secondary-controllers"[]|select(."virtual-function-number"=='$((VFNID + 1))')."secondary-controller-identifier"')

SIZE_BLOCKS=$((SIZE_GB * 1000000000 / BLOCK_SIZE))

echo "Primary controller ID:   $HOST_CNTLID"
echo "Secondary controller ID: $VIRT_CNTLID"
echo "Creating namespace with $SIZE_BLOCKS $BLOCK_SIZE-byte blocks..."

# Create the new namespace, and capture the output, which should be "create-ns: Success, created nsid:32"
out=$(nvme create-ns $NVME_DEV --nsze=$SIZE_BLOCKS --ncap=$SIZE_BLOCKS --block-size $BLOCK_SIZE)

NSID=${out#*created nsid:}
echo "Created namespace $NSID"

# Attach the namespace to the host
nvme attach-ns $NVME_DEV -n $NSID -c $HOST_CNTLID

# Wait for the namespace to populate
while true; do
    NSDEV=$(nvme list -o json | jq -r ".Devices[]|select(.NameSpace==$NSID).DevicePath")
    if [ -n "$NSDEV" ]; then
        break
    fi

    sleep 1
    nvme ns-rescan $NVME_DEV
done

# Partition/format/image the new namespace
(...)

# Detach the new namespace from the primary controller
nvme detach-ns $NVME_DEV -c $HOST_CNTLID -n $NSID

# Attach it to the secondary controller
nvme attach-ns $NVME_DEV -c $VIRT_CNTLID -n $NSID
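
Two pieces of the script above are pure shell and can be sanity-checked without the drive: the decimal-GB-to-blocks arithmetic and the parameter-expansion parsing of the create-ns output (the sample output string below is assumed, matching the script's own comment; a real run prints the newly allocated nsid):

```shell
# GB-to-block-count conversion, as used for --nsze/--ncap above.
# 512 decimal GB at 4096-byte blocks should be 125,000,000 blocks.
SIZE_GB=512
BLOCK_SIZE=4096
SIZE_BLOCKS=$((SIZE_GB * 1000000000 / BLOCK_SIZE))

# Extracting the nsid from the create-ns output with parameter expansion.
# (Sample string assumed for illustration.)
out="create-ns: Success, created nsid:32"
NSID=${out#*created nsid:}
```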

Booting a guest VM:

local sysfsdir=/sys/bus/pci/devices/$NVME_PF
local numvfs
read numvfs < $sysfsdir/sriov_numvfs
if [ $numvfs -eq 0 ]; then
  # XXX: assign VQ & VI resources for all the controllers we might need, before enabling any
  # (if we assign these resources later, then the command to online the secondary fails)
  local vq_max=$(nvme primary-ctrl-caps $nvme_dev -o json | jq .vqfrsm)
  local vi_max=$(nvme primary-ctrl-caps $nvme_dev -o json | jq .vifrsm)
  local cid
  for cid in $(nvme list-secondary $nvme_dev -o json | jq '."secondary-controllers"[]|select(."virtual-function-number" <= 4)."secondary-controller-identifier"'); do
    nvme virt-mgmt $nvme_dev -c $cid -r 0 -n $vq_max -a 8 > /dev/null
    nvme virt-mgmt $nvme_dev -c $cid -r 1 -n $vi_max -a 8 > /dev/null
  done

  # prevent probing of virtual function drivers, then create all the VFs
  echo -n 0 > $sysfsdir/sriov_drivers_autoprobe
  cat $sysfsdir/sriov_totalvfs > $sysfsdir/sriov_numvfs
fi

# Bring the secondary controller online
local cid=$(nvme list-secondary $nvme_dev -o json | jq '."secondary-controllers"[]|select(."virtual-function-number"=='$(($2 + 1))')."secondary-controller-identifier"')
nvme virt-mgmt $nvme_dev -c $cid -a 9 > /dev/null

# find the PCI ID of the VF
local vfnid=$(basename $(readlink $sysfsdir/virtfn$VFNID))

# bind to vfio-pci
echo vfio-pci > /sys/bus/pci/devices/$vfnid/driver_override
echo $vfnid >/sys/bus/pci/drivers_probe

# Now invoke QEMU, passing "-device vfio-pci,host=$vfnid"
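
The virtfn symlink resolution near the end can be factored into a small helper; a sketch (the PF directory is a parameter only so the logic can be tried against a mock sysfs tree; on a real system it would be /sys/bus/pci/devices/$NVME_PF):

```shell
# Resolve the PCI address of VF number $2 under the PF sysfs directory $1,
# mirroring the readlink/basename step in the script above. Each virtfn<N>
# entry is a symlink pointing at the VF's own PCI device directory.
vf_pci_addr() {
    basename "$(readlink "$1/virtfn$2")"
}
```
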
igaw commented 1 year ago

These instructions look pretty awesome! I wonder if it would be a good idea to get these (maybe partially) added to blktests? Would this work against the soft target implementation of Linux? We certainly lack such complex tests...

Yiyuan-Dong commented 1 year ago

@0xabu Thanks a lot for your help! Your scripts look awesome, though I still hit the same problem after using them...

I'm wondering if it's a matter of kernel configuration or BIOS configuration. I have enabled SR-IOV and the IOMMU in both the BIOS and the kernel configuration. Are there any other settings that must be taken care of? Could you please share the kernel version that succeeded in bringing up the drive?

0xabu commented 1 year ago

@Yiyuan-Dong the guest kernel was Ubuntu 22.04 5.17.0-8. I believe the host was the same or similar, but don't recall for sure.

Yiyuan-Dong commented 1 year ago

@0xabu Thank you so much for the speedy reply.

Yiyuan-Dong commented 1 year ago

Update: After I moved the drive from the rackmount server to my PC, I found that my PC would not crash when it tried to bind the drive, and the kernel log showed the nvme driver behaving strangely. I'll put the related kernel log here, hoping someone knows what happened, or at least to let others know I'm hitting the same problem.

Now I use the following script to try to bind the first secondary controller to the host, since VFs should be treated as hot-plugged PCI devices by the kernel.

#!/bin/bash

nvme virt-mgmt /dev/nvme2 -c 65 -r 1 -a 1 -n 0
nvme virt-mgmt /dev/nvme2 -c 65 -r 0 -a 1 -n 0
sudo nvme reset /dev/nvme2

sudo nvme virt-mgmt /dev/nvme2 -c 1 -r 0 -n 9 -a 8
sudo nvme virt-mgmt /dev/nvme2 -c 1 -r 1 -n 9 -a 8

sudo bash -c "echo 0 > /sys/bus/pci/devices/0000:07:00.0/sriov_drivers_autoprobe" # no autoprobe
sudo bash -c "echo 32 > /sys/class/nvme/nvme2/device/sriov_numvfs" # enable VFs
sudo nvme virt-mgmt /dev/nvme2 -c 1 -a 9
sudo nvme list-secondary /dev/nvme2 | head

vfnid=0000:07:00.1
echo nvme > /sys/bus/pci/devices/$vfnid/driver_override
echo $vfnid > /sys/bus/pci/drivers_probe

After I execute the script, the log shows:

[  188.713739] nvme nvme4: pci function 0000:07:00.1
[  188.714064] nvme 0000:07:00.1: enabling device (0000 -> 0002)
[  216.443974] watchdog: BUG: soft lockup - CPU#4 stuck for 26s! [kworker/u40:7:570]
[  216.443977] Modules linked in: snd_seq_dummy snd_hrtimer mlx4_ib ib_uverbs ib_core nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr sunrpc vfat fat intel_rapl_msr iTCO_wdt pmt_telemetry intel_pmc_bxt ee1004 pmt_class mei_hdcp iTCO_vendor_support intel_rapl_common intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass rapl intel_cstate intel_uncore eeepc_wmi asus_wmi sparse_keymap platform_profile pcspkr rfkill wmi_bmof snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_soc_hdac_hda snd_hda_codec_hdmi snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi snd_hda_codec_realtek soundwire_bus snd_soc_core snd_hda_codec_generic
[  216.443997]  ledtrig_audio snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer i2c_i801 snd i2c_smbus soundcore mei_me mei idma64 mlx4_core joydev intel_pmt acpi_tad acpi_pad zram ip_tables i915 i2c_algo_bit ttm drm_kms_helper cec crct10dif_pclmul crc32_pclmul crc32c_intel drm r8169 nvme nvme_core ghash_clmulni_intel vmd wmi video pinctrl_alderlake fuse
[  216.444009] CPU: 4 PID: 570 Comm: kworker/u40:7 Kdump: loaded Not tainted 5.16.12+ #9
[  216.444011] Hardware name: ASUS System Product Name/PRIME Z690-P D4, BIOS 0407 09/13/2021
[  216.444011] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[  216.444015] RIP: 0010:pci_mmcfg_read+0xac/0xd0
[  216.444018] Code: 5d 41 5c 41 5d 41 5e 41 5f c3 4c 01 e0 66 8b 00 0f b7 c0 89 45 00 eb e0 4c 01 e0 8a 00 0f b6 c0 89 45 00 eb d3 4c 01 e0 8b 00 <89> 45 00 eb c9 e8 2a 4c 55 ff 5b c7 45 00 ff ff ff ff b8 ea ff ff
[  216.444019] RSP: 0018:ffffb3e701157c88 EFLAGS: 00000286
[  216.444020] RAX: 00000000ffffffff RBX: 0000000000701000 RCX: 0000000000000ffc
[  216.444020] RDX: 00000000000000ff RSI: 0000000000000007 RDI: 0000000000000000
[  216.444021] RBP: ffffb3e701157cc4 R08: 0000000000000004 R09: ffffb3e701157cc4
[  216.444021] R10: ffffb3e701157b18 R11: 0000000000000007 R12: 0000000000000ffc
[  216.444022] R13: 0000000000001000 R14: 0000000000000004 R15: 0000000000000000
[  216.444022] FS:  0000000000000000(0000) GS:ffff93092f300000(0000) knlGS:0000000000000000
[  216.444023] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  216.444023] CR2: 00007f09bdb3e4e0 CR3: 000000044e810002 CR4: 0000000000770ee0
[  216.444024] PKRU: 55555554
[  216.444024] Call Trace:
[  216.444025]  <TASK>
[  216.444026]  pci_bus_read_config_dword+0x36/0x50
[  216.444029]  pci_find_next_ext_capability.part.0.cold+0x87/0x93
[  216.444031]  pci_save_vc_state+0x25/0x90
[  216.444032]  pci_save_state+0x106/0x280
[  216.444034]  nvme_reset_work+0x313/0x12a0 [nvme]
[  216.444036]  ? resched_curr+0x20/0xb0
[  216.444038]  ? check_preempt_curr+0x2f/0x70
[  216.444039]  ? ttwu_do_wakeup+0x17/0x160
[  216.444040]  ? _raw_spin_unlock_irqrestore+0x25/0x40
[  216.444042]  ? try_to_wake_up+0x84/0x570
[  216.444043]  process_one_work+0x1e5/0x3c0
[  216.444045]  worker_thread+0x50/0x3b0
[  216.444046]  ? rescuer_thread+0x370/0x370
[  216.444047]  kthread+0x169/0x190
[  216.444048]  ? set_kthread_struct+0x40/0x40
[  216.444048]  ret_from_fork+0x1f/0x30
[  216.444051]  </TASK>
[  244.443980] watchdog: BUG: soft lockup - CPU#4 stuck for 52s! [kworker/u40:7:570]
[  244.443981] Modules linked in: snd_seq_dummy snd_hrtimer mlx4_ib ib_uverbs ib_core nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink qrtr sunrpc vfat fat intel_rapl_msr iTCO_wdt pmt_telemetry intel_pmc_bxt ee1004 pmt_class mei_hdcp iTCO_vendor_support intel_rapl_common intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass rapl intel_cstate intel_uncore eeepc_wmi asus_wmi sparse_keymap platform_profile pcspkr rfkill wmi_bmof snd_sof_pci_intel_tgl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd_sof snd_soc_hdac_hda snd_hda_codec_hdmi snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi snd_hda_codec_realtek soundwire_bus snd_soc_core snd_hda_codec_generic
[  244.444003]  ledtrig_audio snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer i2c_i801 snd i2c_smbus soundcore mei_me mei idma64 mlx4_core joydev intel_pmt acpi_tad acpi_pad zram ip_tables i915 i2c_algo_bit ttm drm_kms_helper cec crct10dif_pclmul crc32_pclmul crc32c_intel drm r8169 nvme nvme_core ghash_clmulni_intel vmd wmi video pinctrl_alderlake fuse
[  244.444016] CPU: 4 PID: 570 Comm: kworker/u40:7 Kdump: loaded Tainted: G             L    5.16.12+ #9
[  244.444017] Hardware name: ASUS System Product Name/PRIME Z690-P D4, BIOS 0407 09/13/2021
[  244.444017] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[  244.444019] RIP: 0010:pci_mmcfg_read+0xac/0xd0
[  244.444021] Code: 5d 41 5c 41 5d 41 5e 41 5f c3 4c 01 e0 66 8b 00 0f b7 c0 89 45 00 eb e0 4c 01 e0 8a 00 0f b6 c0 89 45 00 eb d3 4c 01 e0 8b 00 <89> 45 00 eb c9 e8 2a 4c 55 ff 5b c7 45 00 ff ff ff ff b8 ea ff ff
[  244.444021] RSP: 0018:ffffb3e701157c88 EFLAGS: 00000286
[  244.444022] RAX: 00000000ffffffff RBX: 0000000000701000 RCX: 0000000000000ffc
[  244.444023] RDX: 00000000000000ff RSI: 0000000000000007 RDI: 0000000000000000
[  244.444023] RBP: ffffb3e701157cc4 R08: 0000000000000004 R09: ffffb3e701157cc4
[  244.444024] R10: ffffb3e701157b18 R11: 0000000000000007 R12: 0000000000000ffc
[  244.444024] R13: 0000000000001000 R14: 0000000000000004 R15: 0000000000000000
[  244.444024] FS:  0000000000000000(0000) GS:ffff93092f300000(0000) knlGS:0000000000000000
[  244.444025] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  244.444025] CR2: 00007f09bdb3e4e0 CR3: 000000044e810002 CR4: 0000000000770ee0
[  244.444026] PKRU: 55555554
[  244.444026] Call Trace:
[  244.444027]  <TASK>
[  244.444027]  pci_bus_read_config_dword+0x36/0x50
[  244.444029]  pci_find_next_ext_capability.part.0.cold+0x87/0x93
[  244.444030]  pci_save_vc_state+0x25/0x90
[  244.444031]  pci_save_state+0x106/0x280
[  244.444033]  nvme_reset_work+0x313/0x12a0 [nvme]
[  244.444036]  ? resched_curr+0x20/0xb0
[  244.444037]  ? check_preempt_curr+0x2f/0x70
[  244.444038]  ? ttwu_do_wakeup+0x17/0x160
[  244.444039]  ? _raw_spin_unlock_irqrestore+0x25/0x40
[  244.444040]  ? try_to_wake_up+0x84/0x570
[  244.444042]  process_one_work+0x1e5/0x3c0
[  244.444043]  worker_thread+0x50/0x3b0
[  244.444044]  ? rescuer_thread+0x370/0x370
[  244.444045]  kthread+0x169/0x190
[  244.444045]  ? set_kthread_struct+0x40/0x40
[  244.444046]  ret_from_fork+0x1f/0x30
[  244.444048]  </TASK>
[  245.203988] nvme nvme4: Removing after probe failure status: -19
[  245.303991] pcieport 0000:00:1c.4: AER: Uncorrected (Non-Fatal) error received: 0000:07:00.1
[  245.463974] pcieport 0000:00:1c.4: AER: Uncorrected (Non-Fatal) error received: 0000:07:00.1

Immediately after I run the script, a new device, nvme4, appears under /dev. After the log says Removing after probe failure status: -19, nvme4 disappears.

The Linux kernel I use is 5.16.12

Yiyuan-Dong commented 1 year ago

OK, it seems that the problem was all about the firmware. After I reported it to after-sales support, they provided me with the latest PM173X firmware, EPK9GB5Q. Now everything works fine with the latest firmware.