linux-nvme / nvme-cli

NVMe management command line interface.
https://nvmexpress.org
GNU General Public License v2.0

Fixed the SR-IOV fault of PM1733/PM1735. #1126

Closed daiaji closed 2 years ago

daiaji commented 3 years ago
nvme list-ctrl /dev/nvme0 -n2
num of ctrls present: 1
[   0]:0x41

nvme virt-mgmt /dev/nvme0n2 -c 0x0001 -r0 -n2 -a8
success, Number of Controller Resources Modified (NRM):0x2

nvme virt-mgmt /dev/nvme0n2 -c 0x0001 -r1 -n2 -a8
success, Number of Controller Resources Modified (NRM):0x2

nvme virt-mgmt /dev/nvme0n2 -c 0x0001 -a9
NVMe status: INVALID_CTRL_ID: An invalid Controller Identifier was specified.(0x11f)

nvme list-secondary /dev/nvme0n2
Identify Secondary Controller List:
   NUMID       : Number of Identifiers           : 32
   SCEntry[0  ]:
................
     SCID      : Secondary Controller Identifier : 0x0001
     PCID      : Primary Controller Identifier   : 0x0041
     SCS       : Secondary Controller State      : 0x0000 (Offline)
     VFN       : Virtual Function Number         : 0x0001
     NVQ       : Num VQ Flex Resources Assigned  : 0x0002
     NVI       : Num VI Flex Resources Assigned  : 0x0002
   SCEntry[1  ]:
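
For readers puzzling over the `(0x11f)` at the end of the error line: the number can be split into sub-fields with plain shell arithmetic. This is only a sketch. It assumes the value nvme-cli prints follows the Linux kernel convention (Status Code in bits 7:0, Status Code Type in bits 10:8, More in bit 13, Do Not Retry in bit 14), and `decode_nvme_status` is a made-up helper name, not part of nvme-cli.

```bash
# Split an NVMe status value (as printed by nvme-cli) into its sub-fields.
# Assumed layout: bits 7:0 = Status Code, bits 10:8 = Status Code Type,
# bit 13 = More, bit 14 = Do Not Retry.
decode_nvme_status() {
    status=$(($1))
    printf 'sct=0x%x sc=0x%02x more=%d dnr=%d\n' \
        $(( (status >> 8) & 0x7 )) \
        $(( status & 0xff )) \
        $(( (status >> 13) & 1 )) \
        $(( (status >> 14) & 1 ))
}

decode_nvme_status 0x11f
```

Under that assumption, 0x11f decodes to Status Code Type 1 (command specific) with Status Code 0x1f, which matches the INVALID_CTRL_ID text in the message.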
daiaji commented 1 year ago

Does this version of firmware work for PM1733? I bought a second-hand PM1733, and I doubt Samsung's customer service will deal with me.


I noticed that my firmware version is EPK9AB5Q and the new firmware version is EPK9GB5Q, which suggests that this firmware also works on my PM1733. Could you please publish the firmware here?

Yiyuan-Dong commented 1 year ago

@daiaji I think it should work, since the full name of the firmware is General_PM1733_EVT0_EPK9GB5Q.bin

I'd like to put the firmware file here. I think Samsung would also be glad if more people could successfully use the new features of their products.

General_PM1733_EVT0_EPK9GB5Q.zip

To activate the firmware:

sudo nvme reset /dev/nvme2
sudo nvme fw-download /dev/nvme2 --fw=General_PM1733_EVT0_EPK9GB5Q.bin
sudo nvme fw-commit /dev/nvme2 -s 0 -a 1
sudo nvme reset /dev/nvme2
sudo nvme id-ctrl /dev/nvme2 | grep fr

But it would be better if you check the documentation of nvme fw-commit and nvme fw-download first.
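
A cautious way to follow that advice is to print the sequence first and read it through before running anything. The sketch below only echoes commands and never touches the drive. `fw_update_plan` is a hypothetical helper name, and the extra `nvme fw-log` step (to look at the firmware slots beforehand) is my addition, not part of the steps above. Per the nvme-cli documentation, commit action 1 replaces the image in the slot and activates it at the next controller reset.

```bash
# Print (do not run) the firmware update sequence for one controller.
# Arguments: controller device, firmware image file.
fw_update_plan() {
    dev=$1
    image=$2
    # Inspect the firmware slots and the active revision first.
    echo "nvme fw-log $dev"
    # Transfer the image to the controller.
    echo "nvme fw-download $dev --fw=$image"
    # Slot 0: the controller picks the slot. Action 1: activate at next reset.
    echo "nvme fw-commit $dev --slot=0 --action=1"
    echo "nvme reset $dev"
    # Confirm the new firmware revision.
    echo "nvme id-ctrl $dev | grep -w fr"
}

fw_update_plan /dev/nvme2 General_PM1733_EVT0_EPK9GB5Q.bin
```

Once the printed plan looks right for your device node, the lines can be run by hand with sudo.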

daiaji commented 1 year ago

[image]

After updating the firmware, the glitch was fixed.

Since the namespace attached to the secondary controller is hidden from the host, does that mean that even if there is an ESP partition in this namespace, it will not be displayed in the BIOS boot menu?

Now that the SR-IOV glitch is fixed, maybe I should buy a 7.68T PM1733?🤔

Because of the Chia mining crash, these high-capacity SSDs are not expensive. But it seems that libvirt will not work when I use SR-IOV.

PS: It seems that SR-IOV on SSDs is a rarely used feature, otherwise this fault would have been fixed earlier.

It looks like it wasn't fixed until firmware version EPK9CB5Q.

@piotrekz79 How about trying to update the firmware? Your firmware version is EPK98B5Q and looks too old.

daiaji commented 1 year ago

@Yiyuan-Dong @keithbusch @igaw The NVMe specification mentions that NVMe has 7 firmware slots. Isn't firmware slot 0 read-only?

What happens when I switch to a slot where firmware is not present?

Can I roll back firmware by switching firmware slots? I remember a read-only slot that saved the old firmware from the factory.

Yiyuan-Dong commented 1 year ago

@daiaji Firmware slot 0 means the controller chooses the slot.

from nvme-cli/Documentation/nvme-fw-commit.txt

Firmware Slot: Specifies the firmware slot that shall be used for the Commit Action, if applicable. If the value specified is 0h, then the controller shall choose the firmware slot (slot 1 – 7) to use for the operation.
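
On the read-only slot and rollback questions, the drive itself reports the relevant facts: the `frmw` field of `nvme id-ctrl` and the `afi` field of `nvme fw-log`. A minimal sketch for decoding both, assuming the bit layout as I read it in the NVMe base specification (FRMW bit 0 = slot 1 is read-only, bits 3:1 = number of slots, bit 4 = activation without reset; AFI bits 2:0 = active slot, bits 6:4 = slot to be activated at the next reset). The function names are hypothetical and the sample values are invented, not read from a PM1733.

```bash
# Decode the FRMW byte from "nvme id-ctrl".
decode_frmw() {
    frmw=$(($1))
    printf 'slots=%d slot1_read_only=%d activate_without_reset=%d\n' \
        $(( (frmw >> 1) & 0x7 )) \
        $(( frmw & 1 )) \
        $(( (frmw >> 4) & 1 ))
}

# Decode the AFI byte from "nvme fw-log".
decode_afi() {
    afi=$(($1))
    printf 'active_slot=%d next_reset_slot=%d\n' \
        $(( afi & 0x7 )) \
        $(( (afi >> 4) & 0x7 ))
}

# Invented sample values, for illustration only.
decode_frmw 0x17
decode_afi 0x01
```

If that reading is right, it is slot 1, not slot 0, that can be read-only, and whether an older image is still around to roll back to depends on what the other slots hold in the `fw-log` output.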

daiaji commented 1 year ago

@keithbusch @igaw Basically, I can't know which controller my namespace is associated with. Is there a way to query the number of the controller associated with the current namespace?

daiaji commented 1 year ago

local cid=$(nvme list-secondary $nvme_dev -o json | jq '."secondary-controllers"[]|select(."virtual-function-number"=='$(($2 + 1))')."secondary-controller-identifier"')

What is the $2 variable?
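
In shell, `$2` is simply the second positional argument of the script or function the line sits in. My guess, and it is only a guess since the surrounding function is not shown here, is that the line lives in a function called with the device as `$1` and a zero-based VF index as `$2`. A stripped-down illustration with a hypothetical function name:

```bash
# Hypothetical wrapper: $1 is the NVMe device, $2 a zero-based VF index.
vf_number_for() {
    nvme_dev=$1
    vf_index=$2
    # Virtual Function Numbers in the list-secondary output start at 1,
    # hence the "+ 1" in the quoted line.
    echo "$nvme_dev vfn=$((vf_index + 1))"
}

vf_number_for /dev/nvme0 0
```

With index 0 this selects the entry whose Virtual Function Number is 1, which is the first SCEntry in the list-secondary output at the top of the thread.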

igaw commented 1 year ago

I've never worked with this part of the nvme spec so far. So I can't really say a lot. If I understand your problem correctly, it is not possible to figure out the mapping between the IDs exposed to userspace (as the kernel uses its own IDs) and the hardware/firmware IDs?
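
One partial answer, offered as a sketch rather than a definitive method: the kernel exposes each controller's ID in sysfs as `/sys/class/nvme/nvmeX/cntlid`, so the kernel's nvmeX names can be lined up with the controller IDs that `nvme list-ctrl /dev/nvme0 -n2` reports for a namespace. `list_cntlids` is a hypothetical helper name. As far as I know the sysfs value is printed in decimal, so the 0x41 from the first post would show up as 65.

```bash
# List each NVMe controller node together with its controller ID.
# The sysfs root is a parameter so the function can be tried on a fake tree.
list_cntlids() {
    root=${1:-/sys/class/nvme}
    for dir in "$root"/nvme*; do
        # Skip entries without a readable cntlid attribute.
        [ -r "$dir/cntlid" ] || continue
        printf '%s cntlid=%s\n' "${dir##*/}" "$(cat "$dir/cntlid")"
    done
}

list_cntlids
```

This only covers controllers the host kernel has bound a driver to; a secondary controller passed through to a guest will not appear in the host's list.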

daiaji commented 1 year ago

@piotrekz79 @Yiyuan-Dong I noticed that the OP of my PM1733 is very large, so I can only use 6.4T. Is there a way for DC_Toolkit to modify the LBA?

iyanucodes commented 1 year ago

@Yiyuan-Dong hey do you by chance have access to the latest PM1733A firmware? MPPA5B5Q or MPPA3B5Q. I have these drives and I have been finding it extremely hard to get updates for them

Yiyuan-Dong commented 1 year ago

> @Yiyuan-Dong hey do you by chance have access to the latest PM1733A firmware? MPPA5B5Q or MPPA3B5Q. I have these drives and I have been finding it extremely hard to get updates for them

I'm sorry, but the firmware I have was given to me by someone else, and I don't have access to other versions of the firmware.

iyanucodes commented 1 year ago

@Yiyuan-Dong thanks for your quick response. Could this person by chance have access to other firmware?

Yiyuan-Dong commented 1 year ago

@iyanucodes That person is the after-sales service for the SSD, and he should have the corresponding version of the firmware. I have a feeling that he deals with similar issues frequently.

iyanucodes commented 1 year ago

@Yiyuan-Dong If there's any way that you could please reach out to them, I will be forever grateful.

nguoido commented 1 year ago

I have a question. Can I use SR-IOV on Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983?

[root@fedora ~]# nvme list-secondary /dev/nvme2
NVMe status: Invalid Field in Command: A reserved coded value or an unsupported value in a defined field(0x2002)

daiaji commented 1 year ago

> I have a question. Can I use SR-IOV on Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983?
>
> [root@fedora ~]# nvme list-secondary /dev/nvme2
> NVMe status: Invalid Field in Command: A reserved coded value or an unsupported value in a defined field(0x2002)

SR-IOV is generally an expensive feature, even in enterprise-class solutions, so only the PM1725 and PM1733/PM1735 have it.
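
Rather than going by model name, you can ask the drive. As far as I understand the NVMe specification, bit 7 of the `oacs` field in `nvme id-ctrl` says whether the Virtualization Management command (the one behind `nvme virt-mgmt`) is supported. A minimal sketch, with a hypothetical function name and invented sample values:

```bash
# Return success if OACS bit 7 (Virtualization Management support) is set.
supports_virt_mgmt() {
    [ $(( ($1 >> 7) & 1 )) -eq 1 ]
}

# Invented sample OACS values, for illustration only.
for oacs in 0x17 0xdf; do
    if supports_virt_mgmt "$oacs"; then
        echo "$oacs: virt-mgmt supported"
    else
        echo "$oacs: virt-mgmt not supported"
    fi
done
```

On the PCI side, a device without the SR-IOV capability has no `sriov_totalvfs` file under `/sys/class/nvme/nvmeX/device/`, which is another quick check.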

nguoido commented 7 months ago

General_PM1733_EVT0_EPK9GB5Q.bin

I have a question for @Yiyuan-Dong. Where did you find "General_PM1733_EVT0_EPK9GB5Q.bin"?

Yiyuan-Dong commented 7 months ago

> General_PM1733_EVT0_EPK9GB5Q.bin
>
> I have a question for @Yiyuan-Dong . Where do you find "General_PM1733_EVT0_EPK9GB5Q.bin"?

I reported my problem to the after-sales staff, and they sent me the new firmware.

nguoido commented 7 months ago

Hi @iyanucodes Can you use SRIOV nvme with your devices (fw: MPPA5B5Q or MPPA3B5Q)?

roolebo commented 3 months ago

@Yiyuan-Dong @0xabu Thanks for your research on the topic. I noticed a few subtle issues in some of the messages, and the provided examples did not really work for me on Ubuntu 24.04 (kernel 6.8). I consolidated VF population into a single script that's (hopefully) easy to use: https://gist.github.com/roolebo/32ffdbdede0f3c5ada949973ec195a15

One thing that was not mentioned anywhere in the posts is that an FLR (Function-Level Reset) is mandatory for a VF before moving it to the Online state. Otherwise it causes the issues observed by @0xabu:

> I found that you need to enable all 32 VFs (basically, cat sriov_totalvfs > sriov_numvfs). If you enable fewer, then the nvme virt-mgmt ... -a 9 command always fails to bring the secondary controller online. In the process of futzing around, I also got the controller into an unhappy state that was only resolved after a whole-system reboot, so maybe try that too if you haven't already.

It's not mandatory to create 32 VFs. Any number of VFs from 1 to 32 works.
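
To make the ordering concrete, here is a dry-run sketch that only prints the commands for one VF. It is not the linked script. The helper name and argument list are made up, the exact order of steps is my assumption based on the description above, and it further assumes that writing 1 to the VF's sysfs `reset` file triggers a function-level reset on this device. The virt-mgmt flags are the ones used at the top of the thread (`-r 0` and `-r 1` with `-a 8` to assign the two flexible resource types, `-a 9` to bring the controller online).

```bash
# Print (do not run) a bring-up sequence for one VF of a primary controller.
# Arguments: controller node (e.g. nvme0), number of VFs to enable,
# secondary controller ID, its Virtual Function Number, resources per type.
vf_bringup_plan() {
    ctrl=$1 numvfs=$2 scid=$3 vfn=$4 nres=$5
    pf=/sys/class/nvme/$ctrl/device
    # Enable the requested number of VFs on the physical function.
    echo "echo $numvfs > $pf/sriov_numvfs"
    # Assign both flexible resource types to the secondary controller.
    echo "nvme virt-mgmt /dev/$ctrl -c $scid -r 0 -n $nres -a 8"
    echo "nvme virt-mgmt /dev/$ctrl -c $scid -r 1 -n $nres -a 8"
    # Reset the VF before bringing it online (virtfn links are zero-based).
    echo "echo 1 > $pf/virtfn$((vfn - 1))/reset"
    # Move the secondary controller to the Online state.
    echo "nvme virt-mgmt /dev/$ctrl -c $scid -a 9"
}

vf_bringup_plan nvme0 4 1 1 2
```

Printing the plan first makes it easy to compare against the gist before running anything as root.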

andrradar commented 6 days ago

Has anyone had any luck making SR-IOV available in vSphere 8 for PM1733?