mwrnd / innova2_flex_xcku15p_notes

Nvidia/Mellanox Innova-2 Flex Open Programmable SmartNIC Setup and Usage Notes for XCKU15P FPGA Development
BSD 2-Clause "Simplified" License

Instructions for flashing the firmware when lspci shows "MT28800 Family [ConnectX-5 Flash Recovery]" #2

Closed by yangl1996 11 months ago

yangl1996 commented 1 year ago

Thanks for such detailed documentation! I just started playing around a bit with the card. Mine arrived with no firmware, and lspci shows only one device: MT28800 Family [ConnectX-5 Flash Recovery]. The instructions in the notes did not work for me. Here's what I ended up doing:

  1. Use FreeBSD. (I tried Linux and it did not work: flint complains that it cannot find the device ID. The mstflint source code shows that the tool does not try to obtain the device ID on FreeBSD, so it never produces this error.)
  2. Install the mstflint port. (Nvidia does provide MFT tools on its website, but I found the open source version provided by FreeBSD ports is sufficient. If one does want to install Nvidia's version, one needs to ln -s /usr/local/bin/bash /bin/bash since Nvidia's version assumes that bash is present at /bin/bash which is not the case on FreeBSD.)
  3. Download the firmware.
  4. Run the burn command below, where pci0:3:0:0 is the PCI address of the Flash Recovery device:
     mstflint -nofs --use_image_ps --ignore_dev_data -d pci0:3:0:0 -i /root/Innova_2_Flex_Open_18_12/FW/Morse_FW/fw-ConnectX5-rel-16_24_4020-MNV303212A-ADL_Ax.bin burn
  5. Flash a random GUID and MAC.
  6. Power cycle the board by turning the host computer off and on. A restart does not work because it does not power cycle the board.
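
Steps 4 and 5 can be sketched as a dry run that only prints the commands, so they can be checked before touching the flash. The burn line is the one quoted in step 4. The `sg` line for step 5 is an assumption, since the thread does not give the exact command used, and the helper names (`rand_hex`, `burn_cmd`, `ids_cmd`) and the DEV and IMG defaults are placeholders for illustration.

```shell
#!/bin/sh
# Dry run of steps 4 and 5: the commands are printed, never executed.
# DEV and IMG are placeholders; substitute your own address and image path.
DEV="${DEV:-pci0:3:0:0}"
IMG="${IMG:-fw-ConnectX5-rel-16_24_4020-MNV303212A-ADL_Ax.bin}"

# Print N random bytes from /dev/urandom as lowercase hex.
rand_hex() {
    od -An -N"$1" -tx1 /dev/urandom | tr -d ' \n'
}

# Step 4: the burn command from the list above.
burn_cmd() {
    echo "mstflint -nofs --use_image_ps --ignore_dev_data -d $1 -i $2 burn"
}

# Step 5 (assumed form, not taken from the thread): set a random GUID and MAC.
# The MAC starts with 02 so it is a unicast, locally administered address.
ids_cmd() {
    echo "mstflint -d $1 -guid 0x$(rand_hex 8) -mac 0x02$(rand_hex 5) sg"
}

burn_cmd "$DEV" "$IMG"
ids_cmd "$DEV"
```

Once the printed commands look right, they can be copied and run by hand as root, followed by the full power cycle from step 6.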

The command is mostly copied from https://github.com/Tualua/labnotes/wiki/Mellanox-ConnectX-4-Lx-Firmware-recovery but I thought it might be useful to put it here for the benefit of folks like me who have never tried to recover a Mellanox NIC.

LMK if you would prefer that I submit a PR.

mwrnd commented 1 year ago

Doubly thanks yangl1996! I was unaware of mstflint's abilities. I recall trying various older versions of flint but none would allow the PSID mismatch. I have added Recovery Mode as an option for programming the Innova-2 ConnectX-5 firmware.

I got the board to enter Recovery Mode by shorting CLK to Vcc by accident. I believe DO to GND is safer.

[Image: Recovery Mode ConnectX-5 Firmware 25Q128 FLASH by Shorting Pins 2-4]

I am surprised by this as at one point I erased the 25Q128 FLASH on the Innova-2 and just the bridge showed up in lspci.

[Image: lspci results when FLASH IC fails]

@yangl1996 were you able to get anything from mst status or flint --device /dev/mst/mt525_pciconf0 query while the board was showing up as MT28800 Family [ConnectX-5 Flash Recovery]? I was hoping I could recreate your sequence of events.

yangl1996 commented 1 year ago

Thanks, Matthew, for updating the docs! It's very interesting that one can force the board into recovery without dedicated tools.

My card arrived with only the Flash Recovery device shown in lspci. It's curious that there is also a configuration where only a PCIe switch shows up. Maybe in that case even the Flash Recovery firmware is erased?

I used mlxfwmanager instead of mst status. Both mlxfwmanager and mst are present only in Nvidia's version of MFT, not in FreeBSD's mstflint port, and even then the mst status command is not available in the FreeBSD version of MFT. Instead, mlxfwmanager has a similar command that queries all Mellanox devices on the host. It was able to discover the device and print its PCI address, and it correctly detected the version of the chip (ConnectX-5). (That's how I learned the PCI address format that Mellanox tools expect. Once one knows the address format, it is no longer necessary to install MFT from Nvidia: one can use pciconf -lv or lspci to find the address and use mstflint from the FreeBSD port to flash the card.)
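
As a rough sketch of the pciconf route, the filter below pulls the selector (the pci0:3:0:0 form used in the original post) out of `pciconf -l` output for devices with the Mellanox vendor ID 0x15b3. The function name `mellanox_selectors` is made up for this example, the sample line is illustrative rather than captured from real hardware, and the two patterns assume the `vendor=` and older `chip=` output layouts of pciconf.

```shell
#!/bin/sh
# Read `pciconf -l` output on stdin and print the selector of every device
# whose vendor ID is 0x15b3 (Mellanox). Both the newer "vendor=0x15b3" and
# the older "chip=0x<device>15b3" layouts are matched.
mellanox_selectors() {
    awk '/vendor=0x15b3/ || /chip=0x....15b3/ {
        sel = $1
        sub(/^[^@]*@/, "", sel)   # drop the "driver@" prefix
        sub(/:$/, "", sel)        # drop the trailing colon
        print sel
    }'
}

# On the FreeBSD host this would be used as:
#   pciconf -l | mellanox_selectors
# The line below is an illustrative sample so the sketch runs anywhere.
printf 'none0@pci0:3:0:0:\tclass=0x020000 rev=0x00 hdr=0x00 vendor=0x15b3 device=0x020d\n' | mellanox_selectors
```

The printed selector is what would then be passed to mstflint with -d.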

mstflint -d <pci address of the flash recovery device> query successfully queries the device. However, most of the fields are N/A. I cannot recall exactly which fields, but at least Image type, FW Version, FW Release Date, MAC, and GUID. The PSID is also corrupted: maybe N/A, but definitely not MT_0000000158.
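
To record exactly which fields come back empty, the query output can be piped through a small filter. This is only a sketch: `na_fields` is a hypothetical helper, and it assumes the query prints one "Name: value" pair per line. The sample input is invented for illustration and is not real mstflint output.

```shell
#!/bin/sh
# Print the name of every field whose value contains N/A.
# Assumes one "Field name: value" pair per input line.
na_fields() {
    awk -F: '/N\/A/ { print $1 }'
}

# On the host this would be used as:
#   mstflint -d <pci address of the flash recovery device> query | na_fields
# Invented sample input so the sketch runs without hardware:
printf 'Image type:   N/A\nFW Version:   N/A\nExample Field: some value\n' | na_fields
```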

I made multiple failed attempts to flash the card until I found the article mentioned in my original post. The trick seems to be getting past the multiple safety checks that mstflint has for firmware flashing by disabling them. One needs to use an exact combination of options, and the error messages are not very helpful. (I tried to decipher the errors by going through the codebase, but it was too much work.) At some point I also tried booting Ubuntu and using the Linux versions of the MFT tools, but I encountered the device ID issue described in the original post. The issue is present in both flint (from Nvidia) and mstflint (installed from an Ubuntu package). The exact same command worked after I switched back to FreeBSD.

mwrnd commented 1 year ago

I tried erasing (0xFF) the 25Q128 FLASH IC as well as writing all 0x00 to it and in both cases it boots into Flash Recovery Mode. I have no idea how I previously got it to boot with just the one PCIe bridge device visible.

[Image: 25Q128 Write All 0x00]

mwrnd commented 1 year ago

I discovered it is possible to corrupt the firmware using mstflint and then program the ConnectX-5 without any problems.