mwrnd / innova2_flex_xcku15p_notes

Nvidia/Mellanox Innova-2 Flex Open Programmable SmartNIC Setup and Usage Notes for XCKU15P FPGA Development
BSD 2-Clause "Simplified" License

ConnectX missing - should I flash? #8

Closed AGenchev closed 9 months ago

AGenchev commented 11 months ago

Hello, someone handed me a Lenovo-made Innova-2. It is missing the ConnectX device. Should I try unconventional methods to flash an image other than the one supplied by Lenovo so that development becomes possible? I think there is a problem with the card, but I can't diagnose it for now. Note: To install all required software versions, I set up a virtual machine where I added the 2 mlx PCI-E devices using pci-e passthrough. There is no third device on either the host or the guest system.

sudo flint -d /dev/mst/mt4119_pciconf0 q
Image type:            FS4
FW Version:            16.24.4020
FW Release Date:       3.1.2019
Product Version:       16.24.4020
Rom Info:              type=UEFI version=14.17.13 cpu=AMD64
                       type=PXE version=3.5.603 cpu=AMD64
Description:           UID                GuidsNumber
Base GUID:             aaaaaa0300aaaaa0        12
Base MAC:              aaaaaaaaaaa0            12
Image VSD:             N/A
Device VSD:            N/A
PSID:                  LNV0000000025
Security Attributes:   secure-fw
sudo ~/Innova_2_Flex_Open_18_12/app/innova2_flex_app -v
===============================================
 Verbosity:        1
 BOPE device:      None
 ConnectX device:  None
Cannot find appropriate ConnectX device
sudo ./mlxup 
Querying Mellanox devices firmware ...

Device #1:
----------

  Device Type:      ConnectX5
  Part Number:      SN37A48123_Ax
  Description:      ThinkSystem Mellanox Innova-2 ConnectX-5 FPGA 25GbE 2-port PCIe Adapter
  PSID:             LNV0000000025
  PCI Device Name:  /dev/mst/mt4119_pciconf1
  Base GUID:        aaaaaa0300aaaaa0
  Base MAC:         aaaaaaaaaaa0
  Versions:         Current        Available     
     FW             16.24.4020     N/A           
     PXE            3.5.0603       N/A           
     UEFI           14.17.0013     N/A           

  Status:           No matching image found
.... ( the same for device 2)

my final goal is to setup a device on the fpga (w bus-master DMA) which will generate test data and stream it to the host memory. I'm still trying to learn what is QDMA and how it's different from XDMA, but for now I don't have what to test on :-)

mwrnd commented 11 months ago

I set up a virtual machine where I added the 2 mlx PCI-E devices using pci-e passthrough.

Please try running the board directly. A ~240GB SSD is enough for a complete Ubuntu system with Vivado. The tools require access to the PCIe Bridges as well.

lspci -nn | grep "Mellanox\|Xilinx"
lspci -tv | grep "0000\|Mellanox\|Xilinx"
01:00.0 PCI bridge [0604]: Mellanox Technologies MT28800 Family [ConnectX-5 PCIe Bridge] [15b3:1974]
02:08.0 PCI bridge [0604]: Mellanox Technologies MT28800 Family [ConnectX-5 PCIe Bridge] [15b3:1974]
02:10.0 PCI bridge [0604]: Mellanox Technologies MT28800 Family [ConnectX-5 PCIe Bridge] [15b3:1974]
03:00.0 Memory controller [0580]: Xilinx Corporation Device [10ee:9038]
04:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
04:00.1 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]

-[0]-+-00.0 Intel Corporation Device 3e0f
     +-1d.0-[01-04]----00.0-[02-04]--+-08.0-[03]----00.0 Xilinx Corporation Device 9038
     |                               \-10.0-[04]--+-00.0 Mellanox Technologies MT27800 Family [ConnectX-5]
     |                                            \-00.1 Mellanox Technologies MT27800 Family [ConnectX-5]

lspci for Innova2
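The comparison against the listing above can be scripted. This is a minimal sketch, assuming the three vendor:device IDs shown in that output (15b3:1974 for the bridges, 15b3:1017 for the Ethernet controllers, 10ee:9038 for the FPGA with a user bitstream loaded). The function name check_innova2_ids is made up for illustration, and an FPGA running a different image may enumerate under another ID.

```shell
#!/bin/sh
# Hypothetical helper: reads `lspci -nn` output on stdin and reports
# which of the expected Innova-2 PCI IDs are present.
check_innova2_ids() {
    input=$(cat)
    status=0
    for id in 15b3:1974 15b3:1017 10ee:9038; do
        if printf '%s\n' "$input" | grep -q "\[$id\]"; then
            echo "found   $id"
        else
            echo "missing $id"
            status=1
        fi
    done
    return $status
}

# Usage on a real system:
#   lspci -nn | check_innova2_ids
```

A non-zero return value means at least one of the IDs did not show up, which is the situation described at the top of this issue.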

Unfortunately the easiest course of action is to update the ConnectX-5 firmware. In its current state the Innova2 Flex Application and Drivers will only communicate with PSID: MT_0000000158. Most of the tools compile from source so it may be possible to modify them to work but I do not know if the Lenovo ConnectX-5 firmware has any changes compared to the official Nvidia/Mellanox firmware.

Please save multiple copies of the current firmware and, just in case, test the network ports both before and after updating the firmware. In the commands below, 0?:00.0 stands for the PCIe Bus ID of the first Innova2 Ethernet Controller as reported by lspci.

sudo mstflint --device 0?:00.0  ri  innova2_CX5_FW_read1.bin
sudo mstflint --device 0?:00.0  ri  innova2_CX5_FW_read2.bin
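Two reads are only useful as a backup if they agree with each other. A minimal sketch of that check, using the file names from the commands above; the function name reads_match is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if both firmware readbacks are
# non-empty and byte-for-byte identical.
reads_match() {
    [ -s "$1" ] && [ -s "$2" ] && cmp -s "$1" "$2"
}

# Usage after the two mstflint reads above:
#   reads_match innova2_CX5_FW_read1.bin innova2_CX5_FW_read2.bin \
#       && echo "reads agree" || echo "reads differ, read again"
```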

The Lenovo Firmware Release for the Innova2(mlnx-lnvgy_fw_nic_in2-16.24.4020_linux_x86-64.bin) is an update program for Linux. The fw-ConnectX5-rel-16_24_4020-MNV303212A-ADL_Ax.bin Innova2 firmware from Innova_2_Flex_Open_18_12.tar.gz is the raw binary for the 16MB W25Q128JVS Flash IC on the board. md5sum and ls -l:

143ab6f92b6c20fc833c2243103e118b  mlnx-lnvgy_fw_nic_in2-16.24.4020_linux_x86-64.bin
27193323 Nov 27 16:10             mlnx-lnvgy_fw_nic_in2-16.24.4020_linux_x86-64.bin

b296e6a0b95c964c09c31d5e90242058  fw-ConnectX5-rel-16_24_4020-MNV303212A-ADL_Ax.bin
16777216 Jan  6  2019             fw-ConnectX5-rel-16_24_4020-MNV303212A-ADL_Ax.bin
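The 16777216-byte size in the listing is exactly 16 MiB, which matches the 128 Mbit capacity of the W25Q128JVS. A minimal sketch of a sanity check for a raw image before burning it; the function name is_full_flash_image is hypothetical, and the MD5 value in the usage comment is the one listed above:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the file is exactly 16 MiB
# (16777216 bytes), the capacity of the 128 Mbit W25Q128JVS.
is_full_flash_image() {
    size=$(wc -c < "$1") || return 1
    [ "$((size))" -eq $((16 * 1024 * 1024)) ]
}

# Usage:
#   is_full_flash_image fw-ConnectX5-rel-16_24_4020-MNV303212A-ADL_Ax.bin && echo "size ok"
#   echo "b296e6a0b95c964c09c31d5e90242058  fw-ConnectX5-rel-16_24_4020-MNV303212A-ADL_Ax.bin" | md5sum -c -
```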

Also, if you have a Xilinx-compatible JTAG adapter, save the current FPGA contents by running the following command in Vivado's Tcl Console (Vivado Lab will work as well and 2023.1 currently works):

readback_hw_device [current_hw_device]  -readback_file zu19eg_u25_r.rbd  -bin_file zu19eg_u25_b.bin

Vivado Hardware Manager

Hardware Manager AutoConnect

Hardware Manager Configuration Memory Readback

Please post the output of sudo lspci -vvvnnxxx for your board. Note the only personally identifying information is [SN] Serial number: and [V3] Vendor specific: which you should zero.

Thanks for posting @AGenchev, PSID: LNV0000000025 is new to me. It does appear you have a rebranded MNV303212A-ADLT so you should be fine to follow the innova2_flex_xcku15p_notes from beginning to end to start developing with the Innova2's FPGA.

goal is to setup a device on the fpga (w bus-master DMA) which will generate test data and stream it to the host memory.

Great goal. I would be very interested in your progress.

what is QDMA and how it's different from XDMA

QDMA has features for multi-user cloud computing like virtual functions and is intended for low-latency small-packet performance. XDMA is much simpler and designed for high-throughput bulk data transfer. Check out innova2_xdma_demo and qdma_2021_1.

AGenchev commented 10 months ago

Thank you so much for the invaluable information ! You really inspire me to go further. Finally I managed to obtain a Xilinx-DLC10 JTAG adapter, installed the recommended ubuntu, vitis+vivado & all the drivers. I'm not sure why I need all drivers if I'm not using the network card. My card looks not fully functional right out of the box, probably the firmware is not OK. It doesn't create network adapters in Linux:

[  121.336005] mlx5_core 0000:03:00.0: mlx5_function_setup:1236:(pid 182): Firmware over 120000 MS in pre-initializing state, aborting
[  121.336142] mlx5_core 0000:03:00.0: init_one:1812:(pid 182): mlx5_load_one failed with error code -16
[  121.336814] mlx5_core: probe of 0000:03:00.0 failed with error -16
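On Linux, error code -16 is -EBUSY ("Device or resource busy"). A minimal sketch for pulling failed mlx5_core probes out of a kernel log, assuming the message format shown above; the function name mlx5_probe_failures is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log text on stdin and prints the PCI
# address and error code of every failed mlx5_core probe.
mlx5_probe_failures() {
    sed -n 's/.*mlx5_core: probe of \([0-9a-f:.]*\) failed with error \(-[0-9]*\).*/\1 \2/p'
}

# Usage on a real system:
#   sudo dmesg | mlx5_probe_failures
```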

sudo lspci -vvvnnxxx output (https://gist.github.com/AGenchev/f466b047b2a9cd13bc1114208c3828d2#file-gistfile1-txt)

I forced recovery by shorting the flash pins. Flint refuses to flash: -E- PSID mismatch. The PSID on flash (LNV0000000025) differs from the PSID in the given image (MT_0000000158). I corrupted the firmware, hoping that this would help. Flint refuses with the same message: the address where the FW is corrupted is not where the version is stored, so its logic didn't change. I decided to bypass the flint logic - ran the command and after it started reading shorted DO. It failed to read and was ready to flash:

Current FW version on flash:  N/A
    New FW version:               16.24.4020
Burn process will not be failsafe.....

But, it failed me again: -E- Burning FS4 image failed: Cannot burn device data sections, Flash is write protected. So it seems I'm out of luck. Still, I can find/buy a W25Q128JVS and try replacing it on the PCB...

AGenchev commented 10 months ago

My next update: "A new hope". Here is the prepared bed for the new W25Q128JVS IC. Thank God, I didn't destroy the traces.

IMG_20231215_192346

After my imperfect soldering job:

IMG_20231215_194702

And the moment of truth shows it is more or less soldered:

~/Innova_2_Flex_Open_18_12/FW/Morse_FW$ sudo mstflint --nofs --use_image_ps --ignore_dev_data  --device 03:00.0  --image fw-ConnectX5-rel-16_24_4020-MNV303212A-ADL_Ax.bin  burn

    Current FW version on flash:  N/A
    New FW version:               16.24.4020

Burn process will not be failsafe. No checks will be performed.
ALL flash, including the device data sections will be overwritten.
If this process fails, computer may remain in an inoperable state.

 Do you want to continue ? (y/n) [n] : y
Burning FW image without signatures - OK  
Burning FW image without signatures - OK  
Restoring signature                     - OK
-I- To load new FW run reboot machine.

gele@gele-HP-WS:~/Innova_2_Flex_Open_18_12/FW/Morse_FW$ sudo flint --device /dev/mst/mt525_pciconf0 query
Image type:            FS4
FW Version:            16.24.4020
FW Release Date:       3.1.2019
Product Version:       rel-16_24_4020
Description:           UID                GuidsNumber
Base GUID:             N/A                     12
Base MAC:              N/A                     12
Image VSD:             N/A
Device VSD:            N/A
PSID:                  MT_0000000158
Security Attributes:   N/A

Now I have the preferred PSID (MT_0000000158). I flashed the GUIDs using the same device (without rebooting): sudo flint --device /dev/mst/mt525_pciconf0 -guid 0xababababaaa0 -mac 0xabaabab0 sg

After cold reboot, mlx drivers are still unhappy as they were before (mlx5_core hangs 2 times x3 minutes). And in the end gives me:

mlx5_core 0000:03:00.0: mlx5_function_setup:1236:(pid 178): Firmware over 120000 MS in pre-initializing state, aborting
[  121.296145] mlx5_core 0000:03:00.0: init_one:1812:(pid 178): mlx5_load_one failed with error code -16

Could it be because the ports don't have SFP modules attached ? Should I put in your example guids: sudo flint --device /dev/mst/mt4119_pciconf0 -guid 0xc0dec0dec0dec0de -mac 0xc0dec0dec0de sg Something prevents firmware from initializing ? The flex app also doesn't like what it sees:

sudo ~/Innova_2_Flex_Open_18_12/app/innova2_flex_app -v
===============================================
 Verbosity:        1
 BOPE device:      None
 ConnectX device:  None
Cannot find appropriate ConnectX device

HW manager (JTAG) also errors:

 [Labtools 27-2269] No devices detected on target localhost:3121/xilinx_tcf/Xilinx/00001e52190301.
Check cable connectivity and that the target board is powered up then
use the disconnect_hw_server and connect_hw_server to re-register this hardware target.

Ran mlxup, after the following I gave it perm. to update to 16.28.2006:

  Versions:         Current        Available     
     FW             16.24.4020     16.28.2006    
     PXE            N/A            3.6.0102      
     UEFI           N/A            14.21.0017    
  Status:           Update required

(Thoughts for if/after I manage to solve the issues: According your explanation, XDMA looks good for my case, but I'm still far from it (have to learn). I wonder what is xilinx's idea with providing me the xdma and qdma driver - to extract how it works from source and write my own or to reuse it somehow by writing only specific bits. QDMA "driver" seems too complex to rewrite or even understand. I only got that it has multiple queues and channels which run in "parallel" using scatter-gather. Will focus (later) on XDMA trying to get whether it can signal somehow to the driver using MSI for half-transfer complete or full transfer complete or something like that. Also PCI-E p2p transfers are very interesting topic.)

mwrnd commented 10 months ago

I'm not sure why I need all drivers if I'm not using the network card.

innova2_flex_app requires access to the first network interface to enable/disable JTAG and other features. The ConnectX-5 controls which FPGA bitstream to run.

how long the pins should be shorted

Until the driver for Recovery Mode is loaded. I fixed the notes to make that more clear.

Flint refuses to flash: -E- PSID mismatch. ... (LNV0000000025) differs ... (MT_0000000158)

Yes I recently tried the procedure on a board with a PSID different than MT_0000000158 and had a similar experience. I was eventually able to get it to work using mstflint's Write Block (wb) command.

I decided to bypass the flint logic - ran the command and after it started reading shorted DO. ... Flash is write protected.

This is very difficult to get right as flint reads from the same addresses multiple times. My guess is that when it was unable to read back data it had written it assumed the IC is write-protected.

After cold reboot, mlx drivers are still unhappy ... Firmware over 120000 MS in pre-initializing state, aborting

This one is new for me. It is my understanding that the W25Q128JVS gets loaded once at boot then resides in CX5 memory.

Ran mlxup, after the following I gave it perm. to update to 16.28.2006

The ConnectX-5 works well enough to update its own firmware which is good news.

What is the FBGA Code on your board's DDR4 ICs (D9TBK or D9WFR)?

Innova2_Variant_DDR4_Comparison

What is your cooling solution? The ICs get very hot very quickly which leads to all sorts of problems. Post a picture of your cooling setup.

Could it be because the ports don't have SFP modules attached?

That would just make the board hotter. I run mine with no SFP modules or cables and it works fine.

Should I put in your example guids: ... 0xc0dec0dec0dec0de ...

Did your board not include a label with the MAC IDs? Use the values you saw when you first ran flint query. However, this should not matter.

Description:           UID                GuidsNumber
Base GUID:             aaaaaa0300aaaaa0        12
Base MAC:              aaaaaaaaaaa0            12
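In the query output above, the Base GUID is the Base MAC with 0300 inserted after the first three bytes. A minimal sketch that applies the same pattern to a MAC read off a label; whether every board follows this pattern is an assumption, and the function name mac_to_guid is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: derives a GUID from a 12-hex-digit MAC by inserting
# "0300" after the first three bytes, mirroring the query output above.
mac_to_guid() {
    mac=$(printf '%s' "$1" | tr -d ':' | tr 'A-F' 'a-f')
    case "$mac" in
        *[!0-9a-f]*) return 1 ;;
    esac
    [ "${#mac}" -eq 12 ] || return 1
    first=$(printf '%s' "$mac" | cut -c1-6)
    last=$(printf '%s' "$mac" | cut -c7-12)
    printf '%s0300%s\n' "$first" "$last"
}

mac_to_guid aaaaaaaaaaa0
```

The resulting value would go after -guid and the plain MAC after -mac in the flint sg command.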

Something prevents firmware from initializing?

Yes. I suspect it will take a while to figure out why.

What is the motherboard and CPU you are using? Does it support at least PCIe 3.0?

The flex app also doesn't like what it sees:

Yes, without the ConnectX-5 Ethernet Controllers there is no way for the drivers needed by innova2_flex_app to load.

sudo mst start
cd ~/Innova_2_Flex_Open_18_12/driver/
sudo ./make_device
sudo insmod /usr/lib/modules/`uname -r`/updates/dkms/mlx5_fpga_tools.ko
lsmod | grep mlx
cd ~
sudo ~/Innova_2_Flex_Open_18_12/app/innova2_flex_app -v

What does lspci show?

sudo lspci -nn | grep "Mellanox\|Xilinx"
sudo lspci -tv | grep "0000\|Mellanox\|Xilinx"

lspci_view_of_innova2

According your explanation, XDMA looks good for my case, but I'm still far from it (have to learn)

I have been working on a Tutorial for XDMA and XDMA Communication. PCIe P2PDMA is a complex project I have not yet been able to get working.

AGenchev commented 10 months ago

I wrote in chronological order; "..Flash is write protected." IMHO was caused by write protection indeed, because if you saw my first flint query, it said at the bottom: Security Attributes: secure-fw. I thought they might have used the lock bits in the flash... I released the DO pin after flint agreed that there is nothing so it will flash it. It really failed to write there so I had to replace the flash with new.

What is the FBGA Code on your board's DDR4 ICs (D9TBK or D9WFR)?

It seems you have nailed it:

IMG_20231216_102303

It is for the "ADIT" variant.

What is your cooling solution?

I simply attached 2 main board north-bridge radiators which have springs, using an elastic rubber band to hold them in place and to provide pressure on the springs. Against the smaller one I put a blower fan, which provides enough airflow.

IMG_66

Yesterday I ran mlxup to update FW to 16.28.2006. After this, the drivers are loading, firmware initialized O.K. Today we have a very different picture:

sudo lspci -tv | grep "0000\|Mellanox\|Xilinx"

-+-[0000:7f]-+-08.0  Intel Corporation Xeon E5/Core i7 QPI Link 0
 \-[0000:00]-+-00.0  Intel Corporation Xeon E5/Core i7 DMI2
             +-01.0-[03-06]----00.0-[04-06]--+-08.0-[05]----00.0  Mellanox Technologies Innova-2 Flex Burn image
             |                               \-10.0-[06]--+-00.0  Mellanox Technologies MT27800 Family [ConnectX-5]
             |                                            \-00.1  Mellanox Technologies MT27800 Family [ConnectX-5]

sudo lspci -nn | grep "Mellanox\|Xilinx"

03:00.0 PCI bridge [0604]: Mellanox Technologies MT28800 Family [ConnectX-5 PCIe Bridge] [15b3:1974]
04:08.0 PCI bridge [0604]: Mellanox Technologies MT28800 Family [ConnectX-5 PCIe Bridge] [15b3:1974]
04:10.0 PCI bridge [0604]: Mellanox Technologies MT28800 Family [ConnectX-5 PCIe Bridge] [15b3:1974]
05:00.0 Class [2000]: Mellanox Technologies Innova-2 Flex Burn image [15b3:0264]
06:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
06:00.1 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]

The App could start, reads temperatures:

Your choice: 4
*** FPGA Temperature: 30 C
*** ConnectX Temperature: 53 C

With FPGA power increased to 10 (what was that - overclock?) it went to 80 C. All tests from innova2_flex_app fail (on my 4GB version). I think with 4GB it still has plenty of memory to experiment with. Only lack of documentation can be a culprit. So I will try to load your test for 8-bit DDR-4. For now, JTAG doesn't work for me, probably I don't try hard enough (should I replace the MOSFET), so I'm using these steps: disable JTAG and schedule the Flex image, reboot, program with innova2_flex_app, reboot, and then test your loaded bitstream.

sudo mst start
cd ~/Innova_2_Flex_Open_18_12/driver/
sudo ./make_device
sudo insmod /usr/lib/modules/`uname -r`/updates/dkms/mlx5_fpga_tools.ko
sudo ~/Innova_2_Flex_Open_18_12/app/innova2_flex_app -v \
-b innova2_xdma_demo_primary.bin,0                    \
-b innova2_xdma_demo_secondary.bin,1

...reboot..

lspci -d 10ee:
05:00.0 Memory controller: Xilinx Corporation Device 9038

Sending 8k data to M_AXI BRAM Controller Block succeeded first time, then fails, same for the read op. The read/write ops are very slow:

Avg time device /dev/xdma0_c2h_0, total time 278250040 nsec, avg_time = 278250048.000000, size = 8192, BW = 0.029441

but the check sums (of written and read 8k blocks) match. All subsequent operations on /dev/xdma0_h2c_0 end with "Unknown error 512" in xdma. Accesses to /dev/xdma0_user work as expected - the LED turns on and off.
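The BW figure in that log line can be reproduced from the size and total time, which suggests the tool reports MB/s (with 1 MB taken as 10^6 bytes). A minimal sketch; the function name bw_mb_per_s is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: bandwidth in MB/s from a byte count and a duration
# in nanoseconds (bytes per nanosecond equals GB/s, hence the factor 1000).
bw_mb_per_s() {
    awk -v bytes="$1" -v nsec="$2" 'BEGIN { printf "%.6f\n", bytes / nsec * 1000 }'
}

bw_mb_per_s 8192 278250040
```

With the values from the log this prints 0.029441, so the 8 KiB transfer ran at roughly 29 kB/s.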

mwrnd commented 10 months ago

Flash is write protected ... Security Attributes: secure-fw

That makes sense and thanks for pointing it out. I have not come across this.

With FPGA power increased to 10 (what was that - overclock?)

Likely. My guess is that the FPGA runs some large DSP-heavy project.

it went to 80 C

That still seems safe.

All tests from innova2_flex_app fail (on my 4GB version).

Most fail for me on the 8GB ADLT variant as well. Luckily the tests have no relevance to the usefulness of the Innova-2.

I think with 4GB it still has plenty of memory to experiment with.

Yes, 4GB is enough for everything you could try with these boards.

The 8GB ADLT has both SFP GbE interfaces connected to the ConnectX-5 with the intent of using PCIe P2PDMA for communication between the FPGA and Ethernet.

Innova-2 Overview

From the 4GB ADAT Product Brief(manualzz.com) it appears to have at least one direct FPGA SFP interface. This is good news but it means that the 8GB ADLT firmware is not designed for it. innova2_flex_app requires the 8GB ADLT firmware to be able to program the FPGA but it means the ConnectX-5 is not likely to be usable with the ADIT/ADAT variants. The 8GB ADLT firmware also chooses which FPGA Bitstream Image gets loaded; User, Flex, or Factory.

Innova-2 MNV303212A-ADAT Architecture

Issue #3 has more on the differences between the boards.

Does your board have any markings that indicate it is the ADAT or ADIT?

I think with 4GB it still has plenty of memory to experiment with. Only lack of documentation can be a culprit. I will try to load your test for 8-bit DDR-4.

Yes, the bitstream from test_adit_mt40a512m16 will test 512MB of the 4096MB. I tried a straightforward port of innova2_8gb_adlt_xdma_ddr4_demo to the MT40A512M16 (D9TBK) but it fails. I need to run more tests to figure out why.

Yesterday I ran mlxup to update FW to 16.28.2006. After this, the drivers are loading, firmware initialized O.K

Great news!

For now, JTAG doesn't work for me ... so I'm using these steps: disable JTAG and schedule the Flex image, reboot, program with innova2_flex_app, reboot, and then test your loaded bitstream.

Yes, that is the correct sequence to program a Configuration Flash Memory (MT25QU512 RW193 ICs) bitstream using innova2_flex_app. This is what gets loaded during system boot. It is best to disconnect or at least power off your JTAG adaptor when using innova2_flex_app.

Using JTAG to load a bitstream into the FPGA's Temporary SRAM Configuration Memory is useful for development but not required.

05:00.0 Class [2000]: Mellanox Technologies Innova-2 Flex Burn image [15b3:0264]

It looks like your FPGA Flash Configuration Memory already includes the latest Flex Image. What version does innova2_flex_app say it is at?

Flex Image Version

should I replace the MOSFET

No, your board seems to work. I had clear damage to my board.

For now, JTAG doesn't work for me ...

After 05:00.0 Memory controller: Xilinx Corporation Device 9038 shows up, run innova2_flex_app to Enable JTAG. Does that help? Are you able to use XSDB to communicate with the FPGA?

Accesses to /dev/xdma0_user work as expected - the LED turns on and off.

That is good news.

Sending 8k data to M_AXI BRAM Controller Block succeeded first time, then fails, same for the read op. The read/write ops are very slow: but the check sums (of written and read 8k blocks) match.

More good news, XDMA is working for at least some time. My first instinct is that something is overheating and therefore fails after some time. Do you have access to a Thermal Imaging Camera or Freeze Spray? The idea with Freeze Spray is that you spray the board and the hottest parts will return to normal fastest.

I notice you have an Nvidia Tesla in your system. Are you able to remove it and try testing again?

All subsequent operations on /dev/xdma0_h2c_0 end with "Unknown error 512" in xdma.

The driver cannot access the AXI Bus.

If removing the Tesla card makes no difference or you cannot, try updating dma_ip_drivers.

Uninstall dma_ip_drivers, then try the latest November 10, 2023, commit a93d4a4 version.

cd ~/dma_ip_drivers/XDMA/linux-kernel/xdma/
sudo make uninstall
cd ~
# Move the old source tree aside so the final mv renames the new tree instead of nesting it inside the old one
mv dma_ip_drivers dma_ip_drivers_old
wget https://codeload.github.com/Xilinx/dma_ip_drivers/zip/a93d4a4870e41d152b33aebb3f869eefb11aa691 -O dma_ip_drivers-a93d4a4.zip
unzip dma_ip_drivers-a93d4a4.zip
mv dma_ip_drivers-a93d4a4870e41d152b33aebb3f869eefb11aa691 dma_ip_drivers

cd ~/dma_ip_drivers/XDMA/linux-kernel/xdma/
make DEBUG=1
sudo make install

sudo depmod -a
sudo ldconfig

cd ~/dma_ip_drivers/XDMA/linux-kernel/tools
make

sudo reboot

Do you have access to a large ultrasonic cleaner, a well-ventilated space such as a balcony or garage, and 99% Isopropyl Alcohol? It has been my experience that old server equipment that is flaky can sometimes become usable after about a half hour of ultrasonic cleaning in Isopropyl Alcohol.

AGenchev commented 10 months ago

What version does innova2_flex_app say it is at?

*** FPGA image version: 0xc1

...but it means the ConnectX-5 is not likely to be usable with the ADIT/ADAT variants.

my goal is to use ConnectX-5 as .. just a PCI-E bridge to allow the FPGA on the PCI-E bus.

Yes, the bitstream from test_adit_mt40a512m16

I flashed it and succeeded in running your test at the point "I ran a sanity check and tried sending 1GB of data. Only 512MB transfers as expected."

sudo ./dma_from_device --verbose --device /dev/xdma0_c2h_0 --address 0x0 --size 1073741824 -f recv.raw  
dev /dev/xdma0_c2h_0, addr 0x0, aperture 0x0, size 0x40000000, offset 0x0, count 1
host buffer 0x40001000, 0x7fe0b4b6d000.
/dev/xdma0_c2h_0, read underflow 0x20000000/0x40000000 @ 0x0.
#0: underflow 536870912/1073741824.
#0: CLOCK_MONOTONIC 10.724906794 sec. read 536870912/1073741824 bytes

The results are repeatable, xdma works. The speed is ~ 50 MB/s when 512 MB of 1 GB are transferred which seems not too high. I didn't verify if the contents written/read match. Now (I guess) the unknown is how the memory is connected to the FPGA and how it is organized to properly implement the memory controller with 72-bit bus E.g. pin "constraints" are not known - am I right ? MT40A512M16 means that a chip is 512 Meg x 16. The address register is 19-bit (including bank address and bank group bit) and the data bus seems to be 16-bit per chip...
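The hexadecimal byte counts in the underflow message are easier to read as MiB. A minimal sketch using shell integer arithmetic; the function name to_mib is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: converts a byte count (decimal or 0x-prefixed hex)
# to MiB using shell integer arithmetic.
to_mib() {
    echo $(( $1 / 1048576 ))
}

to_mib 0x20000000   # bytes actually read
to_mib 0x40000000   # bytes requested
```

This prints 512 and 1024: the read stopped at the 512MB boundary of a 1GB request, matching the "Only 512MB transfers as expected" note quoted above.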

The JTAG doesn't run as expected. The last test I did was to enable JTAG on the application and then to run xsdb% targets but nothing shows up:

tcfchan#0
xsdb% after 7000
xsdb% targets
  1  whole scan chain (board power off)
xsdb% target 1
xsdb% fpga -state
No supported FPGA device found

It doesn't discover the FPGA. Can I reverse the connector

mwrnd commented 10 months ago

the bitstream from test_adit_mt40a512m16 I flashed it and succeeded in running your test ... I didn't verify if the contents written/read match.

Check that the data transfers correctly.
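One way to run that check is to write a known file to the card, read it back, and compare digests. A minimal sketch: the function name same_md5 and the file names are hypothetical, the device nodes and the 512MB size come from this thread, and it is an assumption that dma_to_device accepts the same flags as the dma_from_device call shown earlier.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if two files have the same MD5 digest.
same_md5() {
    a=$(md5sum < "$1" | cut -d' ' -f1)
    b=$(md5sum < "$2" | cut -d' ' -f1)
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage with the XDMA tools (not run here because it needs the hardware):
#   head -c 536870912 /dev/urandom > send.raw
#   sudo ./dma_to_device   --device /dev/xdma0_h2c_0 --address 0x0 --size 536870912 -f send.raw
#   sudo ./dma_from_device --device /dev/xdma0_c2h_0 --address 0x0 --size 536870912 -f recv.raw
#   same_md5 send.raw recv.raw && echo "data matches"
```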

More and more of the board is proving to work. Let's hope the success continues.

speed is ~ 50 MB/s when 512 MB

Please post the output of sudo lspci -nnvv for the Mellanox and Xilinx devices to confirm the links are all at the correct Width and Speed and whether there have been any RX and/or TX errors.
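The negotiated speed and width can be pulled out of that output directly. A minimal sketch, assuming the usual "LnkSta: Speed ..., Width ..." line format of lspci -vv; the function name link_status is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: reads `lspci -vv` output on stdin and prints the
# negotiated speed and width from every LnkSta line.
link_status() {
    sed -n 's/.*LnkSta:[[:space:]]*Speed \([^ ,]*\).*Width \(x[0-9]*\).*/\1 \2/p'
}

# Usage on a real system (vendor IDs from the lspci listings in this thread):
#   sudo lspci -vv -d 15b3: | link_status
#   sudo lspci -vv -d 10ee: | link_status
```

A PCIe 3.0 x8 link would show up as 8GT/s x8; a lower speed or narrower width on any hop would limit throughput.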

72-bit bus E.g. pin "constraints" are not known - am I right ?

Constraints for the 8GB ADLT are known. When I use the same constraints in innova2_8gb_adlt_xdma_ddr4_demo but change the Memory Part of the DDR4 IP Block to MT40A512M16LY-075 and load it into the 4GB ADIT I get incorrect data reads. When I then connect to the 4GB ADIT board via JTAG and access MIG_1 I get MIG Status: PASS.

If you have the time, try recreating innova2_8gb_adlt_xdma_ddr4_demo but change the DDR4 Block to use MT40A512M16LY-075 and the memory range to 4GB. It may be that my board has faulty memory ICs.

The JTAG doesn't run as expected.

Are you able to communicate with other Xilinx FPGAs using your debugger?

I am on my second 14-Pin JTAG cable. With everything powered off, check with a multimeter in continuity mode that all the JTAG signals connect between the Innova2 and your JTAG adapter and are not shorted to GND.

AGenchev commented 10 months ago

Thank you for the guidance! I will be able to do all this after 7th JAN 2024, because I am on vacation far from home till then w/o access to IT equipment.

mwrnd commented 10 months ago

I got DDR4 working on my Innova2 4GB ADIT board. The innova2_4gb_adit_xdma_ddr4_demo project includes bitstreams and instructions for testing.

My problem turned out to be a reset issue. When I tried porting the innova2_8gb_adlt_xdma_ddr4_demo project to the 4GB ADIT I also moved from Vivado 2023.1 to 2023.2 which changed how my resets worked. I carefully updated the reset network and it now works and should be more robust.

AGenchev commented 10 months ago

I got DDR4 working on my Innova2 4GB ADIT board. The innova2_4gb_adit_xdma_ddr4_demo project includes bitstreams and instructions for testing.

This is really good news ! Thank you so much ! I'll try it. I only found time to test the JTAG adapter vs another board (an old NETFPGA-1G-CML) and it worked. This means I have also kind of problematic JTAG port on the innova2.

I also moved from Vivado 2023.1 to 2023.2 ....

This means that version 2023.2 also works and generates working images. I have installed 2021. Next evening I'll test whether the JTAG connector pins aren't grounded and whether your bitstream works (even if the JTAG isn't working). Why I'm trying to get JTAG working - to skip reboot when I want to set a new image to the FPGA. If possible I'd like to be able to load a bitstream in the FPGA without flashing it on the flash memory.

Update: Your latest design works indeed beautifully on my board. The speed using dd to transfer full 4G of data reached 1.8 GB/s (the bus is PCI-E 3.0 x8). The data retention is OK, checksums written/read match.
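For context on the 1.8 GB/s figure, the raw line rate of a PCIe 3.0 link is 8 GT/s per lane with 128b/130b encoding. A minimal sketch of that upper bound, before any packet or protocol overhead is subtracted; the function name pcie3_gb_per_s is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: raw PCIe 3.0 line rate in GB/s for a given lane
# count: 8 GT/s per lane, 128b/130b encoding, 8 bits per byte.
pcie3_gb_per_s() {
    awk -v lanes="$1" 'BEGIN { printf "%.2f\n", 8 * lanes * 128 / 130 / 8 }'
}

pcie3_gb_per_s 8
```

For 8 lanes this prints 7.88, so the measured rate sits well below the link's ceiling and there is probably headroom left in the transfer setup.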

AGenchev commented 10 months ago

Now I'm trying to understand some basics: the block scheme of 4gb_adit_xdma_ddr4_demo looks rather complex, so it seems it does more than just being a XDMA to int.DDR4 memory controller. I see for the memory you set a different arbitration scheme (RD PRI REG) on the ADIT board while on ADLT. I have to read much more to get what is going on. I think to create a very basic block design using the DDR-4 guidelines from your demo, then to add to it more blocks trying to generate something into DDR4 and read it. Not sure if 2 devices try to access the DDR4 slave AXI at the same time what will happen... Ran into licensing problems "A valid license was not found for feature 'Synthesis' and/or device 'xcku15p'. Please run the Vivado License Manager for assistance in determining ...." It seems our device xcku15p doesn't come with free license so evaluation license is likely needed.

mwrnd commented 10 months ago

4gb_adit_xdma_ddr4_demo looks rather complex

Yes, I added various blocks to test other parts of the board.

I have installed [Vivado] 2021 ... basic block design using the DDR-4 guidelines from your demo

Yes, a much simpler design should work in Vivado 2021.2 to test DDR4:

innova2_4gb_adit_DDR4_Working_Block_Diagram

I plan to eventually update my innova2_flex_xcku15p_notes notes for Vivado 2023.2 and Ubuntu 22.04.

you set a different arbitration scheme (RD PRI REG)

The AXI Arbitration Scheme should not matter for such a simple design. For designs with multiple AXI blocks that use DDR4, Round-Robin guarantees low-latency access at the cost of throughput. To be honest I forgot to change it from the default Read Priority. Refer to the Memory IP Guide (PG150), page 143.

xcku15p doesn't come with free license

Yes, unfortunately not. You can use an evaluation license or the AWS Vivado 2023.2 AMI by the hour.

JTAG adapter vs another board (an old NETFPGA-1G-CML) and it worked. This means I have also kind of problematic JTAG port on the innova2.

The Innova-2 uses 1.8V for JTAG. Does your adaptor support that? I recommend one of the Officially-Supported JTAG Adaptors or the Waveshare Platform Cable clone. I am aware of inexpensive FT2232H-based JTAG SMT boards causing problems.

Why I'm trying to get JTAG working - to skip reboot when I want to set a new image to the FPGA.

Working JTAG will also allow you to use XSCT/XSDB to debug soft-core processors like the RISC-V or MicroBlaze.

AGenchev commented 9 months ago

I report JTAG success ! The JTAG adapters I tried are on the picture. IMG_20240109_230342e

First one I bought for US$17 on Aliexpress - it is a Chinese clone. The second one costs here $300 and I don't own it. Counterintuitively, the clone performs slightly faster. You were right - my JTAG was not broken. I don't know now the reason it didn't work - very likely my misunderstanding the usage of innova2_flex_app: When you set the FPGA image to boot, you need to reboot the computer. So probably I inferred that I need to enable the JTAG, then reboot the computer so it boots with JTAG enabled.

When the cheap DLC9 arrived (yesterday) I observed that it lacks a suitable connector, so I used the ribbon cable from the DLC10 (the 2.0mm ribbon costs more than the whole DLC9 clone with all its cables - crazy) with it just to test whether it works, then decided to experiment without rebooting the computer on the innova2. And boom - it worked. Then tested DLC10, it worked as well.

After Vivado2023.2 (a new web install) killed the OS (OOM) with just 2 processes on 32GB RAM, I installed and configured zram and it helped a lot. Now I'm implementing the design you gave me above. On the first attempt I forgot to set the address ranges. Then succeeded and of curiosity overclocked the DDR4 to 2666. retention still OK. ... It works:

IMG_20240109_230342e1

W/o your help I'd never get there. Shall i close this issue, because we solved everything and I went off-topic (still have 1M questions) ?

mwrnd commented 9 months ago

I report JTAG success !

Excellent news!

misunderstanding the usage of innova2_flex_app

Thanks for pointing that out. I have tried to improve my notes.

After Vivado2023.2 (a new web install) killed the OS (OOM) with just 2 processes on 32GB RAM

When you run top do you have a swap partition? Vivado somehow manages to always get into swap even if a system has plenty of memory. Try the following which makes swap (Virtual Memory) less likely to be used. It is issues like these that make me wary of updating my notes for the latest Vivado too soon.

sudo su
echo 0 > /proc/sys/vm/compaction_proactiveness
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo vm.compaction_proactiveness=0 >>/etc/sysctl.conf
sysctl -w vm.compaction_proactiveness=0
sysctl -w vm.extfrag_threshold=1000
exit

succeeded and of curiosity overclocked the DDR4 to 2666. retention still OK

Good to know. I have added a comment to the innova2_4gb_adit_xdma_ddr4_demo project.

Shall i close this issue, because we solved everything and I went off-topic

Yes, please do. Create new issues if you encounter any problems.

AGenchev commented 9 months ago

When you run top do you have a swap partition?

Of course, but I had only 1 GB, because even though I use a PMR server HDD with 256MB cache, 32G of swap would be awfully slow. Thanks for the kernel config ideas. Before trying them, I ran this script (for 32GB RAM):

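#!/bin/sh
# Turn off all active swap before reconfiguring
swapoff -a
# Load the compressed-RAM block device module
modprobe zram
# Size the zram device to match the 32GB of RAM and compress with zstd
zramctl /dev/zram0 --algorithm zstd --size 32G
mkswap  /dev/zram0
# Give zram a high priority so it is used before the disk swap partition
swapon --priority 100 /dev/zram0
# Re-enable the swap entries from /etc/fstab, then list what is active
swapon -a
swapon -s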

And top now shows:

MiB Mem :  32039,7 total,  16255,2 free,  11414,2 used,   4370,2 buff/cache
MiB Swap:  33744,0 total,  33744,0 free,      0,0 used.  18705,4 avail Mem

(Vivado is running at background)