Device error caused by reading memory through the FPGA device and cannot be recovered

sansure commented 1 year ago

Hello ufrisk, I used the vmm_example routine and made some minor modifications to only retrieve process information and loaded modules. However, I frequently encounter exceptions that cannot be recovered from. I have been searching for a long time but haven't been able to determine the root cause. Could you please help me investigate the cause and suggest possible solutions? I would greatly appreciate it.

Initialize the FPGA device using API functions: hVMM = VMMDLL_Initialize(5, (LPSTR[]) { "", "-printf", "-v", "-device", "rawudp://ip=192.168.1.10" }).
Retrieve information of all current processes using result = VMMDLL_ProcessGetInformationAll(hVMM, &pProcessInformationAll, &cProcessInformation). Read the loaded modules of the process shown in the figure.
To read the loaded modules, use dwPID = pProcessInformationAll[27].dwPID; result = VMMDLL_Map_GetModuleU(hVMM, dwPID, &pModuleMap, 0). Sometimes the following information is returned and sometimes it is not, but pModuleMap->cMap is always 0.

After running the program for a while, the following errors occur.

Error 1: (occurs infrequently)

Error 2: (occurs frequently)

When these errors occur, the program stops functioning. Exiting the test program and restarting it then fails to connect to the device. After this error, the only way to connect normally again is to restart both the computer and the board. However, the error recurs after a short time.
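
A side note on the hard-coded index in pProcessInformationAll[27]: the position of a process in the returned array is not guaranteed to stay the same between runs, so looking the process up by name is usually more robust. The sketch below shows the idea on a simplified stand-in struct; proc_info_t and find_pid_by_name are hypothetical names used only for illustration and are not part of vmmdll.h. MemProcFS itself exposes VMMDLL_PidGetFromName for the same purpose.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for one entry of the process
   information array; the real struct in vmmdll.h has many more fields. */
typedef struct {
    uint32_t dwPID;
    char     szName[16];
} proc_info_t;

/* Scan the array for a process whose short name matches `name`.
   Returns true and writes the PID to *pid_out on the first match,
   returns false (leaving *pid_out untouched) if nothing matches. */
bool find_pid_by_name(const proc_info_t *procs, size_t count,
                      const char *name, uint32_t *pid_out)
{
    for (size_t i = 0; i < count; i++) {
        if (strncmp(procs[i].szName, name, sizeof procs[i].szName) == 0) {
            *pid_out = procs[i].dwPID;
            return true;
        }
    }
    return false;
}
```

Checking the boolean result before calling VMMDLL_Map_GetModuleU also avoids querying modules for a PID that was never found.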

Question 1: What is the cause of this error, and how can it be recovered without restarting the computer?

Question 2: When the issue occurs, the "com_tx_prog_full" signal in the figure stays low and does not recover. I suspect that, for some unknown reason, the PCIe transaction does not end properly, so data is sent to the PC continuously; this causes access errors in the PC software, which can then neither read the data nor recover. Is this the cause? If so, is there a way to solve it in software? If not, can logic be added to the FPGA to terminate the data upload so that the software can restore functionality?

ufrisk commented 1 year ago

I should upgrade the firmware of the NeTV2 to v4.12. I'll add this to my todo list. This might help with some issues.

The main issue is probably that you have an AMD system or are trying to access your system over Thunderbolt, in which case you'd need to construct a "memory map" as described here

Is it working better with a memory map?

sansure commented 1 year ago

I'm glad to receive your response. The motherboard I'm currently using is an Intel motherboard, and the memory map detected by the expansion card is the same as the one obtained through software on the PC host. Therefore, I don't think the issue is caused by the memory map affecting access.

My assumption has been that network communication problems, or some other unknown cause, lead to data being lost on its way to the FPGA. The PCIe IP core then does not receive valid command data, operates abnormally, and keeps reporting data to the FIFO (it is also possible that the last data frame is lost).

I was wondering whether a solution like this is possible: detect the abnormal state through the "com_tx_prog_full" signal and use hardware logic to force the PCIe IP core to end the current data transfer. The software side could then resend new commands to restore normal operation without the device having to be reinserted.

ufrisk commented 1 year ago

yeah, it may be an issue with the PCIe core getting full. If that happens it often locks up. I changed that in the more recent versions so I drop packets before that happens now. But this change is not ported to the NeTV2 implementation.

It's something I'd have to look into doing. I'm like super busy now with another project, but I'll possibly find the time this weekend to look it over.

If you're up to it you could try to change this line https://github.com/ufrisk/pcileech-fpga/blob/80c37f1eff50072b6de15db24e2444f830ba2666/NeTV2/src/pcileech_pcie_a7.sv#L135C86-L135C86

from

.m_axis_rx_tready           ( tlp_rx.ready | ~dfifo_pcie.clk100_en  ),  // <-

to

.m_axis_rx_tready           ( 1'b1 ),  // <-
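
For readers less familiar with the handshake involved: tying m_axis_rx_tready to 1'b1 means the receive side always accepts data from the PCIe core, so data that the downstream logic cannot take is presumably discarded instead of stalling the core. The C sketch below is only a software analogy of that drop-on-full policy, not the FPGA logic itself; the FIFO depth and all names (drop_fifo_t, drop_fifo_push, drop_fifo_pop) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_CAPACITY 4u  /* hypothetical depth, chosen only for illustration */

typedef struct {
    uint32_t slots[FIFO_CAPACITY];
    size_t   head;     /* index of the oldest stored word */
    size_t   count;    /* number of words currently stored */
    size_t   dropped;  /* number of words discarded because the FIFO was full */
} drop_fifo_t;

void drop_fifo_init(drop_fifo_t *f)
{
    f->head = 0;
    f->count = 0;
    f->dropped = 0;
}

/* The producer is never blocked (analogous to holding "ready" high).
   Returns true if the word was stored, false if it was discarded. */
bool drop_fifo_push(drop_fifo_t *f, uint32_t word)
{
    if (f->count == FIFO_CAPACITY) {
        f->dropped++;           /* full: discard instead of stalling */
        return false;
    }
    f->slots[(f->head + f->count) % FIFO_CAPACITY] = word;
    f->count++;
    return true;
}

/* Remove the oldest word. Returns false if the FIFO is empty. */
bool drop_fifo_pop(drop_fifo_t *f, uint32_t *out)
{
    if (f->count == 0) {
        return false;
    }
    *out = f->slots[f->head];
    f->head = (f->head + 1) % FIFO_CAPACITY;
    f->count--;
    return true;
}
```

The trade-off is that the consumer may see incomplete data after an overflow, but the producer can never be wedged by a full queue, which is the failure mode described above.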

sansure commented 1 year ago

I referred to the code on git and modified it based on the corresponding part in EnigmaX1, which also includes the code you sent above. But the test result is that the device initialization cannot be completed.

The feedback from the device is as follows:

[CORE] Initialization Failed. Unable to locate valid DTB. #2
VmmProc: Unable to auto-identify operating system for PROC file system mount.
Specify PageDirectoryBase (DTB/CR3) in -cr3 option if value if known.
[CORE] Failed to initialize.

sansure commented 1 year ago

After re-initializing, it was found that the device had already failed abnormally, and the feedback result is as follows:

DEVICE: FPGA: ERROR: Unable to connect to FPGA device [0,v0.0,0000]
MemProcFS: Failed to connect to memory acquisition device.

ufrisk commented 1 year ago

I've looked into this and I was able to replicate your issue.

I've updated the NeTV2 bitstream to v4.12. It should resolve the need to power cycle the device on failure but it won't resolve the root issue.

The root issue is that the UDP-based protocol is extremely basic. When packets arrive out-of-order or get lost it will mess things up severely. In my current setup it works fine if I run it on a dedicated link or via one switch. If I however route the packets or involve more switches it will mess things up.
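
To illustrate why loss and reordering matter, one common way for a receiver to at least detect them is to tag every datagram with a sequence number and check it on arrival. The sketch below shows that check in C. It is an illustration under assumptions, not a description of the NeTV2 rawudp protocol: the sequence field and the names seq_tracker_t and seq_tracker_observe are hypothetical.

```c
#include <stdint.h>

typedef enum {
    PKT_IN_ORDER,  /* exactly the sequence number that was expected */
    PKT_GAP,       /* ahead of the expected number: packets were skipped */
    PKT_STALE      /* behind the expected number: late or duplicate */
} pkt_status_t;

typedef struct {
    uint32_t next_expected;  /* sequence number the next packet should carry */
    uint32_t lost;           /* count of sequence numbers skipped so far */
    uint32_t stale;          /* count of late or duplicate packets seen */
} seq_tracker_t;

void seq_tracker_init(seq_tracker_t *t)
{
    t->next_expected = 0;
    t->lost = 0;
    t->stale = 0;
}

/* Classify one received sequence number and update the counters. */
pkt_status_t seq_tracker_observe(seq_tracker_t *t, uint32_t seq)
{
    /* Unsigned subtraction wraps modulo 2^32, so this distance stays
       meaningful when the sequence counter rolls over. */
    uint32_t ahead = seq - t->next_expected;

    if (ahead == 0) {
        t->next_expected = seq + 1;
        return PKT_IN_ORDER;
    }
    if (ahead < 0x80000000u) {
        /* Packet is from the future: everything in between is missing
           (or will show up later as stale). */
        t->lost += ahead;
        t->next_expected = seq + 1;
        return PKT_GAP;
    }
    t->stale++;
    return PKT_STALE;
}
```

Detection alone does not repair anything; the receiver would still need to request retransmission or reset the exchange, which is essentially what moving to TCP would provide.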

Since I first created the NeTV2 bitstream, the IP core I rely on for the network communication has had a new sister version released which supports a basic TCP server instead of a UDP server. Moving from UDP to TCP would most likely resolve the issue. I'm quite busy with other projects though, and the NeTV2 isn't a top priority for me, so I don't expect to test it and add support for it in the free open version here on Github due to the work involved.

But please let me know if the new version 4.12 will work together with a direct connection for you.

sansure commented 1 year ago

Thank you very much for your help. Over the weekend, I also referred to other projects' code on Git and modified mine accordingly. After testing, the issue that required a system reset to recover seems to be resolved for now. I also compared your updates with my modifications and they appear to be the same, so the new version should resolve the corresponding issues. Thank you again for your assistance!

ufrisk / MemProcFS, issue #207