Invalid and Duplicate vring desc indices from avail

workzb commented 3 years ago

There appears to be a built in assumption when creating the virtuequeue structure that the first index to be read from the vring should be 0. vq->available_idx get's zero initialized which causes it to look at avail->ring[0] for the first read. This assumption breaks down if any rpmsg transactions have occurred previously.

The issue I ran into is as follows:

load a dummy m4 rpmsg application with uboot. (pingpong_bm.bin)
boot Linux and load the rpmsg_char driver
use m4fwloader to load the desired m4 application that uses rpmsg
grab 3 tx rpmsg buffer pointers and fill with data
run until there is a duplicate buffer returned from rpmsg_lite_alloc_tx_buffer()

In the example above the pingpong_bm application on the m4 sends one message to the A7 when it comes online and waits for the driver to send it a value. The A7 has already filled the tvq->avail ring with 256 messages so the idx value stored in the M4 TX avail ring is 256. When the pingpong_bm sends a message it causes the A7 to processes the buffer and put it back in the avail ring as expected at the 0 index and advances the avail->idx to 257. This is okay so far and correct behavior.

The issue occurs when another rpmsg application is loaded onto the m4 and starts getting tx buffers. Because the vq->available_idx is zero initialized the new application looks in avail->ring[0] instead of avail->ring[1]. The system should be looking at 1 because 1 message has already been sent and processed so the "read pointer" for the avail ring has been advanced. This information is lost when the new program is loaded. When the new program sends the message the A7 side processes it and puts the idx back into the avail ring. The avail->ring now has duplicate desc indices. I.E.

avail->ring[0] = 0
avail->ring[1] = 0

Now when the m4 application grabs the next tx buffer it gets desc index 0 again and begins to fill that with data. This can cause a number of problems. The M4 can modify data while the A7 is attempting to read and process the buffer. If the M4 requested multiple tx buffers it could be clobbering data unknowingly as it has two pointers to the same location. This issue also fails silently, causing data corruption and race condition issues.

This issue may be larger than just rpmsg-lite, it might be a flaw in the virtio design itself. It seems like there is no shared memory location that tracks the location of the consumer/reader pointer, there is only idx which tracks the producer/writer pointer.

Proposed Solutions: Either

Add documentation that states the m4 application using rpmsg-lite must be the first application loaded and cannot be reloaded after a message has been sent. OR
Modify the initialization of the available_idx value in the virtqueue structure so that it doesn't always start at 0 and somehow accounts for current idx value stored in the avail ring. OR
Update the vring structures so that they store the reader idx as well as the writer idx.

MichalPrincNXP commented 3 years ago

Hello @workzb this is a known weak point of Linux RPMsg vs. RPMsg_Lite communication. Cultivated approach is to re-initialize the rpmsg_lite and rpmsg driver each time new m4 app. is started and the rpmsg-based communication is "restarted". From the Linux point of view it is better to use the remoteproc component to load a new m4 app. and to handle all used resources gracefully.

workzb commented 3 years ago

Thanks for letting me know @MichalPrincNXP.

In all of the various forum posts I read related to heterogenous multicore NXP chips I didn't come across that this was a known issue. Maybe I overlooked this when reading the rpmsg-lite documentation but if it's not documented there I would suggest adding it as a warning or something to be aware of. I wasn't aware this was an issue and spent almost a month troubleshooting off and on chasing memory corruption and race condition issues, I just want to prevent someone else running into the same issues that I did.

From what I have seen in the RPMsg code it looks like it suffers from the same issue as the RPMsg-Lite code in that it assumes the read index starts at 0 in the avail_ring. It seems like a design issue with the virtio subsystem to always assume that both sizes initialize at the same time or to not allow one side to reset without restarting the whole vring communication channel. Is there some clean way to get the vring virtuqueue system to reinitialize when one side of the connection resets?

If the answer is I have to unload the entire virtio or virtio_rpmsg_bus subsystem every time I want to reprogram the M4 and have it use the rpmsg that seems a bit overkill to me. It would also require me to reconfigure and rebuild the kernel to make those kernels modular instead of built in. I might be misunderstanding your advice here.

I've been using the m4fwloader application from NXP is setting up remoteproc the better method?

MichalPrincNXP commented 3 years ago

Hello @workzb , I am sorry for late response. To be honest, I am not involved in building of Linux RPMsg vs. RPMsg_Lite type of applications. However, from what I know and what is stated in https://github.com/NXPmicro/imx-m4fwloader using imx-m4fwloader is recommended for debugging purposes only. Most clean way of loading/starting the secondary core app. is to use the Linux remoteproc component. Once loading the secondary core app. using the remoteproc all used resources are handled correctly, and also once decommissioning the secondary core app. all resources are shutdown correctly, incl. rpmsg vrings. I would recommend to visit mcuxpresso.nxp.com, download an sdk package for selected i.MX multicore part/board (evkmcimx7ulp, evkmimx8mq, evkmimx8mm, evkmimx8mn or evkmimx8mp) and have a look at provided multicore examples that show Linux-M4 interaction. Hope it helps.

nxp-mcuxpresso / rpmsg-lite

Invalid and Duplicate vring desc indices from avail #11