nxp-mcuxpresso / rpmsg-lite

RPMsg implementation for small MCUs
BSD 3-Clause "New" or "Revised" License

rpmsg_lite_send timeout not honored (virtqueue_kick may get stuck) #7

Closed fmntf closed 1 year ago

fmntf commented 4 years ago

Hello, I'm running RPMsg-Lite on two Cortex-M4 cores (MX8QM, FreeRTOS 10). The normal use cases work fine, but I'm now testing edge cases.

For instance, if one node transmits a message with rpmsg_lite_send(...., timeout=100ms) and the other node is stuck (e.g., it is in assert(false) or paused by JTAG), then rpmsg_lite_send() does not return an error after 100 ms. This happens because virtqueue_kick() is called without any reference to the timeout. In my case it ends up calling MU_SendMsg(), which spins in an infinite while loop waiting for a flag that will never change (since the other M4 is dead).
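The blocking behaviour described above, and one possible bounded alternative, can be sketched as follows. This is only an illustration under stated assumptions: `sim_mu_t` and `sim_mu_send_bounded()` are hypothetical names that simulate a mailbox register in plain C, not the MU driver API, and the poll budget stands in for a real time-based timeout. Note that a timed-out notification would still have to be retried later rather than dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a mailbox peripheral (not the real MU registers). */
typedef struct {
    volatile bool tx_empty;   /* set again only when the remote core reads tx_reg */
    volatile uint32_t tx_reg; /* value delivered to the remote core */
} sim_mu_t;

/*
 * Bounded-wait send: polls the "TX empty" flag at most max_polls times
 * instead of looping forever. Returns true if the message was written,
 * false if the remote side never freed the register within the budget.
 */
static bool sim_mu_send_bounded(sim_mu_t *mu, uint32_t msg, uint32_t max_polls)
{
    assert(mu != NULL);
    for (uint32_t i = 0; i < max_polls; i++) {
        if (mu->tx_empty) {
            mu->tx_reg = msg;
            mu->tx_empty = false; /* register is now occupied */
            return true;
        }
    }
    return false; /* remote core is stuck; report instead of hanging */
}
```

With a dead remote core the flag never changes, so an unbounded loop hangs while the bounded variant returns false and lets the caller report the failure.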

I have worked around this issue by monkey-patching the code in a horrible way. It gets the job done, but I deserve programmer's hell (made of tiny .svn directories everywhere, for eternity). However, I would like to know what you think about this and whether you are interested in a more elegant solution.

Thank you! Francesco

MichalPrincNXP commented 4 years ago

Hello @fmntf, thank you for reporting this. I think the MU_SendMsg() call in platform_notify() should be replaced by the non-blocking version of this function, i.e. MU_SendMsgNonBlocking(). As you can see from the comments in the porting layer for lpc55s69, for instance, there is no need to use the blocking send function and wait until the previous message is consumed by the receiver side, because the same virtqueue ID value is written into the TX register each time the ISR is triggered on the receiver side. I will discuss this with the people responsible for the QM port and get back to you after my vacation, ok? Regards, Michal
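The reasoning above (overwriting a pending notification is harmless when the value is always the same) can be illustrated with a small simulation. It is a sketch only: `sim_mailbox_t`, `sim_notify_nonblocking()` and `sim_receiver_isr()` are hypothetical names, not part of RPMsg-Lite or the MU driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-slot mailbox shared between two cores. */
typedef struct {
    uint32_t tx_reg; /* last virtqueue ID written by the sender */
    bool pending;    /* true until the receiver ISR consumes tx_reg */
} sim_mailbox_t;

/* Non-blocking notify: writes the ID without waiting for the receiver. */
static void sim_notify_nonblocking(sim_mailbox_t *mb, uint32_t vq_id)
{
    assert(mb != NULL);
    mb->tx_reg = vq_id; /* may overwrite an unconsumed value */
    mb->pending = true;
}

/* Receiver ISR: returns true and the virtqueue ID if a kick is pending. */
static bool sim_receiver_isr(sim_mailbox_t *mb, uint32_t *vq_id_out)
{
    assert(mb != NULL && vq_id_out != NULL);
    if (!mb->pending) {
        return false;
    }
    *vq_id_out = mb->tx_reg;
    mb->pending = false;
    return true;
}
```

Two back-to-back kicks with the same ID collapse into a single pending notification that still carries the correct ID. If the second kick carried a different ID, the first one would be overwritten and lost.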

fmntf commented 4 years ago

Thank you Michal for your feedback.

Have a nice vacation, Francesco

MichalPrincNXP commented 4 years ago

Hello Francesco, sorry for the late response. I have discussed this case with the colleague who made the QM port. MU_SendMsgNonBlocking() can't be used there because multiple virtual devices are controlled by rpmsg_lite on this device. The MU message identifies the virtual device to be kicked/notified, so a platform_notify() event cannot be lost. Even if a timeout happens, we still need to retry the notification later until it succeeds. The proposed solution is to add a delay chain feature to the porting/environment layer. The delay chain should be something like a callback registry, where each callback is called after a certain delay. That way the order of notifications is not lost and platform_notify() does not block. I have put this feature on the todo list to be implemented in a future release. Regards, Michal
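One way to picture the proposed delay chain is a small FIFO of pending notifications that is retried later. The sketch below is only an interpretation of the idea described above, not the planned implementation: `delay_chain_t`, `chain_notify()`, `chain_drain()` and the `fake_mu_t` transmitter are all hypothetical, and it assumes `chain_drain()` would be called from a timer callback or a TX-empty interrupt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CHAIN_CAPACITY 8u

/* Attempts one send; returns false if the mailbox is still busy. */
typedef bool (*try_send_fn)(uint32_t vq_id, void *ctx);

/* Hypothetical delay chain: ring buffer of virtqueue IDs awaiting delivery. */
typedef struct {
    uint32_t ids[CHAIN_CAPACITY];
    size_t head;
    size_t count;
    try_send_fn try_send;
    void *ctx;
} delay_chain_t;

static void chain_init(delay_chain_t *c, try_send_fn fn, void *ctx)
{
    c->head = 0u;
    c->count = 0u;
    c->try_send = fn;
    c->ctx = ctx;
}

/* Sends queued IDs oldest-first and stops at the first failure,
 * so the order of notifications is preserved. */
static void chain_drain(delay_chain_t *c)
{
    while (c->count > 0u && c->try_send(c->ids[c->head], c->ctx)) {
        c->head = (c->head + 1u) % CHAIN_CAPACITY;
        c->count--;
    }
}

/* Never blocks: queues the ID, then tries to deliver what it can.
 * Returns false only when the chain itself is full. */
static bool chain_notify(delay_chain_t *c, uint32_t vq_id)
{
    if (c->count == CHAIN_CAPACITY) {
        return false;
    }
    c->ids[(c->head + c->count) % CHAIN_CAPACITY] = vq_id;
    c->count++;
    chain_drain(c);
    return true;
}

/* Fake transmitter for demonstration: one-slot mailbox plus a send log. */
typedef struct {
    bool tx_empty;
    uint32_t sent[CHAIN_CAPACITY];
    size_t n_sent;
} fake_mu_t;

static bool fake_try_send(uint32_t vq_id, void *ctx)
{
    fake_mu_t *mu = ctx;
    if (!mu->tx_empty || mu->n_sent == CHAIN_CAPACITY) {
        return false;
    }
    mu->sent[mu->n_sent++] = vq_id;
    mu->tx_empty = false; /* occupied until the receiver reads it */
    return true;
}
```

In this sketch a stuck receiver only causes the chain to fill up. The caller returns immediately, and queued IDs are delivered in their original order once the mailbox frees up and `chain_drain()` runs again.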