openthread / ot-br-posix

OpenThread Border Router, a Thread border router for POSIX-based platforms.
https://openthread.io/
BSD 3-Clause "New" or "Revised" License

Communication breakage between nrf52840 chip and host (imx6ull) on thread layer stress testing #1868

Open vignesh-ravikumar opened 1 year ago

vignesh-ravikumar commented 1 year ago

When stress testing Thread-layer communication for a period of time (around 30 minutes to 1 hour), the UART link (115200 baud) between the host (imx6ull-based SoC) and the nrf52840 chip stops working, and it does not recover until otbr-agent is restarted.

Continuous commands were sent from controller to the development kit to reproduce the issue.

  1. 868b8d791fac9752d154ef0f0614ca15019056a9 - otbr-agent commit id (https://github.com/openthread/ot-br-posix)
  2. IEEE 802.15.4 hardware platform - nrf52840
  3. Build was done using Yocto recipe.
  4. Network topology - Star topology (direct communication between controller and development kit)

Expected behaviour: communication between the nrf52840 chip and the host (imx6ull-based SoC) should not break.

Additional questions: Should we increase the transmit buffer size, and if so, how can that be done? Alternatively, is there a way to raise the tx timeout of a packet, and would that help resolve the issue?

We found a similar issue, https://github.com/openthread/ot-br-posix/issues/1193, but it did not help.

otbr-agent logs during failure:

May 10 09:48:03 12000137 bash[429]: otbr-agent[429]: 00:30:05.244 [W] Platform------: radio tx timeout
May 10 09:48:03 12000137 bash[429]: otbr-agent[429]: 00:30:05.244 [C] Platform------: HandleRcpTimeout() at radio_spinel_impl.hpp:2275: RadioSpinelNoResponse
May 10 09:48:03 12000137 otbr-agent[429]: 00:30:05.244 [W] Platform------: radio tx timeout
May 10 09:48:03 12000137 otbr-agent[429]: 00:30:05.244 [C] Platform------: HandleRcpTimeout() at radio_spinel_impl.hpp:2275: RadioSpinelNoResponse
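For longer captures it can help to pull out only the critical RCP timeout entries together with the otbr-agent uptime at which they occurred. The following is a minimal sketch, assuming the log format shown in the lines above (uptime stamp, bracketed level letter, dash-padded region name); the function names `uptime_seconds` and `find_rcp_timeouts` are made up for illustration and are not part of otbr-agent.

```python
import re

# Matches the otbr-agent portion of a syslog line, e.g.
# "00:30:05.244 [C] Platform------: HandleRcpTimeout() at ...".
# The syslog wall-clock time has no milliseconds, so it is not matched.
LOG_RE = re.compile(
    r"(?P<uptime>\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<level>[A-Z])\] "
    r"(?P<region>\w+)-*: "
    r"(?P<message>.*)$"
)


def uptime_seconds(stamp):
    """Convert an 'HH:MM:SS.mmm' uptime stamp to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def find_rcp_timeouts(lines):
    """Return (uptime_in_seconds, message) for each critical
    RadioSpinelNoResponse entry found in the given log lines."""
    hits = []
    for line in lines:
        m = LOG_RE.search(line)
        if m and m["level"] == "C" and "RadioSpinelNoResponse" in m["message"]:
            hits.append((uptime_seconds(m["uptime"]), m["message"]))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: journalctl -u otbr-agent | python find_timeouts.py
    for seconds, message in find_rcp_timeouts(sys.stdin):
        print(f"{seconds:.3f}s  {message}")
```

Comparing the reported uptimes across several runs shows whether the failure appears after a roughly constant interval or varies with traffic load.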


Any help would be much appreciated. Thanks in advance

jwhui commented 1 year ago

The log line:

May 10 09:48:03 12000137 otbr-agent[429]: 00:30:05.244 [C] Platform------: HandleRcpTimeout() at radio_spinel_impl.hpp:2275: RadioSpinelNoResponse

typically indicates that the RCP is not able to respond for some reason.

One thing to try is increasing the baudrate to 460800 or higher.

If the issue still occurs, you can try hooking a debugger up to the RCP to see why the RCP is not able to respond to commands from the host. Otherwise, you can try reaching out to Nordic for support.
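To see why the baud rate matters, it is useful to compare how long one maximum-size frame spends on the UART with how long it spends on the air. The sketch below is a back-of-the-envelope estimate only. It assumes 8N1 UART framing (10 bits on the wire per byte) and an illustrative 20 bytes of Spinel/HDLC overhead per frame; that overhead figure is a placeholder, not a measured value, and HDLC byte stuffing can add more.

```python
# Rough comparison of UART transfer time and radio air time for one frame.
# ASSUMED_OVERHEAD_BYTES is a hypothetical figure chosen for illustration.

RADIO_BITRATE = 250_000      # IEEE 802.15.4 O-QPSK PHY at 2.4 GHz, bits/s
MAX_FRAME_BYTES = 127        # maximum 802.15.4 PHY payload (PSDU)
ASSUMED_OVERHEAD_BYTES = 20  # assumption: Spinel header + HDLC framing
BITS_PER_UART_BYTE = 10      # start bit + 8 data bits + stop bit (8N1)


def uart_frame_time_ms(baud, payload=MAX_FRAME_BYTES):
    """Time in ms to move one frame plus assumed overhead over the UART."""
    total_bytes = payload + ASSUMED_OVERHEAD_BYTES
    return total_bytes * BITS_PER_UART_BYTE / baud * 1000.0


def radio_frame_time_ms(payload=MAX_FRAME_BYTES):
    """Air time in ms of the frame payload at the 802.15.4 data rate."""
    return payload * 8 / RADIO_BITRATE * 1000.0


if __name__ == "__main__":
    print(f"radio air time: {radio_frame_time_ms():.2f} ms")
    for baud in (115200, 460800, 1000000):
        print(f"UART @ {baud}: {uart_frame_time_ms(baud):.2f} ms")
```

Under these assumptions a full-size frame takes longer to cross the UART at 115200 baud than it takes to transmit over the air, so the host link can become the bottleneck under sustained traffic; at 460800 baud the UART time drops below the air time.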

chava33 commented 1 year ago

Anyone found the solution? @jwhui I see a similar issue.

jwhui commented 1 year ago

> Anyone found the solution? @jwhui I see a similar issue.

@chava33 , can you provide more information about your setup? Enough information for others to reproduce the issue?

chava33 commented 1 year ago

@jwhui We see a similar issue to the one mentioned above:

May 15 17:22:00 raspberrypi otbr-agent[3147]: 00:08:33.060 [W] Platform------: radio tx timeout
May 15 17:22:00 raspberrypi otbr-agent[3147]: 00:08:33.060 [C] Platform------: HandleRcpTimeout() at radio_spinel_impl.hpp:2248: RadioSpinelNoResponse

One CC2652R1 (RCP) acts as Leader and Border Router. OTBR is running on a Raspberry Pi (March 2023). We have 20 sensors in a star topology. After around 5-20 minutes, otbr-agent exits with the error message above.

We increased the UART baud rate to 460800 but still have the same issue.

If I ping with a 100-byte payload from the sensors, I can reproduce the issue much faster, with only 6 sensors (nrf53840).

I don't have a debugger set up to investigate the RCP side. Is there anything you could suggest?

version: OPENTHREAD/; POSIX; May 3 2023 14:38:20

rcp version:(rcp_CC26X2R1_LAUNCHXL_tirtos7_ticlang_sdk_7_10_00_98.bin) TI-OPENTHREAD/1.2.4.0; CC1352; May 12 2023 13:12:48

jwhui commented 1 year ago

@chava33 , have you tried using a different RCP? For example, using the nRF as an RCP?

chava33 commented 1 year ago

Thanks @jwhui. Not yet; I can try the nRF and update you. Do you think it is an RCP-related issue? The -d 7 option does not give me any more information about which command timed out without a response. Do I need to enable debugging for otbr-agent in code? This is the command I run:

otbr-agent -I wpan0 -B eth0 spinel+hdlc+uart:///dev/ttyACM0 trel://eth0 -d 7
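The radio argument in the command above is an ordinary URL, so UART settings such as the baud rate travel as query parameters. Here is a small sketch that builds and inspects such a URL. It assumes the parameter name `uart-baudrate`, which is not shown anywhere in this thread and should be checked against the help output of your otbr-agent build; the helper names `build_radio_url` and `describe_radio_url` are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_radio_url(device, **params):
    """Build a spinel+hdlc+uart radio URL for the given device path.

    Keyword names use underscores and are converted to the dashed form,
    e.g. uart_baudrate=460800 becomes 'uart-baudrate=460800'.
    """
    query = urlencode({k.replace("_", "-"): v for k, v in params.items()})
    url = f"spinel+hdlc+uart://{device}"
    return f"{url}?{query}" if query else url


def describe_radio_url(url):
    """Split a radio URL into scheme, device path, and parameters."""
    parts = urlsplit(url)
    params = {k: v[0] for k, v in parse_qs(parts.query).items()}
    return {"scheme": parts.scheme, "device": parts.path, "params": params}


if __name__ == "__main__":
    # Assumption: 'uart-baudrate' is the parameter name your build accepts.
    url = build_radio_url("/dev/ttyACM0", uart_baudrate=460800)
    print(url)
    print(describe_radio_url(url))
```

Printing the parsed URL before launching otbr-agent is a quick way to confirm which baud rate the host side is actually being asked to use; the RCP firmware has to be configured for the same rate.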

jwhui commented 1 year ago

The RCP not responding typically indicates that it is an RCP issue.

yplam commented 1 year ago

I am using ot-br-posix c0897a70b907410d9be87eab16c41910bb14a000 with ot-nrf528xx 7dc1f1406f437807cdafd6c53a974d76439b6a2b as the RCP and an mt7688A running openwrt as the border router, and I see a similar issue.

But when I use ot-br-posix 486b8834bcd94649fae65d6f2c0b45142e01e0d2 with the same rcp, everything works fine.

@jwhui maybe it's a bug introduced by 21c5b675d883ba5e18415ba3d5dbb7ffc539829b ?

jwhui commented 1 year ago

@yplam , I don't think https://github.com/openthread/ot-br-posix/commit/21c5b675d883ba5e18415ba3d5dbb7ffc539829b has anything to do with RCP since it only touches the REST API.

The list of changes between the commits you mentioned is here:

The list of changes for the OpenThread submodule update is here:

Reviewing the list of changes, I don't see what would affect communication with RCP.

You noted that you are using OpenWRT. One related commit in the above is: