openthread / ot-br-posix

OpenThread Border Router, a Thread border router for POSIX-based platforms.
https://openthread.io/
BSD 3-Clause "New" or "Revised" License
420 stars 235 forks source link

The frame count value of the link layer of the TBR changes unexpectedly. #2482

Closed jinpeng1989 closed 1 month ago

jinpeng1989 commented 2 months ago

Describe the bug A clear and concise description of what the bug is. Normally, the frame count at the link layer of an outgoing packet should increase only. From the capture of packets, the frame count value will change randomly.

The packets of Sniffer_00004_20240909080000_00000_20240909080011.pcapng Frame 210273 -> 210399: 256 -> 1256 Frame 219115 -> 219218: 5743 -> 6744 Frame 233260 -> 233469: 10857 -> 11857 Frame 498721 -> 499047: 27633 -> 1737 The packets of Sniffer_00004_20240909080000_00001_20240909160653.pcapng Frame 170491 -> 171713: 4973 -> 2595 Frame 181418 -> 181879: 2663 -> 3663 Frame 186960 -> 187946: 3711 -> 2693 Frame 203442 -> 204144: 2809 -> 2767 Frame 227400 -> 228364: 2874 -> 3874 Frame 251150 -> 252659: 3999 -> 2969 Frame 264497 -> 264724: 3368 -> 27637 Frame 284917 -> 285116: 27833 -> 3133 Frame 360593 -> 361656: 3536 -> 4536 Frame 378225 -> 379158: 4615 -> 3627 Frame 479837 -> 481744: 4107 -> 5107 1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

jinpeng1989 commented 2 months ago

Network key: 20358d714e6344bce360d875089a0cf1 RLOC16 of TBR: wpan.src16 == 0xf400

Sniffer_00004_20240909080000_00000_20240909080011.zip

Sniffer_00004_20240909080000_00001_20240909160653.zip

Sniffer_00004_20240909080000_00002_20240910065541.zip

jinpeng1989 commented 2 months ago

Once the frame count value becomes smaller, the encrypted packet of the Mac layer sent by the TBR will be dropped by the neighbor device. This issue comes up frequently. It can take hours or days for the network to return to normal. Because the frame count value needs to increase to the previous frame count maximum. Then the neighboring device will accept the new packet.

jinpeng1989 commented 2 months ago

I check another set of test scenarios, 1 TBR + 1 SED. The frame count value did not decrease, but it suddenly increased by 1000 on three occasions. Increasing by 1,000 doesn't seem random. It's more like design. In what case add 1000 instead of 1? Frame 27048 -> 27085: 16054 -> 17054 Frame 31368 -> 31408: 18969 -> 19969 Frame 33708 -> 33788: 10857 -> 21974 21 22 23

jwhui commented 2 months ago

The frame count value did not decrease, but it suddenly increased by 1000 on three occasions. Increasing by 1,000 doesn't seem random. It's more like design. In what case add 1000 instead of 1?

To minimize writes to non-volatile storage, OpenThread stores the current frame counter + 1000. Observing the frame counter increase by 1000 means that the OpenThread process has restarted.

https://github.com/openthread/openthread/blob/7673194af6c4ede079afc8a9af24f4b0493373a2/src/core/config/misc.h#L272

jwhui commented 2 months ago

In most cases, the radio driver itself increments the MAC Frame Counter. Typically, OT will only set the MAC Frame Counter when the process is starting or when recovering from an unexpected RCP reset.

There was a bug related to RCP restoration that was recently fixed:

However, if RCP restoration is getting invoked frequently, this would indicate that the RCP is resetting unexpectedly and the root cause of those reset failures should be found. RCP restoration is just sweeping the underlying bug under the covers. One way to really make obvious whether unexpected RCP resets is causing issues is by disabling the RCP restoration feature by setting the following to 0 at build time:

OPENTHREAD_SPINEL_CONFIG_RCP_RESTORATION_MAX_COUNT
jinpeng1989 commented 2 months ago

According to the analysis of logs, from 11 o 'clock on September 9 to 13 o 'clock on September 10, otbr-agent continued to run without restarting. During this period, otbr-agent detected 24 "RCP failure detected" and attempted to recover the RCP. The corresponding "RCP failure detected" can be found for each frame count value change. So, indeed, as you state, the root cause is "RCP failure detected".

RCP_failure.log image

jwhui commented 2 months ago

Yes, I suggest resolving the frequent RCP failures.

jinpeng1989 commented 2 months ago

Reproduced the issue again, the otbr-agent log is set to debug. The log is attached. It may be helpful to analyze this issue. RCP_failure detected_2024-09-12.zip

jwhui commented 2 months ago

All of the failures are due to:

otbr-agent[31734]: 00:03:55.361 [W] P-RadioSpinel-: radio tx timeout

This indicates that the radio driver is not responding to the TX request in a timely manner.

jinpeng1989 commented 2 months ago

It is worth mentioning that when switching back to GSDK 4.4.3. After a week of testing, the "RCP failure detected" issue did not reproduced. This issue is easily reproduced in the Simplicity SDK, so it can be basically concluded. GSDK 4.3.3 does not have this issue.

jwhui commented 1 month ago

Thanks for the update. Feel free to close this issue if it's resolved.