openthread / ot-br-posix

OpenThread Border Router, a Thread border router for POSIX-based platforms.
https://openthread.io/
BSD 3-Clause "New" or "Revised" License

otbr-agent received signal SIGSEGV, Segmentation fault. (PriorityQueue) #2475

Closed jinpeng1989 closed 1 week ago

jinpeng1989 commented 1 month ago

Describe the bug: The otbr-agent process crashed, and GDB debugging showed that the fault occurred near the PriorityQueue code.

The ot-br-posix code used is: https://github.com/SiliconLabs/simplicity_sdk/tree/v2024.6.1-0/util/third_party/ot-br-posix

Release note: https://github.com/SiliconLabs/simplicity_sdk/releases/tag/v2024.6.1-0

1170286e646140d3714b7a2d0178196

jinpeng1989 commented 1 month ago

The crash occurs at the location shown in the attached screenshot; please also see the log: SyslogCatchAll-2024-08-28-1-and-2.zip

jwhui commented 1 month ago

Are you able to reference a specific GitHub commit in an OpenThread repo? I did look at the Simplicity SDK link you provided above, but it wasn't obvious which OpenThread repo commit it was using.

Can you provide more details on the specific test scenario so that others can reproduce this issue?

abtink commented 1 month ago

Thanks for reporting this.

~@jwhui and I investigated this and found a potential cause for this situation.~

~This scenario can occur when IPv6 fragmentation is enabled and utilized. Could you confirm whether you have OPENTHREAD_CONFIG_IP6_FRAGMENTATION_ENABLE enabled in your project?~

~Brief description of the issue:~

~- A message using IPv6 fragmentation can be placed in Ip6::mReassemblyList even if it's also marked for transmission to the Thread mesh.~
~- This can lead to the message being included in two separate queues, which is not allowed and causes the assert.~
~- I'll submit a PR later to address this.~

Please ignore my earlier comment. On further investigation, there is no issue here: a clone of the message is allocated and added to Ip6::mReassemblyList, so the original message is not placed in two queues.
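The invariant discussed above (a message may belong to at most one queue at a time, and cloning avoids violating it) can be illustrated with a minimal sketch. This is not OpenThread's actual code: `Msg`, `Queue`, and all of their members are hypothetical names chosen for illustration, and `Enqueue()` returns `false` where a real stack would typically assert.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified model. NOT OpenThread's Message/PriorityQueue.
struct Queue;

struct Msg
{
    int    payload = 0;
    Msg   *next    = nullptr; // intrusive link: a message carries only one
    Queue *owner   = nullptr; // queue currently holding this message, if any

    bool IsInAQueue() const { return owner != nullptr; }

    // Returns an independent copy that is not linked into any queue.
    Msg Clone() const
    {
        Msg copy;
        copy.payload = payload;
        return copy;
    }
};

struct Queue
{
    Msg *head = nullptr;

    // Adds the message at the head. Fails if it is already in some queue
    // (a real implementation would likely assert here instead).
    bool Enqueue(Msg &m)
    {
        if (m.IsInAQueue())
        {
            return false;
        }
        m.owner = this;
        m.next  = head;
        head    = &m;
        return true;
    }

    // Unlinks the message. Fails if this queue does not own it.
    bool Dequeue(Msg &m)
    {
        if (m.owner != this)
        {
            return false;
        }
        Msg **link = &head;
        while (*link != nullptr && *link != &m)
        {
            link = &(*link)->next;
        }
        if (*link == nullptr)
        {
            return false;
        }
        *link   = m.next;
        m.next  = nullptr;
        m.owner = nullptr;
        return true;
    }
};
```

In this model, enqueueing the same `Msg` into a second queue is rejected because its single intrusive link is already in use, while enqueueing the result of `Clone()` succeeds.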

jinpeng1989 commented 1 month ago

The release note for the simplicity_sdk describes the code repository used:

The Silicon Labs OpenThread SDK includes all changes from the OpenThread GitHub repo (https://github.com/openthread/openthread) up to and including commit 1fceb225b.

The Silicon Labs OpenThread SDK includes all changes from the OpenThread border router GitHub repo (https://github.com/openthread/ot-br-posix) up to and including commit e56c02006.

https://www.silabs.com/documents/public/release-notes/open-thread-release-notes-2.5.1.0.pdf

jwhui commented 1 month ago

Can you provide more information about your HW setup? Are you running this on a Raspberry Pi?

This is the first time we've seen this bug reported, so just trying to understand if there's an issue related to your specific setup.

jinpeng1989 commented 1 month ago

We discovered the issue during a system test involving five device models: three are SEDs, one is a TBR, and one is a REED. The system consists of 1 TBR + 16 TME + 84 SED. However, a large Thread network is not a necessary condition for this issue: the otbr-agent crash has also been observed in small systems. One notable detail is that both the diagnostics and mesh diagnostics interfaces are accessed.

jinpeng1989 commented 1 month ago

otbr-posix runs on an OpenWRT system. This solution has been in use for two or three years. The otbr-posix code was recently updated to introduce the mesh diagnostic feature, and many issues occur frequently on this version.

jwhui commented 1 month ago

From the stack trace in https://github.com/openthread/ot-br-posix/issues/2475#issue-2507427967, it appears that this assert is getting triggered:

https://github.com/openthread/openthread/blob/4459c54069bb8573579aa4e84c3c6cb6ea82b1cf/src/core/common/message.cpp#L901-L902

However, the first thing that HandleSendQueue() does is call Dequeue(), which does this:

https://github.com/openthread/openthread/blob/4459c54069bb8573579aa4e84c3c6cb6ea82b1cf/src/core/common/message.cpp#L948-L951

So it's not clear yet why the asserts are failing.

jinpeng1989 commented 1 month ago

This issue occurred frequently in our test environment, and was observed at least once in five days. What can we do to further analyze this issue?

jwhui commented 1 month ago

> This issue occurred frequently in our test environment, and was observed at least once in five days. What can we do to further analyze this issue?

If possible, you can help analyze the code path identified in https://github.com/openthread/ot-br-posix/issues/2475#issuecomment-2333217331 and determine where the assert conditions are no longer true.

abtink commented 1 month ago

I would suggest checking whether OPENTHREAD_CONFIG_IP6_FRAGMENTATION_ENABLE is enabled in your build.

If it is enabled, it would be good to see if you can disable it and test again (this would give a clue whether the fragmentation logic may be impacting this).