openthread / ot-br-posix

OpenThread Border Router, a Thread border router for POSIX-based platforms.
https://openthread.io/
BSD 3-Clause "New" or "Revised" License
405 stars 228 forks source link

Crash in OTBR related to SRP #2519

Open agners opened 6 days ago

agners commented 6 days ago

Describe the bug A clear and concise description of what the bug is.

I've encountered a crash using a relatively recent commit ff7227ea9a2dc0e50f92ab93bd5c9a29e82fe793. It coincided with the shutdown of another OTBR on the same network (this might be by chance, but the timing is very suspicious). This has been observed with the Home Assistant OpenThread Border Router add-on 2.11.0.

otbr-agent: /usr/src/ot-br-posix/third_party/openthread/repo/src/core/net/udp6.cpp:178: ot::Error ot::Ip6::Udp::Open(SocketHandle&, ReceiveHandler, void*): Assertion `!IsOpen(aSocket)' failed.
00:03:35.331 [C] Platform------: ------------------ BEGINNING OF CRASH -------------
00:03:35.331 [C] Platform------: *** FATAL ERROR: Caught signal: 6 (Aborted)
00:03:35.351 [C] Platform------: # 0: /usr/sbin/otbr-agent(+0x38d660) [0x5579a8d660]
00:03:35.351 [C] Platform------: # 1: /usr/sbin/otbr-agent(+0x38d7a0) [0x5579a8d7a0]
00:03:35.351 [C] Platform------: # 2: linux-vdso.so.1 __kernel_rt_sigreturn+0x0 [0xb5eb97a0]
00:03:35.351 [C] Platform------: # 3: /lib/aarch64-linux-gnu/libc.so.6(+0x809f0) [0x7fb59a09f0]
00:03:35.351 [C] Platform------: # 4: /lib/aarch64-linux-gnu/libc.so.6 gsignal+0x1c [0xb595a72c]
00:03:35.351 [C] Platform------: # 5: /lib/aarch64-linux-gnu/libc.so.6 abort+0xf0 [0xb594747c]
00:03:35.351 [C] Platform------: # 6: /lib/aarch64-linux-gnu/libc.so.6(+0x341a4) [0x7fb59541a4]
00:03:35.351 [C] Platform------: # 7: /lib/aarch64-linux-gnu/libc.so.6(+0x3420c) [0x7fb595420c]
00:03:35.351 [C] Platform------: # 8: /usr/sbin/otbr-agent ot::Ip6::Udp::Open(ot::Ip6::Udp::SocketHandle&, void (*)(void*, otMessage*, otMessageInfo const*), void*)+0x5c [0x799cfe0c]
00:03:35.351 [C] Platform------: # 9: /usr/sbin/otbr-agent ot::Ip6::Udp::Socket::Open()+0x38 [0x799cfa54]
00:03:35.351 [C] Platform------: #10: /usr/sbin/otbr-agent ot::Srp::Client::Start(ot::Ip6::SockAddr const&, ot::Srp::Client::Requester)+0x78 [0x799c14bc]
00:03:35.351 [C] Platform------: #11: /usr/sbin/otbr-agent ot::Srp::Client::ProcessAutoStart()+0x364 [0x799c5900]
00:03:35.351 [C] Platform------: #12: /usr/sbin/otbr-agent ot::Srp::Client::HandleNotifierEvents(ot::Events)+0x50 [0x799c176c]
00:03:35.351 [C] Platform------: #13: /usr/sbin/otbr-agent ot::Notifier::EmitEvents()+0x148 [0x7997b0d8]
00:03:35.351 [C] Platform------: #14: /usr/sbin/otbr-agent ot::TaskletIn<ot::Notifier, &ot::Notifier::EmitEvents>::HandleTasklet(ot::Tasklet&)+0x18 [0x7997b978]
00:03:35.351 [C] Platform------: #15: /usr/sbin/otbr-agent ot::Tasklet::RunTask()+0x1c [0x7997e6f0]
00:03:35.351 [C] Platform------: #16: /usr/sbin/otbr-agent ot::Tasklet::Scheduler::ProcessQueuedTasklets()+0x68 [0x7997e694]
00:03:35.351 [C] Platform------: #17: /usr/sbin/otbr-agent otTaskletsProcess+0x38 [0x79a2ac20]
00:03:35.351 [C] Platform------: #18: /usr/sbin/otbr-agent otbr::Ncp::RcpHost::Process(otbr::MainloopContext const&)+0x1c [0x79a3fd20]
00:03:35.351 [C] Platform------: #19: /usr/sbin/otbr-agent otbr::MainloopManager::Process(otbr::MainloopContext const&)+0x68 [0x79a56ec4]
00:03:35.351 [C] Platform------: #20: /usr/sbin/otbr-agent otbr::Application::Run()+0x1e8 [0x7990ac34]
00:03:35.351 [C] Platform------: #21: /usr/sbin/otbr-agent(+0x2107e0) [0x55799107e0]
00:03:35.351 [C] Platform------: #22: /usr/sbin/otbr-agent main+0x88 [0x799109a0]
00:03:35.351 [C] Platform------: #23: /lib/aarch64-linux-gnu/libc.so.6(+0x27740) [0x7fb5947740]
00:03:35.351 [C] Platform------: #24: /lib/aarch64-linux-gnu/libc.so.6 __libc_start_main+0x98 [0xb5947818]
00:03:35.351 [C] Platform------: #25: /usr/sbin/otbr-agent _start+0x30 [0x7990a570]
00:03:35.351 [C] Platform------: ------------------ END OF CRASH ------------------

To Reproduce Information to reproduce the behavior, including:

  1. Git commit id: ff7227ea9a2dc0e50f92ab93bd5c9a29e82fe793
  2. IEEE 802.15.4 hardware platform: EFR32MG21
  3. Build steps
  4. Network topology

Expected behavior A clear and concise description of what you expected to happen.

Console/log output If applicable, add console/log output to help explain your problem.

Additional context Add any other context about the problem here.

abtink commented 6 days ago

Thanks, @agners.

Based on my analysis, this appears to be a platform UDP issue and not related to SRP client itself. I assume OPENTHREAD_CONFIG_PLATFORM_UDP_ENABLE is enabled and used, which delegates the handling of UDP sockets to the underlying platform (POSIX in this case).

The assertion is happening in Udp::Open() where the socket seems to already be open:

Error Udp::Open(SocketHandle &aSocket, ReceiveHandler aHandler, void *aContext)
{
    Error error = kErrorNone;

    OT_ASSERT(!IsOpen(aSocket));

    aSocket.Clear();
    aSocket.mHandler = aHandler;
    aSocket.mContext = aContext;
    ...

The Srp::Client::ProcessAutoStart() method will always stop the client (which in turn closes the UDP socket) before restarting it again:

    if (IsRunning())
    {
        VerifyOrExit((GetServerAddress() != serverSockAddr) || shouldRestart);
        Stop(kRequesterAuto, kResetRetryInterval);
    }

    ...
    IgnoreError(Start(serverSockAddr, kRequesterAuto));

The Udp::Close() method can fail only when CONFIG_PLATFORM_UDP_ENABLE is used and the platform provided otPlatUdpClose() fails:

Error Udp::Close(SocketHandle &aSocket)
{
    Error error = kErrorNone;

    VerifyOrExit(IsOpen(aSocket));

#if OPENTHREAD_CONFIG_PLATFORM_UDP_ENABLE
    error = otPlatUdpClose(&aSocket);
#endif
    SuccessOrExit(error);

    RemoveSocket(aSocket);
    aSocket.GetSockName().Clear();
    aSocket.GetPeerName().Clear();

exit:
    return error;
}

My suggestion is to investigate the otPlatUdpClose() implementation (whether there are any additional changes in it) and if/how it may fail.