paullouisageneau / libdatachannel

C/C++ WebRTC network library featuring Data Channels, Media Transport, and WebSockets
https://libdatachannel.org/
Mozilla Public License 2.0
1.79k stars 361 forks source link

SIGABRT in IceTransport::TimeoutCallback with libnice #1102

Closed DarthBo closed 7 months ago

DarthBo commented 9 months ago

I've seen some occasional crashes in the field that usually seem to come from IceTransport::TimeoutCallback.

I've had a look at the implementation of the libnice backend, I'm assuming it's a lifetime issue with the timer stuff, but I'm afraid my lack of experience with glib makes it hard for me to figure out what's going wrong. I also don't have a way to reproduce it yet, so I'm kind of hoping you'll take one look at the relevant code and say "aha, whoops" :sweat_smile:

libdatachannel version: 0.19.5 libdatachannel build flags: NO_WEBSOCKET, NO_MEDIA, USE_NICE libnice version: 0.1.21 libglib version: 2.72.3

last log prints:

14:52:44:731 [Debug] Peer Connection state changed: connecting
14:52:45:086 [Debug] Peer Connection state changed: connected
14:52:46:983 [Debug] Peer Connection state changed: disconnected
14:52:46:984 [Debug] Peer Connection state changed: closed
14:52:48:299 [Warning] rtc::impl::IceTransport::processTimeout@745: ICE timeout 

stacktrace:

Fatal Error: SIGABRT / 0xfffffffa / 0x190

0   libc.so.6                       0xffffb9b73a60      __pthread_kill_implementation
1   libc.so.6                       0xffffb9b3b3dc      __GI_raise
2   libc.so.6                       0xffffb9b2b224      __GI_abort
3   libc.so.6                       0xffffb9b35bc0      __assert_fail_base
4   libc.so.6                       0xffffb9b35c24      __GI___assert_fail
5   libc.so.6                       0xffffb9b74d34      ___pthread_mutex_lock
6   libdatachannel.so.0.19          0xffffba99c988      [inlined] __gthread_mutex_lock (gthr-default.h:749)
7   libdatachannel.so.0.19          0xffffba99c988      [inlined] __gthread_recursive_mutex_lock (gthr-default.h:811)
8   libdatachannel.so.0.19          0xffffba99c988      [inlined] std::recursive_mutex::lock (mutex:108)
9   libdatachannel.so.0.19          0xffffba99c988      [inlined] std::lock_guard<T>::lock_guard (std_mutex.h:229)
10  libdatachannel.so.0.19          0xffffba99c988      [inlined] rtc::synchronized_callback<T>::operator() (utils.hpp:80)
11  libdatachannel.so.0.19          0xffffba99c988      rtc::impl::Transport::changeState (transport.cpp:64)
12  libdatachannel.so.0.19          0xffffba95a130      rtc::impl::IceTransport::TimeoutCallback (icetransport.cpp:844)
13  libglib-2.0.so.0                0xffffbb5441fc      g_timeout_dispatch
14  libglib-2.0.so.0                0xffffbb543814      g_main_context_dispatch
15  libglib-2.0.so.0                0xffffbb543a58      g_main_context_iterate.constprop.0
16  libglib-2.0.so.0                0xffffbb543e80      g_main_loop_run
17  libstdc++.so.6                  0xffffb9ddfc4c      execute_native_thread_routine
18  libc.so.6                       0xffffb9b72170      start_thread
19  libc.so.6                       0xffffb9bc7f18      thread_start
paullouisageneau commented 9 months ago

I think there is a possible race condition as the destructor for IceTransport first disarm the timer then destroys the agent. However, it would be possible that the agent changes states between the two and rearms the timer, resulting in IceTransport::TimeoutCallback being called later for a destroyed object.

Could you please check whether https://github.com/paullouisageneau/libdatachannel/pull/1103 fixes the issue?

DarthBo commented 9 months ago

Oof, I definitely never would have thought of that. Thanks for looking into it!

I'll pull the patch and see if we can reproduce it, will report back :)

paullouisageneau commented 8 months ago

Any news about this?

DarthBo commented 8 months ago

Oh sorry, I didn't realise it has already been 3 weeks :scream:

I haven't seen the issue since applying the patch, but I never really bumped into it before myself either. I was hoping to push a build to our beta testers (who have bumped into it) last week, but unrelated issues delayed that release.

It might be another couple of weeks before I can safely say it's fixed :sweat_smile:

DarthBo commented 8 months ago

I haven't seen a single libnice/datachannel related crash so far, so it certainly looks good :)

paullouisageneau commented 7 months ago

Great, thank you for testing!