paullouisageneau / libdatachannel

C/C++ WebRTC network library featuring Data Channels, Media Transport, and WebSockets
https://libdatachannel.org/
Mozilla Public License 2.0
1.82k stars 365 forks source link

Destructor hanging indefinitely #355

Closed sampleni closed 3 years ago

sampleni commented 3 years ago

I am currently experiencing an issue with the windows version of this library. My current set up has clients using the windows version of this library establishing connections with a linux server that reboots on a regular basis. Anytime this server reboots with connections active, those windows connections will hang indefinitely in the destructor for their connection. I don't seem to experience this issue in the other direction though, only the windows versions experience this hang.

This is the logic flow that creates the issue 1) windows client establishes control connection with server through socket.io 1) windows client establishes rtc::PeerConnection with linux server 2) windows client establishes rtc::DataChannel with linux server 3) data is successfully exchanged over rtc:DataChannel 4) linux server restarts 5) windows client recognizes that connection in socket.io is no longer active 6) windows client closes rtc::PeerConnection with rtc::PeerConnection::close() 7) windows client calls destructor for rtc::PeerConnection 8) rtc::PeerConnection destructor hangs indefinitely in rtc::Processor::join()

When looking into this it appears that rtc::Processor::mTasks still contains items and is never emptied. I did see this pull request https://github.com/paullouisageneau/libdatachannel/commit/13b3623759b23cc761bb926a789fb55efe3037a1 that looked like a fix for something similar at least if not my issue. I also saw that it was since pulled again with this pull request https://github.com/paullouisageneau/libdatachannel/commit/82568e3aa0d7df4b4d781574dd4429f91c975b61 but I didn't see any more information about either of these explaining the issue.

Any assistance on this would be appreciated, I've been stuck on this for a minute now.

paullouisageneau commented 3 years ago

Thank you for reporting, what version do you use? I guess it's 0.11.x, since rtc::PeerConnection internal logic has been moved to rtc::impl::PeerConnection on the master branch.

The Processor of PeerConnection handles the callbacks, are you sure all callbacks of the peer connection returned? They all need to finish first.

The double fix you reference is for a slightly different scenario: the thread pool is joined after the program exists the main function, which could cause a deadlock if a closing peer connection was still around, since the program could wait for an SCTP transport to close while the transport was itself waiting for tasks the thread pool would not pick up anymore.

sampleni commented 3 years ago

Originally I had the issue using 0.9.0. I tried updating to the master branch on 3/3/21 and am still hitting what appears to be the same issue. I'm not entirely sure what callbacks you're referring to, but I am getting the callback I attach to rtc::PeerConnection::onStateChange firing with a updated state of rtcState::RTC_CLOSED and rtc::PeerConnection::onGatheringStateChange firing with an updated state of rtcGatheringState::RTC_GATHERING_COMPLETE before my destructor is called.

paullouisageneau commented 3 years ago

OK, from where is the destructor called? Do you have code reproducing the issue?

sampleni commented 3 years ago

Sorry, I actually have to correct myself there. The callback attached to rtc::PeerConnection::onStateChange does not fire. I was seeing it from another instance in my logging and missed it. When I adjust to try and make sure I wait for the callback to be fired, it seems to just be hanging there though and never actually being triggered. I won't have time today to get you a reproduction example, but I will work to get one for you next week.

sampleni commented 3 years ago

And the destructor is being called from a few layers up as a nested class member of something that is being destroyed.

sampleni commented 3 years ago

Wanted to give you a heads up that we have identified the issue and it was that there was a certain scenario where we were not waiting on all the callbacks for the connection. For what it's worth though, something in the documentation stating that closure/cleanup cannot be done while waiting for these callbacks to fire would be helpful. I may have missed it, but I didn't see anything mentioning it.

paullouisageneau commented 3 years ago

Sure, I'll do, I have been working on a C API documentation (see DOC.md), but I've yet to do the same for the C++ API.