Closed · ibc closed this 4 months ago
@miroslavpejic85, I've not been able to reproduce the crash even by forcing things that IMHO should trigger the crash. Anyway, I think that this PR is gonna fix it.
Is it possible that you try this branch and confirm whether it solves the crash or not?
This flow is super complex but it's what it is:
When calling `webRtcServer.close()` having active TCP connections:

1. `~WebRtcServer` iterates all its WebRtcTransports and calls `webRtcTransport->ListenServerClosed()`.
2. `Transport::ListenServerClosed()` calls `this->listener->OnTransportListenServerClosed()`.
3. So `Router` deletes the transport.
4. `~WebRtcTransport` deletes the `IceServer`.
5. `~IceServer` iterates tuples and calls `this->listener->OnIceServerTupleRemoved(tuple)`.
6. So `WebRtcTransport` calls `tuple->CloseTcpConnection()`.
7. `TransportTuple::CloseTcpConnection()` calls `tcpConnection->TriggerClose()`.
8. So `TcpConnectionHandle` calls `this->listener->OnTcpConnectionClosed()`.
9. So `TcpServerHandle` calls `UserOnTcpConnectionClosed()`, which calls `this->listener->OnRtcTcpConnectionClosed()` [A], and then deletes the `TcpConnectionHandle`.

The problem is [A], since it ends up calling `iceServer->RemoveTuple()`, which calls `listener->OnIceServerTupleRemoved()` again with the same tuple.
> Is it possible that you try this branch and confirm whether it solves the crash or not?
Sure. I need to rebuild the worker on this branch and give it a try, right?

```shell
cd worker; MEDIASOUP_BUILDTYPE='Debug' make
```
Yes. Or just add a git URL pointing to this branch as the mediasoup dependency in the `package.json` of your server app and run `npm install mediasoup`.
> When calling `webRtcServer.close()` having active TCP connections:

Note: the crash occurs despite not utilizing the `WebRtcServer`, but rather relying on the default `WebRtcTransportOptions` for the transports.
The fix in this PR is not related to `WebRtcServer`, so it may be valid (assuming it doesn't crash anymore for you once you try the branch).
Great, then I'll try the fix and keep you updated.
> 5. `~IceServer` iterates tuples and calls `this->listener->OnIceServerTupleRemoved(tuple)`.

I think this can happen wherever `OnIceServerTupleRemoved` is called with a tuple containing a TCP connection that is not closed yet.

In `IceServer::AddTuple`, when the tuples size exceeds `MaxTuples`:

```cpp
this->listener->OnIceServerTupleRemoved(this, removedTuple);

this->tuples.erase(std::next(it).base());
```

What will the iterator point to? This will either crash, or destroy some extra tuple.

In `IceServer::RemoveTuple` it will stop short of recursively calling itself, because at that moment the connection is already closed; but nevertheless it'll dive into `webRtcTransportListener->OnWebRtcTransportTransportTupleRemoved` and `tuple->CloseTcpConnection`, which is rather unnecessary. It's better to block reentry into `IceServer::RemoveTuple` from any method that calls `OnIceServerTupleRemoved`, including `IceServer::RemoveTuple` itself.
> > 5. `~IceServer` iterates tuples and calls `this->listener->OnIceServerTupleRemoved(tuple)`.
>
> I think this can happen wherever `OnIceServerTupleRemoved` is called with a tuple containing a TCP connection that is not closed yet. In `IceServer::AddTuple`, when the tuples size exceeds `MaxTuples`:
>
> ```cpp
> this->listener->OnIceServerTupleRemoved(this, removedTuple);
> this->tuples.erase(std::next(it).base());
> ```
>
> What will the iterator point to? This will either crash, or destroy some extra tuple.
It's indeed crashing, but not sure why. This was done and tested long ago; not sure what has changed. I've just set `static constexpr size_t MaxTuples{ 1 };` in `IceServer.cpp`, connected a client with TCP and triggered an ICE restart (mediasoup demo):

```
mediasoup:Channel request() [method:TRANSPORT_RESTART_ICE] +7s
mediasoup:Channel request succeeded [method:TRANSPORT_RESTART_ICE, id:57] +0ms
mediasoup:WARN:Channel [pid:35928] RTC::IceServer::AddTuple() | too too many tuples, removing the oldest non selected one +0ms
mediasoup:ERROR:Worker (stderr) (ABORT) RTC::IceServer::AddTuple() | failed assertion `removedTuple': couldn't find any tuple to be removed +0ms
```

However, I think it crashes because the code assumes that `MaxTuples` is > 1.
> What will the iterator point to? This will either crash, or destroy some extra tuple.

Why do you think it will crash? The comment above it clearly indicates what it does:

```cpp
// Notify the listener.
this->listener->OnIceServerTupleRemoved(this, removedTuple);

// Remove it from the list of tuples.
// NOTE: Do it after notifying the listener since the listener may need to
// use/read the tuple being removed so we cannot free it yet.
// NOTE: This trick is needed since it is a reverse_iterator and
// erase() requires a iterator, const_iterator or bidirectional_iterator.
this->tuples.erase(std::next(it).base());
```
> In `IceServer::RemoveTuple` it will stop short of recursively calling itself, because at that moment the connection is already closed; but nevertheless it'll dive into `webRtcTransportListener->OnWebRtcTransportTransportTupleRemoved` and `tuple->CloseTcpConnection`, which is rather unnecessary. It's better to block reentry into `IceServer::RemoveTuple` from any method that calls `OnIceServerTupleRemoved`, including `IceServer::RemoveTuple` itself.
I also expected this to happen, but for whatever reason it doesn't (or I couldn't reproduce it). But I agree that it makes sense to block reentry of `RemoveTuple()` while invoking the `listener->OnIceServerTupleRemoved()` callback. I'll do it soon.
@snnz I've done it in last commit, can you check?
> Why do you think it will crash?

Because `this->listener->OnIceServerTupleRemoved(this, removedTuple)` will, through all that chain, call `this->RemoveTuple(removedTuple)`, which will find and erase `removedTuple`.
> @snnz I've done it in last commit, can you check?
Everything is ok now, I believe this will fix the problem.
> In `IceServer::RemoveTuple` it will stop short of recursively calling itself, because at that moment the connection is already closed

While this PR fixes this, I still don't understand why (before this PR) we didn't get an infinite loop. You say that it will stop recursively calling itself because the connection is closed, but we do not check whether the connection is closed anywhere (if closed, it should be deleted or being deleted). Well, now that I think about it, I just removed that check in a recent related PR. Anyway, I think everything is ok now.
> but we do not check if the connection is closed anywhere

But there is a check, in `TcpConnectionHandle::TriggerClose`:

```cpp
void TcpConnectionHandle::TriggerClose()
{
	MS_TRACE();

	if (this->closed)
	{
		return;
	}

	InternalClose();

	this->listener->OnTcpConnectionClosed(this);
}
```
`InternalClose` sets the `closed` flag, thus the next time `TriggerClose` won't proceed and call `OnTcpConnectionClosed`.
True.
@miroslavpejic85 any ETA for you to test this branch? After the discussion above it's clear that these changes make sense and I'd like to release ASAP, but having your ok would be great.
@ibc Everything seems to be running smoothly now, without any worker crashes since yesterday. You can go ahead and release it. Thank you!
Thanks a lot!
You're welcome! And thank you for addressing the issue promptly ;)
Fixes #1392