Closed · ibc closed this 4 months ago
@miroslavpejic85, I've not been able to reproduce the crash even by forcing things that IMHO should trigger the crash. Anyway, I think that this PR is gonna fix it.
Is it possible that you try this branch and confirm whether it solves the crash or not?
This flow is super complex but it's what it is:
When calling `webRtcServer.close()` having active TCP connections:

1. `~WebRtcServer` iterates all its WebRtcTransports and calls `webRtcTransport->ListenServerClosed()`.
2. `Transport::ListenServerClosed()` calls `this->listener->OnTransportListenServerClosed()`.
3. So `Router` deletes the transport.
4. `~WebRtcTransport` deletes the `IceServer`.
5. `~IceServer` iterates tuples and calls `this->listener->OnIceServerTupleRemoved(tuple)`.
6. So `WebRtcTransport` calls `tuple->CloseTcpConnection()`.
7. `TransportTuple::CloseTcpConnection()` calls `tcpConnection->TriggerClose()`.
8. So `TcpConnectionHandle` calls `this->listener->OnTcpConnectionClosed()`.
9. So `TcpServerHandle` calls `UserOnTcpConnectionClosed()`, which calls `this->listener->OnRtcTcpConnectionClosed()` [A], and then deletes the `TcpConnectionHandle`.

The problem is [A], since it ends up calling `iceServer->RemoveTuple()`, which calls `listener->OnIceServerTupleRemoved()` again with the same tuple.
> Is it possible that you try this branch and confirm whether it solves the crash or not?
Sure. I need to rebuild the worker on this branch and give it a try, right?

```shell
cd worker; MEDIASOUP_BUILDTYPE='Debug' make
```
Yes. Or just add a git URL pointing to this branch as the mediasoup dependency in the `package.json` of your server app and run `npm install mediasoup`.
> When calling `webRtcServer.close()` having active TCP connections:

Note: the crash occurs despite not utilizing the `WebRtcServer`, but rather relying on the default `WebRtcTransportOptions` for the transports.
The fix in this PR is not related to `WebRtcServer`, so it may be valid (assuming it doesn't crash anymore for you once you try the branch).
Great, then I'll try the fix and keep you updated.
> 5. `~IceServer` iterates tuples and calls `this->listener->OnIceServerTupleRemoved(tuple)`.

I think this can happen wherever `OnIceServerTupleRemoved` is called with a tuple containing a TCP connection that is not closed yet.

In `IceServer::AddTuple`, when the tuples size exceeds `MaxTuples`:

```cpp
this->listener->OnIceServerTupleRemoved(this, removedTuple);

this->tuples.erase(std::next(it).base());
```

What will the iterator point to? This will either crash, or destroy some extra tuple.

In `IceServer::RemoveTuple` it will stop short of recursively calling itself, because at that moment the connection is already closed; but nevertheless it'll dive into `webRtcTransportListener->OnWebRtcTransportTransportTupleRemoved` and `tuple->CloseTcpConnection`, which is rather unnecessary. It's better to block reentry into `IceServer::RemoveTuple` from any method that calls `OnIceServerTupleRemoved`, including `IceServer::RemoveTuple` itself.
> > 5. `~IceServer` iterates tuples and calls `this->listener->OnIceServerTupleRemoved(tuple)`.
>
> I think this can happen wherever `OnIceServerTupleRemoved` is called with a tuple containing a TCP connection that is not closed yet. In `IceServer::AddTuple`, when the tuples size exceeds `MaxTuples`:
>
> ```cpp
> this->listener->OnIceServerTupleRemoved(this, removedTuple);
> this->tuples.erase(std::next(it).base());
> ```
>
> What will the iterator point to? This will either crash, or destroy some extra tuple.
It's indeed crashing, but not sure why. This was done and tested long ago; not sure what has changed. I've just set `static constexpr size_t MaxTuples{ 1 };` in `IceServer.cpp`, connected a client with TCP and triggered an ICE restart (mediasoup demo):

```
mediasoup:Channel request() [method:TRANSPORT_RESTART_ICE] +7s
mediasoup:Channel request succeeded [method:TRANSPORT_RESTART_ICE, id:57] +0ms
mediasoup:WARN:Channel [pid:35928] RTC::IceServer::AddTuple() | too too many tuples, removing the oldest non selected one +0ms
mediasoup:ERROR:Worker (stderr) (ABORT) RTC::IceServer::AddTuple() | failed assertion `removedTuple': couldn't find any tuple to be removed +0ms
```

However, I think it crashes because the code assumes that `MaxTuples` is > 1.
> What will the iterator point to? This will either crash, or destroy some extra tuple.

Why do you think it will crash? The comment above it clearly indicates what it does:

```cpp
// Notify the listener.
this->listener->OnIceServerTupleRemoved(this, removedTuple);

// Remove it from the list of tuples.
// NOTE: Do it after notifying the listener since the listener may need to
// use/read the tuple being removed so we cannot free it yet.
// NOTE: This trick is needed since it is a reverse_iterator and
// erase() requires a iterator, const_iterator or bidirectional_iterator.
this->tuples.erase(std::next(it).base());
```
> In `IceServer::RemoveTuple` it will stop short of recursively calling itself, because at that moment the connection is already closed; but nevertheless it'll dive into `webRtcTransportListener->OnWebRtcTransportTransportTupleRemoved` and `tuple->CloseTcpConnection`, which is rather unnecessary. It's better to block reentry into `IceServer::RemoveTuple` from any method that calls `OnIceServerTupleRemoved`, including `IceServer::RemoveTuple` itself.
I also expected this to happen, but for whatever reason it doesn't (or I couldn't reproduce it). But I agree that it makes sense to block reentry of `RemoveTuple()` while invoking the `listener->OnIceServerTupleRemoved()` callback. I'll do it soon.
@snnz I've done it in last commit, can you check?
> Why do you think it will crash?

Because `this->listener->OnIceServerTupleRemoved(this, removedTuple)` will, through all that chain, call `this->RemoveTuple(removedTuple)`, which will find and erase `removedTuple`.
> @snnz I've done it in last commit, can you check?
Everything is ok now, I believe this will fix the problem.
> In `IceServer::RemoveTuple` it will stop short of recursively calling itself, because at that moment the connection is already closed

While this PR fixes this, I still don't understand why (before this PR) we didn't get an infinite loop. You say that it will stop recursively calling itself because the connection is closed, but we do not check whether the connection is closed anywhere (if closed, it should be deleted or being deleted). Well, now that I think about it, I just removed that check in a recent related PR. Anyway, I think everything is ok now.
> but we do not check if the connection is closed anywhere

But there is a check, in `TcpConnectionHandle::TriggerClose`:

```cpp
void TcpConnectionHandle::TriggerClose()
{
	MS_TRACE();

	if (this->closed)
	{
		return;
	}

	InternalClose();

	this->listener->OnTcpConnectionClosed(this);
}
```
`InternalClose` sets the `closed` flag, thus the next time `TriggerClose` won't proceed and call `OnTcpConnectionClosed`.
True.
@miroslavpejic85 any ETA for you to test this branch? After the discussion above it's clear that these changes make sense and I'd like to release ASAP, but having your ok would be great.
@ibc Everything seems to be running smoothly now, without any worker crashes since yesterday. You can go ahead and release it. Thank you!
Thanks a lot!
You're welcome! And thank you for addressing the issue promptly ;)
Fixes #1392