versatica / mediasoup

Cutting Edge WebRTC Video Conferencing
https://mediasoup.org
ISC License

Add server-side ICE consent checks to detect silent WebRTC disconnection #1332

Closed. ibc closed this 9 months ago.

ibc commented 9 months ago

Details

TODO

ibc commented 9 months ago

The only concern I have is that the application will receive a 'disconnected' event, but it has no means of knowing the reason, and that's IMO important.

I think a 'reason' argument should be added to the notification from C++ to Node/Rust so in this case it would be something like ConsentCheckExpired and the corresponding ones in other cases.

A reason argument where? This is what we have now:

this.safeEmit('icestatechange', iceState);

You mean adding a reason argument to that event in case iceState === 'disconnected'?
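
Purely as an illustration of what such a reason argument could look like at the Node layer (FakeWebRtcTransport and the reason values below are hypothetical, not part of the mediasoup API):

import { EventEmitter } from 'node:events';

// Hypothetical reason values, for illustration only.
type IceDisconnectReason = 'consent-check-expired' | 'tcp-connection-closed';

class FakeWebRtcTransport extends EventEmitter {
  // Sketch: forward the ICE state plus an optional reason to the application.
  notifyIceState(iceState: string, reason?: IceDisconnectReason): void {
    if (iceState === 'disconnected' && reason) {
      this.emit('icestatechange', iceState, reason);
    } else {
      this.emit('icestatechange', iceState);
    }
  }
}

const transport = new FakeWebRtcTransport();

transport.on('icestatechange', (iceState: string, reason?: IceDisconnectReason) => {
  console.log('icestatechange:', iceState, reason ?? '(no reason)');
});

transport.notifyIceState('disconnected', 'consent-check-expired');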

jmillan commented 9 months ago

You mean adding a reason argument to that event in case iceState === 'disconnected'?

Yes, for example.

The problem is not knowing the reason for the 'disconnected' state, now that it can also be due to this consent check timeout.

ibc commented 9 months ago

The problem is not knowing the reason for the 'disconnected' state, now that it can also be due to this consent check timeout.

NOTE: if the WebRTC client disconnects its side by closing the PeerConnection or by closing the browser tab, a DTLS close alert reaches mediasoup first and the "dtlsstatechange" event (state "closed") fires BEFORE any ICE state change. So if that happens the app should already be closing the transport. Basically the "icestatechange" event with state "disconnected" ONLY happens if the consent timeout fires or if the client is connected via TCP and the TCP connection is abruptly closed, which in both cases means that the network is down.

Taking this into account, I think we don't need any "reason" argument.
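
For illustration, a minimal application-side sketch of that reasoning, assuming the mediasoup Node API ('dtlsstatechange' and 'icestatechange' transport events plus transport.close()); handleTransportEvents is a made-up helper, not something from this PR:

import { types as mediasoupTypes } from 'mediasoup';

// Sketch: a clean client shutdown surfaces as DTLS "closed" first, so any ICE
// "disconnected" that still reaches the app can be treated as "network is down".
function handleTransportEvents(transport: mediasoupTypes.WebRtcTransport): void {
  transport.on('dtlsstatechange', (dtlsState) => {
    if (dtlsState === 'closed') {
      // Client closed the PeerConnection or the browser tab (DTLS close alert received).
      transport.close();
    }
  });

  transport.on('icestatechange', (iceState) => {
    if (iceState === 'disconnected') {
      // Consent timeout or abrupt TCP close: the network is down, no reason argument needed.
      transport.close();
    }
  });
}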

ibc commented 9 months ago

Also note that, before this feature, "icestatechange" with state "disconnected" could ONLY happen if the client was connected via TCP and the connection was abruptly closed without a DTLS close alert being sent first. So I strongly think we don't need any "reason" argument.

jmillan commented 9 months ago

Ok, we'll see if there is a need when troubleshooting. Not a blocker anyhow.

ibc commented 9 months ago

This is done, Rust included.

ibc commented 9 months ago

Hell, demo crashed:

2024-02-18T13:55:03.784Z mediasoup-demo-server:WARN:Room WebRtcTransport "icestatechange" event [iceState:disconnected], closing peer

2024-02-18T13:55:03.793Z mediasoup-demo-server:WARN:Room WebRtcTransport "icestatechange" event [iceState:disconnected], closing peer

2024-02-18T13:55:04.824Z mediasoup:ERROR:Channel Producer Channel error: Error: read ECONNRESET
2024-02-18T13:55:04.825Z mediasoup:ERROR:Worker worker process died unexpectedly [pid:1580729, code:null, signal:SIGSEGV]
2024-02-18T13:55:04.876Z mediasoup-demo-server:ERROR mediasoup Worker died, exiting  in 2 seconds... [pid:1580729]
2024-02-18T13:55:04.877Z mediasoup-demo-server:WARN:Room _createConsumer() | failed:InvalidStateError: Channel closed, pending request aborted [method:CONSUMER_RESUME, 

nazar-pc commented 9 months ago

Is it just me, or is everyone else also fed up with memory-related issues in C++?

ibc commented 9 months ago

Ok, I got a core dump in the demo server:

Program terminated with signal SIGSEGV, Segmentation fault.

#0  0x0000558176a1618b in RTC::StunPacket::CheckAuthentication(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
#1  0x0000558176975be1 in RTC::IceServer::ProcessStunPacket(RTC::StunPacket*, RTC::TransportTuple*) ()
#2  0x0000558176a43909 in RTC::WebRtcServer::OnUdpSocketPacketReceived(RTC::UdpSocket*, unsigned char const*, unsigned long, sockaddr const*) ()
#3  0x0000558176d47515 in uv.udp_recvmmsg ()
#4  0x0000558176d47f6b in uv.udp_io ()
#5  0x0000558176d4b469 in uv.io_poll ()
#6  0x0000558176d3eb2f in uv_run ()
#7  0x00005581768f6a69 in DepLibUV::RunLoop() ()
#8  0x0000558176900f82 in Worker::Worker(Channel::ChannelSocket*) ()
#9  0x00005581768f5b04 in mediasoup_worker_run ()
#10 0x00005581768f4c4c in main ()

ibc commented 9 months ago

It's crashing in the StunPacket::CheckAuthentication() method.

I did make changes in that method, of course: we used to only verify that the first part of the USERNAME attribute in the received STUN packet matches our server-side ICE username fragment plus a colon. But now we also have (or may have) the remote username fragment, so we match the received USERNAME attribute against localICEusernameFrag:remoteICEusernameFrag as per the spec. I've verified that it works fine whether or not the remote provided the mediasoup WebRtcTransport with its remote ICE usernameFragment/password. So no idea what is crashing.

NOTE that in this PR we are also calling packet->CheckAuthentication() on STUN responses received from the client. So we don't know whether the crash happens when receiving a STUN request or a STUN response.
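
For reference, a simplified TypeScript sketch of the intended USERNAME matching (the real logic is the C++ StunPacket::CheckAuthentication() method; checkStunUsername is a made-up helper):

// Simplified sketch of the USERNAME check described above.
function checkStunUsername(
  receivedUsername: string,        // USERNAME attribute of the incoming STUN packet
  localUsernameFragment: string,
  remoteUsernameFragment?: string  // may be unknown if the app never provided it
): boolean {
  if (remoteUsernameFragment) {
    // Full match as per the ICE spec: "localUfrag:remoteUfrag".
    return receivedUsername === `${localUsernameFragment}:${remoteUsernameFragment}`;
  }

  // Remote fragment unknown: only verify the "localUfrag:" prefix, as before.
  return receivedUsername.startsWith(`${localUsernameFragment}:`);
}

// Both forms should be accepted depending on what the app provided.
console.log(checkStunUsername('srvUfrag:cliUfrag', 'srvUfrag', 'cliUfrag')); // true
console.log(checkStunUsername('srvUfrag:cliUfrag', 'srvUfrag'));             // true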

ibc commented 9 months ago

I'm refactoring C++ code a bit to have better error traces.

ibc commented 9 months ago

Wow, another crash in the demo and this is NOT related to STUN...

Program terminated with signal SIGSEGV, Segmentation fault.

#0  0x0000563fbc0e96f6 in RTC::WebRtcTransport::OnRtcpDataReceived(RTC::TransportTuple*, unsigned char const*, unsigned long) ()
[Current thread is 1 (Thread 0x7fb43c53e440 (LWP 1643202))]
(gdb) bt
#0  0x0000563fbc0e96f6 in RTC::WebRtcTransport::OnRtcpDataReceived(RTC::TransportTuple*, unsigned char const*, unsigned long) ()
#1  0x0000563fbc0dbc9f in RTC::WebRtcServer::OnUdpSocketPacketReceived(RTC::UdpSocket*, unsigned char const*, unsigned long, sockaddr const*) ()
#2  0x0000563fbc3df515 in uv.udp_recvmmsg ()
#3  0x0000563fbc3dff6b in uv.udp_io ()
#4  0x0000563fbc3e3469 in uv.io_poll ()
#5  0x0000563fbc3d6b2f in uv_run ()
#6  0x0000563fbbf8ea69 in DepLibUV::RunLoop() ()
#7  0x0000563fbbf98f82 in Worker::Worker(Channel::ChannelSocket*) ()
#8  0x0000563fbbf8db04 in mediasoup_worker_run ()
#9  0x0000563fbbf8cc4c in main ()

However, the ICE disconnection happened before:

2024-02-19T11:36:27.220Z mediasoup-demo-server:WARN:Room WebRtcTransport "icestatechange" event [iceState:disconnected], closing peer
2024-02-19T11:36:35.394Z mediasoup-demo-server:WARN:Room WebRtcTransport "icestatechange" event [iceState:disconnected], closing peer
2024-02-19T11:36:35.485Z mediasoup:ERROR:Worker worker process died unexpectedly [pid:1643202, code:null, signal:SIGSEGV]

ibc commented 9 months ago

Ok, I've removed the sending of ICE requests from mediasoup; it now monitors the ICE consent requests received from the client instead.

CC @ggarber @fippo
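
Conceptually, the monitoring boils down to a timer that is restarted on every received ICE request; a minimal sketch, assuming a 30 second timeout for illustration (the real implementation lives in the C++ IceServer and IceConsentMonitor is a made-up class):

import { EventEmitter } from 'node:events';

// Restart a timer on every received ICE (STUN Binding) request and report
// disconnection when none arrives within the timeout.
class IceConsentMonitor extends EventEmitter {
  private timer?: ReturnType<typeof setTimeout>;

  constructor(private readonly timeoutMs = 30000) {
    super();
  }

  // Call this whenever a valid ICE request is received on the selected tuple.
  onIceRequestReceived(): void {
    if (this.timer) {
      clearTimeout(this.timer);
    }

    this.timer = setTimeout(() => this.emit('consentexpired'), this.timeoutMs);
  }

  close(): void {
    if (this.timer) {
      clearTimeout(this.timer);
    }
  }
}

// Usage sketch: the owner would map 'consentexpired' to iceState 'disconnected'.
const monitor = new IceConsentMonitor();

monitor.on('consentexpired', () => console.log("iceState -> 'disconnected'"));
monitor.onIceRequestReceived();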

ibc commented 9 months ago

Wow, another crash in the demo and this is NOT related to STUN... (the SIGSEGV in RTC::WebRtcTransport::OnRtcpDataReceived(), with the backtrace and logs quoted in the previous comment)

And this has been fixed. The problem was that I was deallocating a tuple before notifying the listener, which was then calling methods on the deallocated instance.

ibc commented 9 months ago

The new stuff in FuzzerStunPacket.cpp is also passing the fuzzer.

ibc commented 9 months ago

I see a core dump in the demo but I really fail to correlate it with this PR:

Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
44  ./nptl/pthread_kill.c: No such file or directory.
[Current thread is 1 (Thread 0x7f0774b3e440 (LWP 1695953))]
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x00007f0774bcad9f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007f0774b7bf32 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007f0774b66472 in __GI_abort () at ./stdlib/abort.c:79
#4  0x000055b064e4bd6e in RTC::RTCP::FeedbackRtpTransportPacket::AddPacket(unsigned short, unsigned long, unsigned long) ()
#5  0x000055b064dd9ade in RTC::TransportCongestionControlServer::FillAndSendTransportCcFeedback() ()
#6  0x000055b0650dcf3e in uv.run_timers ()
#7  0x000055b0650e0c88 in uv_run ()
#8  0x000055b064c9a849 in DepLibUV::RunLoop() ()
#9  0x000055b064ca4d62 in Worker::Worker(Channel::ChannelSocket*) ()
#10 0x000055b064c998e4 in mediasoup_worker_run ()
#11 0x000055b064c98a2c in main ()

Hell, the problem is in FeedbackRtpTransport::AddPacket(): an assertion is failing. I'll open a separate issue for this:

2024-02-20T07:08:16.181Z mediasoup:ERROR:Worker (stderr) (ABORT) RTC::RTCP::FeedbackRtpTransport::AddPacket() | failed assertion `baseSet': base not set
2024-02-20T07:08:16.569Z mediasoup:ERROR:Channel Producer Channel error: Error: read ECONNRESET
2024-02-20T07:08:16.574Z mediasoup-demo-server:WARN:Room _createDataConsumer() | failed:Error: request timeout at Timeout._onTimeout (/var/www/v3demo.mediasoup.org/node_modules/protoo-server/lib/Peer.js:156:14) at listOnTimeout (node:internal/timers:569:17) at process.processTimers (node:internal/timers:512:7)
2024-02-20T07:08:16.784Z mediasoup:ERROR:Worker worker process died unexpectedly [pid:1695953, code:null, signal:SIGABRT]
2024-02-20T07:08:16.823Z mediasoup-demo-server:ERROR mediasoup Worker died, exiting  in 2 seconds... [pid:1695953]

ggarber commented 9 months ago

And this has been fixed. The problem was that I was deallocating a tuple before notifying the listener, which was then calling methods on the deallocated instance.

Consider not destroying/deallocating anything. Just notify the application about "disconnected" state and let the application decide what to do.

Similarly, on the browser side, the ICE "disconnected" state does not destroy anything, and the connection can recover if the user decides to keep it open after that event.

I think it would simplify the logic and avoid some of the issues you are debugging here. Feel free to ignore it.

ibc commented 9 months ago

Consider not destroying/deallocating anything. Just notify the application about "disconnected" state and let the application decide what to do.

Similarly, on the browser side, the ICE "disconnected" state does not destroy anything, and the connection can recover if the user decides to keep it open after that event.

Perhaps I was not clear, but I didn't mean that. We are not deallocating the WebRtcTransport when ICE is disconnected, not even when the remote sends us a DTLS Close Alert. We just notify the app about those events.

The bug I mean was an internal C++ thing, fixed in this commit, and has nothing to do with communication with Node or Rust: https://github.com/versatica/mediasoup/pull/1332/commits/7ee17538c4419f70133660cd085b5bcac5254be3