Closed by marten-seemann 11 months ago
Shouldn't the `bufferedAmount` datachannel property be usable to prevent data loss? That is, when we want to close a channel, if its `bufferedAmount` property is non-zero we wait until it reaches zero before closing?*
From what I can see ~~Chrome seems to attempt to send all outstanding data on a channel before resetting it~~ (this is not true 🙄), ~~libdatachannel doesn't~~ (it does!), Firefox does, and I guess Pion doesn't?
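The drain-then-close idea above can be sketched as a small helper. This is a hypothetical sketch (the function and its parameters are not from any library); it polls `bufferedAmount` rather than relying on events, and it uses a structural interface so it also works against a mock. As the rest of the thread shows, this alone does not guarantee delivery in Chrome.

```typescript
// Minimal structural view of the RTCDataChannel surface we need,
// so the helper also works against a real channel or a test mock.
interface DrainableChannel {
  bufferedAmount: number;
  close(): void;
}

// Wait (by polling) until the channel's outgoing buffer is empty,
// then close it. pollMs and timeoutMs are illustrative defaults.
async function drainAndClose(
  ch: DrainableChannel,
  pollMs = 10,
  timeoutMs = 5000
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (ch.bufferedAmount > 0) {
    if (Date.now() > deadline) {
      throw new Error("timed out waiting for bufferedAmount to drain");
    }
    await new Promise((r) => setTimeout(r, pollMs));
  }
  ch.close();
}
```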
*= There's a gotcha here though - setting `bufferedAmountLowThreshold` to `0` as a cheap way to be notified when the outgoing message buffer has drained actually means `bufferedamountlow` events don't fire - https://developer.mozilla.org/en-US/docs/Web/API/RTCDataChannel/bufferedamountlow_event
No, after a bit of experimentation it seems that even when `bufferedAmount` is `0` data can still be lost, so it's not a solution.
> *= There's a gotcha here though - setting `bufferedAmountLowThreshold` to `0` as a cheap way to be notified when the outgoing message buffer has drained actually means `bufferedamountlow` events don't fire - https://developer.mozilla.org/en-US/docs/Web/API/RTCDataChannel/bufferedamountlow_event
In our testing, they still fire. Either the docs or the implementations are wrong here!
I think you're right - I've opened a PR against the MDN docs.
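For completeness, the event-based wait from the footnote looks roughly like this (a hypothetical helper, not library code; per the exchange above, a threshold of `0` does fire in practice, despite what the MDN docs said at the time):

```typescript
// Structural view of the channel surface the event-based wait needs.
interface LowEventChannel {
  bufferedAmount: number;
  bufferedAmountLowThreshold: number;
  onbufferedamountlow: (() => void) | null;
}

// Resolve once the outgoing buffer has drained, using the
// bufferedamountlow event instead of polling. The caller can then
// close the channel.
function drainViaEvent(ch: LowEventChannel): Promise<void> {
  return new Promise((resolve) => {
    if (ch.bufferedAmount === 0) return resolve();
    ch.bufferedAmountLowThreshold = 0;
    ch.onbufferedamountlow = () => {
      ch.onbufferedamountlow = null;
      resolve();
    };
  });
}
```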
So, fun times, I put a browser demo together here.
To change the settings, just edit the code and wait, it'll rebuild & run.
The demo:

- sends data until `bufferedAmount` gets above a configurable value, then waits for the buffer to drain before sending more data
- waits for `bufferedAmount` to be `0`, then closes the datachannel

It's necessary to not let the `bufferedAmount` grow too large because there are opaque send queue limits - from testing, Chrome has a max of 256 messages and Firefox 64k. These are message counts, not total byte lengths, so some guesswork is necessary.
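The demo's send loop with backpressure can be sketched like this (a hypothetical helper; the `maxBuffered` cap is an illustrative value that must stay well under the opaque per-implementation send-queue limits mentioned above):

```typescript
// Structural view of the channel surface the sender needs.
interface SendableChannel {
  bufferedAmount: number;
  bufferedAmountLowThreshold: number;
  onbufferedamountlow: (() => void) | null;
  send(data: string): void;
}

// Send all messages, pausing via the bufferedamountlow event whenever
// bufferedAmount exceeds the cap, so we never overrun the browser's
// opaque internal send queue.
async function sendWithBackpressure(
  ch: SendableChannel,
  messages: string[],
  maxBuffered = 16 * 1024 // illustrative cap
): Promise<void> {
  for (const msg of messages) {
    if (ch.bufferedAmount > maxBuffered) {
      ch.bufferedAmountLowThreshold = maxBuffered;
      await new Promise<void>((resolve) => {
        ch.onbufferedamountlow = () => {
          ch.onbufferedamountlow = null;
          resolve();
        };
      });
    }
    ch.send(msg);
  }
}
```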
- Chrome never seems to send the final few messages, even when we wait for its final `.bufferedAmount` to become `0` before closing the channel.
- Firefox always sends all messages when `.bufferedAmount` is `0` before closing the channel.
- If we don't drain the channel before closing, both Chrome & Firefox drop messages (change `drainBeforeClose` to see).
I've opened an issue against Chromium here: https://bugs.chromium.org/p/chromium/issues/detail?id=1484907
I've taken a stab at specifying this in #582 - please take a look and let me know if you spot anything obvious.
It's also being implemented in js-libp2p at https://github.com/libp2p/js-libp2p/pull/2074
Closing WebRTC datachannels is not straightforward. When closing a datachannel, it is instead reset, leading to data sent on that datachannel (potentially) being discarded. See RFC 8831: https://datatracker.ietf.org/doc/html/rfc8831#name-closing-a-data-channel
This has led to interop failures in go-libp2p, and delayed the release of WebRTC from the go-libp2p v0.31 release. See https://github.com/libp2p/go-libp2p/pull/2337#issuecomment-1696619731 for details.
The problem is that if we don't close a datachannel, the underlying WebRTC / SCTP stack will need to continue keeping track of that datachannel / stream, leading to a memory leak, since a WebRTC connection can use up to 64k streams per association. This is a known bug in rust-libp2p: https://github.com/libp2p/rust-libp2p/issues/4333.
How can we solve this?
This is annoying, and yet another time WebRTC's suboptimal stream state machine comes back to bite us (can't wait for WebRTC running on top of QUIC!).
First, here's a workaround: limit the total number of streams to a value lower than 64k, e.g. 500. This is roughly the number of concurrent streams that other stream multiplexers allow, and it allows for some initial interaction between two peers. Important: once that number is reached, close the underlying WebRTC connection (don't just return an error from the `NewStream` method). This might lead to some unacknowledged data being lost, but peers going away in the middle of a transfer is something that application protocols need to deal with anyway. Obviously, this is not ideal, but it's better than releasing code containing a significant memory leak.

Here's how we could actually solve this: assuming there's no better way to reliably close a stream, we could add some additional signaling at the libp2p layer. We already use a Protobuf to mirror parts of the QUIC state machine, and this would just be one additional message. Specifically, we'd need to add an acknowledgement for the FIN. Once the FIN-ACK has been received, a peer could actually close the underlying datachannel. We'll need to carefully check how this affects the send and the receive side of the stream.
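The proposed FIN / FIN-ACK exchange could look roughly like this. This is a sketch only: the flag names mirror the existing libp2p WebRTC Protobuf flags, but the `FIN_ACK` value, the `StreamState` shape, and both functions are illustrative assumptions, not the actual implementation.

```typescript
// Flags mirroring the libp2p WebRTC message Protobuf; FIN_ACK is the
// proposed addition (numeric values here are illustrative).
enum Flag {
  FIN = 0,
  STOP_SENDING = 1,
  RESET = 2,
  FIN_ACK = 3, // proposed
}

// Hypothetical per-stream close-side state.
interface StreamState {
  sentFin: boolean;
  receivedFinAck: boolean;
}

// The datachannel backing a stream may only be closed once we have
// sent our FIN and the peer has acknowledged it - otherwise the reset
// triggered by close() could discard data the peer hasn't received.
function canCloseDataChannel(s: StreamState): boolean {
  return s.sentFin && s.receivedFinAck;
}

// Handle an incoming control flag; returns a flag to send back, if any.
function onFlag(s: StreamState, f: Flag): Flag | null {
  switch (f) {
    case Flag.FIN:
      // Acknowledge the peer's FIN so it can safely close its side.
      return Flag.FIN_ACK;
    case Flag.FIN_ACK:
      s.receivedFinAck = true;
      return null;
    default:
      return null;
  }
}
```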