Closed by marten-seemann 11 months ago
Shouldn't the `bufferedAmount` datachannel property be usable to prevent data loss? That is, when we want to close a channel, if its `bufferedAmount` property is non-zero we wait until it reaches zero before closing?*
From what I can see ~~Chrome seems to attempt to send all outstanding data on a channel before resetting it~~ (this is not true 🙄), ~~libdatachannel doesn't~~ (it does!), Firefox does, and I guess Pion doesn't?
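The drain-then-close idea above can be sketched as a small helper. This is a hypothetical sketch (the function and its parameters are not from any library); it polls `bufferedAmount` rather than relying on events, and it uses a structural interface so it also works against a mock. As the rest of the thread shows, this alone does not guarantee delivery in Chrome.

```typescript
// Minimal structural view of the RTCDataChannel surface we need,
// so the helper also works against a real channel or a test mock.
interface DrainableChannel {
  bufferedAmount: number;
  close(): void;
}

// Wait (by polling) until the channel's outgoing buffer is empty,
// then close it. pollMs and timeoutMs are illustrative defaults.
async function drainAndClose(
  ch: DrainableChannel,
  pollMs = 10,
  timeoutMs = 5000
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (ch.bufferedAmount > 0) {
    if (Date.now() > deadline) {
      throw new Error("timed out waiting for bufferedAmount to drain");
    }
    await new Promise((r) => setTimeout(r, pollMs));
  }
  ch.close();
}
```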
*= There's a gotcha here though - setting `bufferedAmountLowThreshold` to `0` as a cheap way to be notified when the outgoing message buffer has drained actually means `bufferedamountlow` events don't fire - https://developer.mozilla.org/en-US/docs/Web/API/RTCDataChannel/bufferedamountlow_event
No, after a bit of experimentation it seems that even when `bufferedAmount` is `0` data can still be lost, so it's not a solution.
> *= There's a gotcha here though - setting `bufferedAmountLowThreshold` to `0` as a cheap way to be notified when the outgoing message buffer has drained actually means `bufferedamountlow` events don't fire - https://developer.mozilla.org/en-US/docs/Web/API/RTCDataChannel/bufferedamountlow_event
In our testing, they still fire. Either the docs or the implementations are wrong here!
I think you're right - I've opened a PR against the MDN docs.
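For completeness, the event-based wait from the footnote looks roughly like this (a hypothetical helper, not library code; per the exchange above, a threshold of `0` does fire in practice, despite what the MDN docs said at the time):

```typescript
// Structural view of the channel surface the event-based wait needs.
interface LowEventChannel {
  bufferedAmount: number;
  bufferedAmountLowThreshold: number;
  onbufferedamountlow: (() => void) | null;
}

// Resolve once the outgoing buffer has drained, using the
// bufferedamountlow event instead of polling. The caller can then
// close the channel.
function drainViaEvent(ch: LowEventChannel): Promise<void> {
  return new Promise((resolve) => {
    if (ch.bufferedAmount === 0) return resolve();
    ch.bufferedAmountLowThreshold = 0;
    ch.onbufferedamountlow = () => {
      ch.onbufferedamountlow = null;
      resolve();
    };
  });
}
```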
So, fun times, I put a browser demo together here.
To change the settings, just edit the code and wait, it'll rebuild & run.
The demo:

- sends data until `bufferedAmount` gets above a configurable value, then waits for the buffer to drain before sending more data
- waits for `bufferedAmount` to be `0`, then closes the datachannel

It's necessary to not let the `bufferedAmount` grow too large because there are opaque send queue limits - from testing, Chrome has a max of 256 messages and Firefox 64k. These are message counts, not total byte lengths, so some guesswork is necessary.
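The demo's send loop with backpressure can be sketched like this (a hypothetical helper; the `maxBuffered` cap is an illustrative value that must stay well under the opaque per-implementation send-queue limits mentioned above):

```typescript
// Structural view of the channel surface the sender needs.
interface SendableChannel {
  bufferedAmount: number;
  bufferedAmountLowThreshold: number;
  onbufferedamountlow: (() => void) | null;
  send(data: string): void;
}

// Send all messages, pausing via the bufferedamountlow event whenever
// bufferedAmount exceeds the cap, so we never overrun the browser's
// opaque internal send queue.
async function sendWithBackpressure(
  ch: SendableChannel,
  messages: string[],
  maxBuffered = 16 * 1024 // illustrative cap
): Promise<void> {
  for (const msg of messages) {
    if (ch.bufferedAmount > maxBuffered) {
      ch.bufferedAmountLowThreshold = maxBuffered;
      await new Promise<void>((resolve) => {
        ch.onbufferedamountlow = () => {
          ch.onbufferedamountlow = null;
          resolve();
        };
      });
    }
    ch.send(msg);
  }
}
```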
- Chrome never seems to send the final few messages, even when we wait for its final `.bufferedAmount` to become `0` before closing the channel.
- Firefox always sends all messages when `.bufferedAmount` is `0` before closing the channel.
- If we don't drain the channel before closing, both Chrome & Firefox drop messages (change `drainBeforeClose` to see).
I've opened an issue against Chromium here: https://bugs.chromium.org/p/chromium/issues/detail?id=1484907
I've taken a stab at specifying this in #582 - please take a look and let me know if you spot anything obvious.
It's also being implemented in js-libp2p at https://github.com/libp2p/js-libp2p/pull/2074
Closing WebRTC datachannels is not straightforward. When closing a datachannel, it is instead reset, leading to data sent on that datachannel (potentially) being discarded. See RFC 8831: https://datatracker.ietf.org/doc/html/rfc8831#name-closing-a-data-channel
This has led to interop failures in go-libp2p, and delayed the release of WebRTC from the go-libp2p v0.31 release. See https://github.com/libp2p/go-libp2p/pull/2337#issuecomment-1696619731 for details.
The problem is that if we don't close a datachannel, the underlying WebRTC / SCTP stack will need to continue keeping track of that datachannel / stream, leading to a memory leak, since a WebRTC connection can use up to 64k streams per association. This is a known bug in rust-libp2p: https://github.com/libp2p/rust-libp2p/issues/4333.
How can we solve this?
This is annoying, and yet another time WebRTC's suboptimal stream state machine comes back to bite us (can't wait for WebRTC running on top of QUIC!).
First, here's a workaround: limit the total number of streams to a value lower than 64k, e.g. 500. This is roughly the number of concurrent streams that other stream multiplexers allow, and it allows for some initial interaction between two peers. Important: once that number is reached, close the underlying WebRTC connection (don't just return an error from the `NewStream` method). This might lead to some unacknowledged data being lost, but peers going away in the middle of a transfer is something that application protocols need to deal with anyway. Obviously, this is not ideal, but it's better than releasing code containing a significant memory leak.

Here's how we could actually solve this: assuming there's no better way to reliably close a stream, we could add some additional signaling at the libp2p layer. We already use a Protobuf to mirror parts of the QUIC state machine, and this would just be one additional message. Specifically, we'd need to add an acknowledgement for the FIN. Once the FIN-ACK has been received, a peer could actually close the underlying datachannel. We'll need to carefully check how this affects the send and the receive side of the stream.
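The proposed FIN / FIN-ACK exchange could look roughly like this. This is a sketch only: the flag names mirror the existing libp2p WebRTC Protobuf flags, but the `FIN_ACK` value, the `StreamState` shape, and both functions are illustrative assumptions, not the actual implementation.

```typescript
// Flags mirroring the libp2p WebRTC message Protobuf; FIN_ACK is the
// proposed addition (numeric values here are illustrative).
enum Flag {
  FIN = 0,
  STOP_SENDING = 1,
  RESET = 2,
  FIN_ACK = 3, // proposed
}

// Hypothetical per-stream close-side state.
interface StreamState {
  sentFin: boolean;
  receivedFinAck: boolean;
}

// The datachannel backing a stream may only be closed once we have
// sent our FIN and the peer has acknowledged it - otherwise the reset
// triggered by close() could discard data the peer hasn't received.
function canCloseDataChannel(s: StreamState): boolean {
  return s.sentFin && s.receivedFinAck;
}

// Handle an incoming control flag; returns a flag to send back, if any.
function onFlag(s: StreamState, f: Flag): Flag | null {
  switch (f) {
    case Flag.FIN:
      // Acknowledge the peer's FIN so it can safely close its side.
      return Flag.FIN_ACK;
    case Flag.FIN_ACK:
      s.receivedFinAck = true;
      return null;
    default:
      return null;
  }
}
```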