Open SgtPooki opened 11 months ago
I believe this is a blocker for https://github.com/ipfs/helia/issues/182 because:
@maschad are you actively working on this?
I'm not actively working on it at the moment @SgtPooki although I think @achingbrain 's PR https://github.com/libp2p/js-libp2p/pull/1947 may be related.
I think #1947 will help with the unstable bit but I can't help but wonder if there's some cleanup we need to do that we're missing to prevent the "Too many pending sessions" thing in the first place.
This may be a bug in Chrome.
When we forcibly close WebTransport connections whose .ready promise hasn't resolved within the connection timeout (the peer has gone away, is on a slow connection, is overloaded, is firewalled, etc), Chrome may not be cleaning up properly in the background, so we hit this limit.
More details here: https://bugs.chromium.org/p/chromium/issues/detail?id=1473980
I've tried to add a global count to the WebTransport transport to ensure we don't go over 64 "pending" connections, taking "pending" as meaning "has yet to resolve/reject the .ready/.closed promises" but it doesn't solve the problem.
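For reference, the kind of global count described above can be sketched roughly as follows. This is a minimal illustration, not the code that was actually tried: `PendingSessionCounter` and its methods are hypothetical names, and the session is reduced to an object exposing the `ready` and `closed` promises that a browser `WebTransport` instance has.

```typescript
// Hypothetical helper, not part of js-libp2p: counts sessions whose
// ready/closed promises have not settled yet.
class PendingSessionCounter {
  private readonly max: number
  private pending = 0

  constructor (max = 64) {
    this.max = max
  }

  get count (): number {
    return this.pending
  }

  // Track a session-like object; throws if the local limit is already reached.
  track (session: { ready: Promise<unknown>, closed: Promise<unknown> }): void {
    if (this.pending >= this.max) {
      throw new Error(`Too many pending sessions (${this.max})`)
    }

    this.pending++
    let released = false
    const release = (): void => {
      // Only decrement once, whichever promise settles first.
      if (!released) {
        released = true
        this.pending--
      }
    }

    // Either promise settling (resolve or reject) ends the "pending" state
    // from the page's point of view.
    session.ready.then(release, release)
    session.closed.then(release, release)
  }
}
```

A count like this only reflects what the page can observe, so it would not help if the browser keeps counting sessions after their promises have rejected.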
Counting the various WebTransport sessions that have been opened and what happened to them, it seems sessions that reject* their .ready/.closed promises are still counted as "pending".
Therefore regardless of any limit we set on how many connections we open simultaneously, once the number of errored connections plus the number of yet-to-resolve/reject connections reaches 65 no further WebTransport connections can be opened.
This is bad news and needs a browser fix because once 65 connections have errored it's essentially game over until the page is reloaded.
I've updated the chromium bug report with this information.
* = The rejection reasons are normal network things - an unreachable host, a handshake timeout, etc.
A comment on the Chromium bug links to this design doc - it seems Chromium unilaterally applies an anti-DOS measure by keeping "failed" connections in the "pending" state for 5 minutes after the failure.
This also seems to include sessions that have had their .close method called before .ready has resolved, which is how we cancel connections when (for example) dialling a peer on all available addresses and, once one dial succeeds, aborting all the other dials.
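The cancellation pattern mentioned here can be sketched as below. This is an illustrative sketch only, not the js-libp2p dialer: `dialFirst` and `SessionLike` are hypothetical names, and `open` is an assumed factory that in a browser could be something like `(url) => new WebTransport(url)`.

```typescript
// Minimal shape shared with the browser's WebTransport object.
interface SessionLike {
  ready: Promise<unknown>
  close: () => void
}

// Hypothetical helper: dial every address, keep the first session that
// becomes ready and close all the others.
function dialFirst<T extends SessionLike> (urls: string[], open: (url: string) => T): Promise<T> {
  const sessions = urls.map(open)

  return new Promise<T>((resolve, reject) => {
    if (sessions.length === 0) {
      reject(new Error('no addresses to dial'))
      return
    }

    let failures = 0
    let done = false

    sessions.forEach(session => {
      session.ready.then(() => {
        if (done) {
          return
        }
        done = true
        // Abort the other dials by closing them, most likely before their
        // ready promises have settled.
        for (const other of sessions) {
          if (other !== session) {
            other.close()
          }
        }
        resolve(session)
      }, () => {
        failures++
        if (!done && failures === sessions.length) {
          done = true
          reject(new Error('all dials failed'))
        }
      })
    })
  })
}
```

If the behaviour described in this thread holds, every session closed this way still occupies one of Chromium's pending slots, so a peer with many addresses can use up a large share of the limit in a single dial.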
This seriously limits the number of connections that can be opened over time.
The Chromium bug is still valid, I think - because the 5 minute delay does not seem to be applied, failed connections are "pending" ~forever~ maybe not forever, but for a lot longer than 5 minutes.
I've put a simple demo page together here that doesn't have any libp2p code in it - https://webtransport-pending-sessions.on.fleek.co/
We can use this to see if the issue has been resolved over time.
Interestingly Firefox does not apply the 5 minute wait though it does crash quite reliably.
I've tried adding a dial queue to the WebTransport transport that applies the 5 minute wait for new dials once 64 have errored, but we request dial slots quicker than the old ones time out so everything sort of grinds to a halt.
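To make the accounting behind that dial queue concrete, here is a rough sketch of a slot budget in which an errored session keeps occupying its slot for a hold period. It is not the code that was tried: `SessionBudget` is a hypothetical name, the 64 limit and 5 minute hold are taken from the discussion above, and the clock is injectable only so the behaviour can be checked without waiting.

```typescript
// Hypothetical sketch: a dial may start only while pending sessions plus
// recently errored sessions stay under the limit.
class SessionBudget {
  private readonly limit: number
  private readonly holdMs: number
  private readonly now: () => number
  private pending = 0
  private erroredAt: number[] = []

  constructor (limit = 64, holdMs = 5 * 60 * 1000, now: () => number = () => Date.now()) {
    this.limit = limit
    this.holdMs = holdMs
    this.now = now
  }

  // Slots in use: sessions still connecting plus errors younger than holdMs.
  occupied (): number {
    const cutoff = this.now() - this.holdMs
    this.erroredAt = this.erroredAt.filter(t => t > cutoff)
    return this.pending + this.erroredAt.length
  }

  // Returns true and reserves a slot if a new dial may start now.
  tryAcquire (): boolean {
    if (this.occupied() >= this.limit) {
      return false
    }
    this.pending++
    return true
  }

  // The session became ready, so its slot is free again.
  succeed (): void {
    if (this.pending > 0) {
      this.pending--
    }
  }

  // The session errored or was closed early; the slot stays held for holdMs.
  fail (): void {
    if (this.pending > 0) {
      this.pending--
      this.erroredAt.push(this.now())
    }
  }
}
```

With a budget like this, once dials fail faster than the hold period expires, `tryAcquire()` keeps returning `false`, which matches the "grinds to a halt" behaviour described above.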
We may be able to do something about this by increasing the auto dial retry threshold to something over 5 minutes. This should give Chromium enough time to reach its internal timeout, once the bug that means it never reaches that timeout is fixed.
Thanks for staying on top of this one and keeping us updated @achingbrain
@lidel and Javier from Igalia are working with the Chrome team to get a fix into Chrome. Firefox nightly does have WebTransport and seems to work.
link to test page: https://libp2p-webtransport-sessions.on.fleek.co/
Waiting on Igalia to submit a patch to Chrome that fixes this.
Notes from Igalia work stream (under various "Handling pending WebTransport sessions" headers): https://hackmd.io/SaJIHZmyRUKfl_fQwoYfog
Sometimes I see thousands of instances of this warning in Chrome:
WebTransport session establishment failed. Too many pending WebTransport sessions (64)
This module may need some sort of dial queue to ensure it doesn't open too many connections and trigger this error.
ported over from https://github.com/libp2p/js-libp2p-webtransport/issues/64