libp2p / js-libp2p

The JavaScript implementation of the libp2p networking stack.
https://libp2p.io

bug: WebTransport session establishment failed. Too many pending WebTransport sessions (64) #1896

Open SgtPooki opened 11 months ago

SgtPooki commented 11 months ago

Sometimes I see thousands of instances of this warning in Chrome:

WebTransport session establishment failed. Too many pending WebTransport sessions (64)

This module may need some sort of dial queue to ensure it doesn't open too many connections and trigger this error.

ported over from https://github.com/libp2p/js-libp2p-webtransport/issues/64

SgtPooki commented 11 months ago

I believe this is a blocker for https://github.com/ipfs/helia/issues/182 because:

  1. webtransport is one of the few consistent ways we can connect to other nodes from the browser.
  2. A Helia node's connectivity becomes unstable once we have too many pending dials.

@maschad are you actively working on this?

maschad commented 11 months ago

I'm not actively working on it at the moment @SgtPooki although I think @achingbrain 's PR https://github.com/libp2p/js-libp2p/pull/1947 may be related.

achingbrain commented 11 months ago

I think #1947 will help with the unstable bit but I can't help but wonder if there's some cleanup we need to do that we're missing to prevent the "Too many pending sessions" thing in the first place.

achingbrain commented 10 months ago

This may be a bug in Chrome.

When we forcibly close WebTransport connections whose .ready promise hasn't resolved within the connection timeout (the peer has gone away, is on a slow connection, is overloaded, is firewalled, etc), Chrome may not be cleaning up properly in the background, so we hit this limit.
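For anyone following along, the pattern described above looks roughly like the sketch below. It is not the js-libp2p code: `waitForReady` and `SessionLike` are hypothetical names, and the interface only captures the two members of a WebTransport session used here (`ready` and `close`).

```typescript
// Minimal shape of the parts of a WebTransport session used in this sketch.
// A browser WebTransport object satisfies it.
interface SessionLike {
  ready: Promise<unknown>
  close: () => void
}

// Hypothetical helper: wait for `ready`, and close the session if it fails
// or does not become ready within `timeoutMs`.
async function waitForReady<T extends SessionLike> (session: T, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => {
      reject(new Error(`session not ready after ${timeoutMs}ms`))
    }, timeoutMs)
  })

  try {
    await Promise.race([session.ready, timeout])
    return session
  } catch (err) {
    // Forcible close. Sessions closed this way are the ones that still
    // seem to be counted as "pending" by Chrome.
    session.close()
    throw err
  } finally {
    clearTimeout(timer)
  }
}
```

In a browser this would be called with something like `waitForReady(new WebTransport(url), timeout)`, where `url` and `timeout` come from the dialer.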

More details here: https://bugs.chromium.org/p/chromium/issues/detail?id=1473980

achingbrain commented 10 months ago

I've tried to add a global count to the WebTransport transport to ensure we don't go over 64 "pending" connections, taking "pending" as meaning "has yet to resolve/reject the .ready/.closed promises" but it doesn't solve the problem.

Counting the various WebTransport sessions that have been opened and what happened to them, it seems sessions that reject* their .ready/.closed promises are still counted as "pending".

Therefore, regardless of any limit we set on how many connections we open simultaneously, once the number of errored connections plus the number of yet-to-resolve/reject connections reaches 65, no further WebTransport connections can be opened.

This is bad news and needs a browser fix because once 65 connections have errored it's essentially game over until the page is reloaded.
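To make the accounting concrete, here is a toy model of the behaviour observed above. It is only a sketch of what the counts suggest, not Chrome's implementation, and `PendingSessionModel` is a made-up name. The limit of 64 is taken from the warning message.

```typescript
// Hypothetical model of the observed accounting, for illustration only.
class PendingSessionModel {
  private inFlight = 0 // .ready/.closed have not settled yet
  private errored = 0 // .ready/.closed rejected
  private readonly limit: number

  constructor (limit = 64) {
    this.limit = limit
  }

  // What the browser appears to count as "pending"
  get pending (): number {
    return this.inFlight + this.errored
  }

  // Returns false where the browser would refuse with
  // "Too many pending WebTransport sessions"
  open (): boolean {
    if (this.pending >= this.limit) {
      return false
    }
    this.inFlight++
    return true
  }

  // .ready resolved: the session stops counting as pending
  established (): void {
    this.inFlight--
  }

  // .ready/.closed rejected: the session still seems to count as pending
  failed (): void {
    this.inFlight--
    this.errored++
  }
}

const model = new PendingSessionModel()
for (let i = 0; i < 64; i++) {
  model.open()
  model.failed()
}
// Nothing is in flight any more, yet the next open() is refused
const canOpenAnother = model.open()
```

In this model a cap on simultaneous dials only bounds `inFlight`, which is why the global count did not help: `errored` keeps growing until the budget is used up.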

I've updated the chromium bug report with this information.


* = The rejection reasons are normal network things - an unreachable host, a handshake timeout, etc.

achingbrain commented 10 months ago

A comment on the Chromium bug links to this design doc - it seems Chromium unilaterally applies an anti-DoS measure by keeping "failed" connections in the "pending" state for 5 minutes after the failure.

This also seems to include sessions that have had their .close method called before .ready has resolved. That is how we cancel connections when, for example, we dial a peer on all available addresses and abort the remaining dials as soon as one succeeds.

This seriously limits the number of connections that can be opened over time.

The Chromium bug is still valid, I think - because the 5 minute delay does not seem to be applied, failed connections are "pending" ~forever~ maybe not forever, but for a lot longer than 5 minutes.

I've put a simple demo page together here that doesn't have any libp2p code in it - https://webtransport-pending-sessions.on.fleek.co/

We can use this to see if the issue has been resolved over time.

Interestingly, Firefox does not apply the 5 minute wait, though it does crash quite reliably.

I've tried adding a dial queue to the WebTransport transport that applies the 5 minute wait for new dials once 64 have errored, but we request dial slots quicker than the old ones time out so everything sort of grinds to a halt.
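A rough sketch of that kind of slot tracking is below. It is not the code that was tried: `DialSlotTracker` and its methods are hypothetical, the clock is passed in so the logic is easy to exercise, and it assumes the browser really does release a failed session after 5 minutes, which does not currently seem to be the case.

```typescript
const COOLDOWN_MS = 5 * 60 * 1000 // the 5 minute penalty from the design doc
const SESSION_LIMIT = 64 // the limit quoted in the Chrome warning

// Hypothetical slot tracker: illustrative only, not part of js-libp2p.
class DialSlotTracker {
  private inFlight = 0
  private failedAt: number[] = [] // failure timestamps (ms), oldest first

  // Forget failures whose cooldown has fully elapsed
  private prune (now: number): void {
    this.failedAt = this.failedAt.filter(t => now - t < COOLDOWN_MS)
  }

  // Reserve a slot for a new dial; false means the dial has to wait
  tryStart (now: number): boolean {
    this.prune(now)
    if (this.inFlight + this.failedAt.length >= SESSION_LIMIT) {
      return false
    }
    this.inFlight++
    return true
  }

  // The session became ready, so it no longer counts as pending
  succeeded (): void {
    this.inFlight--
  }

  // The dial errored or was aborted: its slot stays occupied for the cooldown
  failed (now: number): void {
    this.inFlight--
    this.failedAt.push(now)
  }

  // Milliseconds until the oldest failed slot frees up, if there is one
  msUntilNextRelease (now: number): number | undefined {
    this.prune(now)
    const oldest = this.failedAt[0]
    return oldest === undefined ? undefined : oldest + COOLDOWN_MS - now
  }
}
```

A dialer would call `tryStart(Date.now())` before opening a session and, when it returns false, sleep for `msUntilNextRelease(Date.now())`. With 64 failures recorded, every new dial waits on the oldest failure, which matches the grinding to a halt described above.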

We may be able to do something about this by increasing the auto dial retry threshold to something over 5 minutes. This should give Chromium enough time to reach its internal timeout, once the bug that means it never reaches its internal timeout is fixed 🫠

SgtPooki commented 10 months ago

Thanks for staying on top of this one and keeping us updated @achingbrain

achingbrain commented 4 months ago

Ref: https://github.com/ipfs/in-web-browsers/issues/211#issuecomment-1953218400

dhuseby commented 2 months ago

@lidel and Javier from Igalia are working with the Chrome team to get a fix into Chrome. Firefox nightly does have WebTransport and seems to work.

dhuseby commented 2 months ago

link to test page: https://libp2p-webtransport-sessions.on.fleek.co/

dhuseby commented 2 months ago

Waiting on Igalia to submit a patch to Chrome that fixes this.

achingbrain commented 1 month ago

Notes from Igalia work stream (under various Handling pending WebTransport sessions headers): https://hackmd.io/SaJIHZmyRUKfl_fQwoYfog