libp2p / js-libp2p-stardust

A better ws-star implementation
8 stars 4 forks source link

Dialing a disconnected peer results in unfulfilled promise #28

Open danielrempe-ut opened 4 years ago

danielrempe-ut commented 4 years ago

If you dial a peer address that has disconnected from the network, this line:

const conn = await this._upgrader.upgradeOutbound(maConn)

neither throws any error or ever completes. This causes an upstream issue in libp2p where the transport manager never receives an error and therefore also leaves its promise unfulfilled (in libp2p/transport-manager.js). Because the _pendingDial object in transport-management expects the promise to be resolved or rejected, it actually leaves a hanging pendingDial in the queue forever, so if you try to dial the same peer again, it won't actually reattempt the connection.

I know what you're thinking - just don't dial dead peers and there's no problem! But the addition of lib-kad-dht to the mix adds a further problem, in that it's useful to be able to dial dead peers in a DHT setup because they might come back later and start hosting their content again. Furthermore, because the dial is never aborted due to the unfulfilled promise, kad-dht waits a full 30 seconds per dead peer even if it's able to get good records from other peers in the network.

You can replicate this issue if you have a stardust network with a DHT. Connect a few nodes to the system (3 is good for testing) then pick one to be the dead guy. Have that guy add something to the DHT, then disconnect from the network. If one of the other nodes tries to retrieve the value from the DHT, they will dial the dead node and spin for 30 seconds.

I've been trying to fix the issue for the better part of 2 days and am running into a wall. One idea would be to have the server send a dial error if the target peer ID isn't connected to it, which it is not doing now. Or if there were some way for the client to verify that they're really connected before running the upgrade method.

danielrempe-ut commented 4 years ago

OK, I know I just submitted this, but I fixed the issue. I can't believe this took two whole days of effort.... It turns out you forgot to add E_TARGET_UNREACHABLE to your Error enumeration in proto.js. That's it. That's the entire problem. The enumeration value doesn't exist even though it's used in the server's dial code, so it just returns 0 even if it doesn't have a peer ID connected.

danielrempe-ut commented 4 years ago

I created a PR for this: Pull Request 29