libp2p / js-libp2p

The JavaScript Implementation of libp2p networking stack.
https://libp2p.io
Other
2.3k stars 440 forks source link

Dial queue timeout for multiple multiaddresses incorrectly used for each connection separately #2368

Open akim-bow opened 8 months ago

akim-bow commented 8 months ago

Version: libp2p@1.2.0

Severity:

Medium

Description:

If remote peer proposes multiple multiaddresses, e.g. [/dns/nox-1, /ip4/10.50.10.10, /ip4/127.0.0.1] (only the last one is reachable by default), libp2p will try to connect to them one by one. As you can see in the source code, the shared signal is created and used in every attempt to dial multiaddresses. Hence, the first address could take up all the time until the signal aborts and other addresses would be dropped without an attempt to connect to them.

I'm proposing to use 2 separate signals - first one for batch of multiaddresses, the second one for each multiaddress separately. It will solve the case when a peer provides multiple adresses and only some of them are actually valid. If the description isn't clear enough i can try to build simple reproduction example.

akim-bow commented 8 months ago

@achingbrain can you please look into it?

zeroxbt commented 7 months ago

@achingbrain could you take a look at this ? Most auto dials are failing because of it

achingbrain commented 7 months ago

@akim-bow what kind of addresses are you having problems with?

The example you gave was [/dns/nox-1, /ip4/10.50.10.10, /ip4/127.0.0.1] - DNS addresses take time to resolve, so that can be bad, fair enough, but the other two are private local addresses so assuming TCP, they should be fast to fail?

If you know certain addresses are going to be unreliable or slow, you can also pass an addressSorter as part of the connectionManager config to ensure they are dialled last, or a connectionGater to ensure they are not dialled at all.

zeroxbt commented 7 months ago

@achingbrain I'm having the same issue when running nodes in local environment.

When dialing a peer, these are the addresses found in the peer store:

[/ip4/10.2.0.2/tcp/9106, /ip4/127.0.0.1/tcp/9106, /ip4/172.20.5.94/tcp/9106]

The first address is not reachable, while the other two are. The dialer fails after dialTimeout milliseconds because of the first address, without ever attempting to dial the other two. All auto dials are also failing for the same reason.

achingbrain commented 7 months ago

My concern with the proposed solution is that having separate timeouts for individual addresses could lead to very long dial times.

I wonder if we could be smarter with the dialling, for example if the IP address of the target multiaddr is a class A private network address, deprioritise or entirely skip dialling it unless the current node also has a class A private network address in the same logical network?

zeroxbt commented 7 months ago

Although sorting and filtering addresses is a good option, as a user I would still expect the dial function to attempt dialing all provided addresses. Why not give users the option to choose a maxParallelDialsPerPeer parameter with a default value of 1 ?