libp2p / rust-libp2p

The Rust Implementation of the libp2p networking stack.
https://libp2p.io
MIT License
4.53k stars 940 forks

protocols/dcutr: Don't retry when incoming direct connection succeeded #2607

Open mxinden opened 2 years ago

mxinden commented 2 years ago

As of today, we retry the DCUtR process when the outgoing direct connection upgrade fails.

In the case where the incoming direct connection upgrade succeeded, there is no need to retry the process.

We should check `direct_connections` to see whether an (incoming) direct connection succeeded in the meantime.

https://github.com/libp2p/rust-libp2p/blob/7a1147877a52c7150190acdef35676cdac370299/protocols/dcutr/src/behaviour.rs#L152-L180
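A rough sketch of the proposed guard. The types and names here (`PeerId`/`ConnectionId` as plain integers, `DcutrModel`, `should_retry`) are simplified stand-ins for illustration, not the actual rust-libp2p API:

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical simplified stand-ins for libp2p's PeerId / ConnectionId.
type PeerId = u64;
type ConnectionId = u64;

/// Minimal model of the dcutr retry decision.
struct DcutrModel {
    /// All direct (non-relayed) connections, keyed by peer.
    direct_connections: HashMap<PeerId, HashSet<ConnectionId>>,
}

impl DcutrModel {
    /// Called when the *outgoing* direct connection upgrade fails.
    /// Returns true if the hole-punch process should be retried.
    fn should_retry(&self, peer: PeerId) -> bool {
        // If an incoming direct connection succeeded in the meantime,
        // drop the effort silently: the successful connection already
        // produced its own success event, so no retry and no new events.
        !self
            .direct_connections
            .get(&peer)
            .map_or(false, |conns| !conns.is_empty())
    }
}
```

In the real behaviour, the equivalent check would gate the retry logic linked above before a new relayed upgrade attempt is scheduled.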

esotericbyte commented 2 years ago

I have skimmed: https://pdos.csail.mit.edu/papers/p2pnat.pdf Is the current behavior inefficient, or does it lead to more connection failures because the peers pursue different connections? Is the timing of the connections and the interplay between the network, queued events, and the futures executor a relevant concern?

Plan:

- Implement the `direct_connections` check: if there is a direct connection, stop trying to connect and just drop the effort without injecting any events, because the successful connection would have generated its own success event. If there is not, continue to a retry of the process.
- Finish reading the library code used by the dcutr behaviour.
- Test the code and evaluate the solution.
- Based on testing, consider instrumenting with metrics or tracing.

Code reading questions: The check needs to be made before the upgrade process from a relay connection to a direct connection is restarted. Precisely where should the check go?

- Is this just a single code branch? Keep in mind that Rust resolves some of the logic at compile time, even in futures code.
- Are the events created by `GenerateEvent` futures tasks, or do they go onto a separate event queue? If separate, do they create tasks when handled and thus get placed on a secondary queue?
- What happens if both connections succeed? (I believe `direct_connections` is a hash map from ids to connections, so it would contain more than one connection on each end for their respective ids.)
- If the incoming direct connection upgrade is in progress, how many success cases may still be missed even if `direct_connections` is checked? (Tasks are executed in parallel.)
- What is the latency of the various queues?
- What applications use this code that might be analyzed under various loads?
- Under various loads and contexts, do these queues actually cause significant delays, or are the delays from internal queues negligible? If not: look for a diagram of network messages combined with event and event-queue activity, and if there is none, make one.

mxinden commented 2 years ago

> Is the current behavior inefficient, or does it lead to more connection failures because the peers pursue different connections?

Just inefficient.

> Is the timing of the connections and the interplay between the network, queued events, and the futures executor a relevant concern?

Probably relevant, though we will be collecting more data soon to back up this point. See https://github.com/dennis-tra/punchr.

> If there is a direct connection, stop trying to connect and just drop the effort

Yes, that sounds right.

> Precisely where should the check go?

`inject_dial_failure` should work.

> If separate, do they create tasks when handled and thus get placed on a secondary queue?

There is a single task running all `NetworkBehaviour` and `Swarm` logic, plus one task per connection.

> What happens if both connections succeed? (I believe `direct_connections` is a hash map from ids to connections, so it would contain more than one connection on each end for their respective ids.)

For now, one connection would eventually idle and thus be closed. In the long run we would need https://github.com/libp2p/specs/issues/328.

> What is the latency of the various queues? What applications use this code that might be analyzed under various loads? Under various loads and contexts, do these queues actually cause significant delays, or are the delays from internal queues negligible?

Unfortunately, I don't have data on this. Based on intuition and past experience, I would guess it is negligible, given that network latencies are an order of magnitude higher.

normanade commented 2 years ago

Actually, the `direct_connections` hashmap cannot be reached from outside the `Behaviour` struct, which makes the `pub` on the struct rather useless.

```rust
pub struct Behaviour {
    /// Queue of actions to return when polled.
    queued_actions: VecDeque<ActionBuilder>,

    /// All direct (non-relayed) connections.
    direct_connections: HashMap<PeerId, HashSet<ConnectionId>>,
}
```

thomaseizinger commented 2 years ago

> Actually, the `direct_connections` hashmap cannot be reached from outside the `Behaviour` struct

No need to access it from the outside. The behaviour gets notified about new incoming connections!
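A sketch of that notification path. The names here (`Tracker`, `on_connection_established`, `is_relayed`) are hypothetical stand-ins for the real `NetworkBehaviour` callbacks, which receive the established connection's details and let the behaviour update its own state:

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical simplified stand-ins for libp2p's PeerId / ConnectionId.
type PeerId = u64;
type ConnectionId = u64;

struct Tracker {
    /// All direct (non-relayed) connections, keyed by peer.
    direct_connections: HashMap<PeerId, HashSet<ConnectionId>>,
}

impl Tracker {
    /// Invoked for every established connection; only non-relayed
    /// connections are recorded as direct.
    fn on_connection_established(
        &mut self,
        peer: PeerId,
        conn: ConnectionId,
        is_relayed: bool,
    ) {
        if !is_relayed {
            self.direct_connections.entry(peer).or_default().insert(conn);
        }
    }
}
```

So the behaviour keeps `direct_connections` up to date internally; no external accessor is required.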

esotericbyte commented 1 year ago

Looking back, I am puzzled about how to debug multi-node issues like this in libp2p, or in an app where network infrastructure such as routers is involved. I probably should have asked about it sooner. Are there docs on how to set up automated testing, or is it necessary to recruit machines?

mxinden commented 1 year ago

@esotericbyte unfortunately, we don't have automated testing in place here today. You can follow progress here: https://github.com/libp2p/test-plans/issues/21.

thomaseizinger commented 11 months ago

Update: We now have automated tests for hole-punching!