vasco-santos commented 4 years ago

Connection Manager Overhaul

This Issue is an EPIC to track the work related to the Connection Manager Overhaul. Each milestone context and initial thoughts are described next.

Background

As we land new features like the auto-relay and rendezvous as part of improving connectivity and discoverability in libp2p libp2p/js-libp2p#703, the connection manager overhaul becomes an important work stream to guarantee these protocols work as expected. In addition, this work will be important for some already implemented features/protocols like webrtc-star and bootstrap. Finally, this work is really important to enable the DHT work.

This overhaul should be an initial step towards the future ConnMgr v2.

Milestones Overview

Milestone	Issue	PR	State
`0) Documentation - Baseline`	NA	#757	WIP
`1) Watermarks Observation - Proactive Dial`	TODO	TODO	TODO
`2) Keep Alive`	TODO	TODO	TODO
`3) Protect Connections - Connection Tags`	TODO	TODO	TODO
`4) Protect Connections - Decaying Tags`	TODO	TODO	TODO
`5) Watermarks Observation - Trimming`	TODO	TODO	TODO
`6) Connection Gater`	TODO	#1142	Done
`7) Dial retry`	TODO	TODO	TODO
`8) Disconnect message`	TODO	TODO	TODO

These milestones do not need to be worked on in the displayed sequence. For instance, Connection tags, Connection Gater and Keep Alive can be isolated and implemented.

Context

The Connection manager is responsible for managing all the connections a peer has over time. It allows users to enforce an upper bound on the total number of open connections. To avoid possible service disruptions, connections can be tagged with metadata and optionally "protected" to guarantee that essential connections are kept alive.

0) Documentation - Connection flows

Create a DISCOVERABILITY_AND_CONNECTIVITY.md document as a follow-up to the GETTING_STARTED document. After getting up to speed with how to configure and start libp2p in the getting started document, users should move on to how to set up their peer/network according to their use case/environment, so that peers can be discovered and connections with them can be established.

This will be divided in two categories:

define a baseline of what is a desirable set of connections for each environment / use case
improve current documentation to clarify some flows like the webrtc-star server
- context: https://github.com/ipfs/js-ipfs/issues/3235
- use own webrtc-star server

1) Watermarks observation

Proactive dial

The connection manager proactively dials known peers, in order to have a meaningful set of connections to enable a node to work as expected, according to each use case/environment.

We have been relying on the connection manager low watermark, so that the peer keeps a reasonable number of arbitrary connections. Once we introduce protected connections, as well as tagging important peers, the proactive dial strategy can be modified to keep trying to dial more meaningful peers.

Proactive dial strategies

The following dial strategies should exist:

Find our closest peers on the network, and attempt to stay connected to n of them. If peers from the previous search are no longer our closest peers, we should untag those connections, or just let decaying tags handle this.
Finding, connecting to and protecting our gossipsub peers (same topics search)
Finding and binding to relays with AutoRelay
Finding and binding to application protocol peers (as needed via MulticodecTopology) -- We should clarify what libp2p will handle intrinsically and what users need to do. Ideally, I think libp2p should search for multicodecs for registered topologies automatically.
...

The above dial strategies should have sane defaults, but also support to be overwritten. We should have an interval to double check if we have the most meaningful peers connected to, as well as to proactively dial on some events like Peer discovery/disconnect.

TODO: different strategy for Startup/Persistence?

Subsystems should be able to ask the connection manager for a slice of the connection pool. A connection that belongs in my gossipsub mesh should probably be protected

TODO: Figure out API for interaction between subsystems/topologies and connMgr
Subsystems might want to provide a selector function to choose the peers they care about. AutoRelay will want to check if a peer has metadata with hop = true
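A minimal sketch of what such a selector could look like, assuming peer records expose a `metadata` object. The record shape and the `selectCandidates` helper are hypothetical and not part of any existing libp2p API.

```javascript
// Hypothetical peer records; the `metadata` shape is an assumption for illustration.
const peers = [
  { id: 'peerA', metadata: { hop: true } },
  { id: 'peerB', metadata: {} },
  { id: 'peerC', metadata: { hop: false } }
]

// A selector is a predicate a subsystem hands to the connection manager.
// AutoRelay only cares about peers that advertise hop = true.
const autoRelaySelector = (peer) => peer.metadata.hop === true

// Hypothetical helper: pick up to `limit` known peers matching a selector.
function selectCandidates (knownPeers, selector, limit) {
  return knownPeers.filter(selector).slice(0, limit)
}

const relayCandidates = selectCandidates(peers, autoRelaySelector, 2)
```

Keeping the selector a plain predicate means the connection manager does not need to know anything about the subsystem that supplied it.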

Trim Connections

The connection manager trims less useful connections to be below a high watermark number.

New connections should be given a grace period before they are subject to trimming - Short ttl decay tags
Trimming automatically runs on demand
- Verification on every Peer connect event
- Attempt to keep a balance between subsystems connections and their needs
- If a subsystem is exceeding its agreed allocation of connections, then we would look at disconnecting peers from it that no other system is using.
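The trimming rules above could be sketched roughly as follows. All names are hypothetical, and the ordering policy (fewest subsystems first, then oldest) is an assumption chosen for illustration rather than an agreed design.

```javascript
// Pick which connections to close once the high watermark is exceeded.
// `now` is passed in explicitly so the logic does not depend on the system clock.
function selectConnectionsToTrim (connections, { highWatermark, gracePeriodMs, now }) {
  const excess = connections.length - highWatermark
  if (excess <= 0) return []

  return connections
    // Never trim protected connections or connections still in their grace period
    .filter((c) => !c.protected && now - c.openedAt >= gracePeriodMs)
    // Assumption: connections used by fewer subsystems are less useful; ties go to the oldest
    .sort((a, b) => a.subsystems.length - b.subsystems.length || a.openedAt - b.openedAt)
    .slice(0, excess)
}
```

This would be called on every peer connect event. If fewer than `excess` connections are eligible, the function returns what it can and the pool stays above the watermark until grace periods expire.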

2) Keep Alive

Currently, if a connection does not have anything going on for a while, it will timeout and close. Libp2p should guarantee that specific connections are alive. This is important for keeping connected to peers important to us, both in terms of infrastructure or application layer. Remote listening (webrtc-star, relay, etc) is really important in this context.

Keep Alive should be used for protected peers via the API (Milestone 3) and Peers provided in the configuration.

In most cases, a ping on the connection should be enough, but this needs to be tested for each transport.
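A minimal keep-alive sketch, assuming the caller supplies a `ping(peerId)` function that returns a promise. The `createKeepAlive` name and its options are hypothetical; whether a ping is actually enough to keep a connection open still needs to be verified per transport.

```javascript
// Periodically ping a set of peers so that idle connections do not time out.
function createKeepAlive ({ peerIds, ping, intervalMs, onFailure = () => {} }) {
  let timer = null

  // Ping every tracked peer once; failures are reported through onFailure, not thrown
  const tick = () => Promise.all([...peerIds].map(async (peerId) => {
    try {
      await ping(peerId)
    } catch (err) {
      onFailure(peerId, err)
    }
  }))

  return {
    tick,
    start () {
      if (timer === null) timer = setInterval(tick, intervalMs)
    },
    stop () {
      clearInterval(timer)
      timer = null
    }
  }
}
```

The set of `peerIds` would be fed by protected peers and by peers listed in the configuration.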

3) Protect important connections

ConnManager tracks connections to peers, and allows consumers to associate metadata with each peer. This enables connections to be trimmed based on implementation-defined metadata per peer.

To see: https://github.com/libp2p/js-libp2p/issues/369

Connection tags

API

(based on go interface: https://github.com/libp2p/go-libp2p-core/blob/master/connmgr/manager.go)

Tag a peer with a string, associating a weight with the tag.
- tagPeer (peerId: PeerId, tag: string, weight: number) : void
Untag removes the tagged value from the peer.
- untagPeer (peerId: PeerId, tag: string) : void
Get the metadata associated with the peer connection
- getTagInfo (peerId: PeerId) : TagInfo
- tagInfo should be stored in the metadataBook
Protect a peer from having its connection(s) pruned.
- protect (peerId: PeerId, tag: string)
- This would need to return a boolean or throw
Unprotect a peer from having its connection(s) pruned.
- unProtect (peerId: PeerId, tag: string)
Check if a peer connection is protected.
- isProtected (peerId: PeerId, tag: string)
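To make the API above more concrete, here is an in-memory sketch. It uses plain strings as peer ids instead of PeerId, does not persist anything to the metadataBook, and the `total` field plus the boolean returned by `unProtect` are assumptions for illustration.

```javascript
class PeerTags {
  constructor () {
    this.tags = new Map() // peer id -> Map<tag, weight>
    this.protections = new Map() // peer id -> Set<tag>
  }

  tagPeer (peerId, tag, weight) {
    if (!this.tags.has(peerId)) this.tags.set(peerId, new Map())
    this.tags.get(peerId).set(tag, weight)
  }

  untagPeer (peerId, tag) {
    const tags = this.tags.get(peerId)
    if (tags === undefined) return
    tags.delete(tag)
    if (tags.size === 0) this.tags.delete(peerId)
  }

  // Returns a copy of the tags plus their summed weight (`total` is a hypothetical field)
  getTagInfo (peerId) {
    const tags = new Map(this.tags.get(peerId) || [])
    let total = 0
    for (const weight of tags.values()) total += weight
    return { tags, total }
  }

  protect (peerId, tag) {
    if (!this.protections.has(peerId)) this.protections.set(peerId, new Set())
    this.protections.get(peerId).add(tag)
  }

  // Returns true if the peer is still protected under another tag (assumed behaviour)
  unProtect (peerId, tag) {
    const protectedBy = this.protections.get(peerId)
    if (protectedBy === undefined) return false
    protectedBy.delete(tag)
    if (protectedBy.size === 0) this.protections.delete(peerId)
    return protectedBy.size > 0
  }

  isProtected (peerId, tag) {
    const protectedBy = this.protections.get(peerId)
    return protectedBy !== undefined && protectedBy.has(tag)
  }
}
```

Protecting per tag lets several subsystems protect the same peer independently: the peer only becomes trimmable once every subsystem has released its own tag.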

Data structures

/**
 * TagInfo object stores metadata associated with a peer
 * @typedef {Object} TagInfo
 * @property {Map<string, number>} tags map with tags and their current weight
 * @property {number} firstSeen timestamp of first connection establishment.
 * @property {number} weight seq counter.
 */

Integration with Trim connections

Connection tags will allow trimming to become more intelligent at this stage. Peers should be iterated and the weight of their tags should be used as the first criterion.
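One way to read "weight of the tags as a first criterion" is to sum each peer's tag weights and trim the lowest totals first. The sketch below assumes tags are available as a map from peer id string to a tag/weight map; the function name is hypothetical.

```javascript
// Rank peers from least to most valuable by the sum of their tag weights.
// `peerTags` is a Map<peerIdString, Map<tag, weight>>.
function rankPeersForTrimming (peerTags) {
  return [...peerTags.entries()]
    .map(([peerId, tags]) => ({
      peerId,
      weight: [...tags.values()].reduce((sum, w) => sum + w, 0)
    }))
    .sort((a, b) => a.weight - b.weight)
}
```

Peers at the front of the returned list would be the first candidates for trimming, after protected peers have been filtered out.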

4) Decaying tags

Note: Inspired by go-libp2p https://github.com/libp2p/go-libp2p-core/blob/master/connmgr/decay.go

A decaying tag is one whose value automatically decays over time. The decay behaviour is encapsulated in a user-provided decaying function (DecayFn). The function is called on every tick (determined by the interval parameter), and returns either the new value of the tag, or whether it should be erased altogether.

We do not set values on a decaying function, but "bump" decaying tags by a delta value. This calls the BumpFn with the old value and the delta, to determine the new value.

While users should be able to provide their own functions, we should provide some preset functions to be used. Behaviours that are straightforward to implement include:

Decay a tag by -1, or by half its current value, on every tick.
Every time a value is bumped, sum it to its current value.
Exponentially boost a score with every bump.
Sum the incoming score, but keep it within min, max bounds.
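The presets listed above might look like the following in JavaScript. This is a sketch: the function names are hypothetical, decay functions are assumed to take the current numeric value and return `{ after, remove }`, and the exponential boost is just one possible interpretation.

```javascript
// Decay presets: take the current value, return the new value and whether to erase the tag
const decayLinear = (step) => (value) => {
  const after = value - step
  return { after, remove: after <= 0 }
}

const decayHalf = () => (value) => {
  const after = Math.floor(value / 2)
  return { after, remove: after <= 0 }
}

// Bump presets: take the current value and a delta, return the new value
const bumpSum = () => (value, delta) => value + delta

const bumpSumBounded = (min, max) => (value, delta) =>
  Math.min(max, Math.max(min, value + delta))

// Assumption: "exponential boost" means scaling the current score before adding the delta
const bumpExponential = (factor) => (value, delta) => value * factor + delta
```

Each preset is a factory, so parameters such as the step or the bounds are fixed once when the tag is registered.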

This is particularly important for scenarios like the Bootstrap discovery. When it starts, these connections are really important to get to know other peers. But as time passes and new connections exist, peers should disconnect from the bootstrap nodes.

API

setDecayingTag(tag: string, interval: time, decayFn: function, bumpFn: function)

// DecayFn applies a decay to the peer's score. The implementation must call
// DecayFn at the interval supplied when registering the tag.
//
// It receives a copy of the decaying value, and returns the score after
// applying the decay, as well as a flag to signal if the tag should be erased.
type DecayFn func(value DecayingValue) (after int, rm bool)

// BumpFn applies a delta onto an existing score, and returns the new score.
//
// Non-trivial bump functions include exponential boosting, moving averages,
// ceilings, etc.
type BumpFn func(value DecayingValue, delta int) (after int)
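A JavaScript sketch of how `setDecayingTag` could tie these two functions together. It simplifies the Go types: `decayFn` receives a plain number instead of a DecayingValue and returns `{ after, remove }`. The `bump` and `tick` method names are assumptions, and `tick` is called manually here where a real implementation would schedule it every `interval`.

```javascript
class DecayingTags {
  constructor () {
    this.definitions = new Map() // tag -> { interval, decayFn, bumpFn }
    this.values = new Map() // tag -> Map<peer id, current value>
  }

  setDecayingTag (tag, interval, decayFn, bumpFn) {
    if (this.definitions.has(tag)) throw new Error(`decaying tag "${tag}" is already registered`)
    this.definitions.set(tag, { interval, decayFn, bumpFn })
    this.values.set(tag, new Map())
  }

  // Values are never set directly: they are bumped by a delta through bumpFn
  bump (peerId, tag, delta) {
    const definition = this.definitions.get(tag)
    if (definition === undefined) throw new Error(`unknown decaying tag "${tag}"`)
    const values = this.values.get(tag)
    const after = definition.bumpFn(values.get(peerId) || 0, delta)
    values.set(peerId, after)
    return after
  }

  // Apply one decay step to every peer carrying `tag`
  tick (tag) {
    const definition = this.definitions.get(tag)
    if (definition === undefined) throw new Error(`unknown decaying tag "${tag}"`)
    const values = this.values.get(tag)
    for (const [peerId, value] of values) {
      const { after, remove } = definition.decayFn(value)
      if (remove) values.delete(peerId)
      else values.set(peerId, after)
    }
  }
}
```

For the bootstrap scenario, a tag bumped once on connect and halved on every tick would fade out on its own, leaving the connection eligible for trimming.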

5) Connection Gater

TODO: https://github.com/libp2p/go-libp2p-core/blob/master/connmgr/gater.go

Related: #175

6) Connection Retry

Retry a dial if it fails on a first attempt.

7) Disconnect

Sometimes it will be possible to have flows where a peer A wants to disconnect from peer B because it has a lot of connections, all of them more important than the connection with peer B. However, peer B wants to be connected to peer A. A message should be exchanged so that peer B understands that it should not retry it (for a given time?) and eventually a peer exchange. This needs to be spec'ed. Initial discussion at https://github.com/libp2p/go-libp2p/issues/238

Notes

Subsystems, such as pubsub, auto-relay, should provide a function to rank what peers they would like to have connections with.

References

vasco-santos commented 4 years ago

The initial post of the issue describes the main areas that need to be improved within the connection manager scope, as well as the order I think they should be tackled. Concrete solutions for some of the problems/features mentioned above still need to be polished.

During the implementation of each milestone, a written artefact should come together with the implementation for alignment of what is the proposed solution and documentation purposes.

Considering the milestones table, I believe that Milestones 0-3 have a higher priority and it would be great to have them for releasing auto-relay + rendezvous. The remaining milestones can come afterwards.

cc @jacobheun

jacobheun commented 4 years ago

Q: Do we really need a low watermark? We could have a configurable number of connection slots that would be used internally and we should try to be connected to the biggest number of peers possible (below the high watermark - number of connections)

This is really there for preventing the connection manager from culling connections beyond that lower bound. With protected connections this is less important. Once we can tag important peers the proactive dial strategy will change, right now it's just a crude "priority" dial. It would be helpful to flesh out what these proactive dial strategies actually are, and document those for clarity. Things like:

On an interval (10m+) find our closest peers on the network, and attempt to stay connected to n of them. If peers from the previous search are no longer our closest peers, we should untag those connections, or just let decaying tags handle this.
Finding, connecting to and protecting our gossipsub peers
Finding and binding to relays with AutoRelay
Finding and binding to application protocol peers (as needed via MulticodecTopology). We should clarify what libp2p will handle intrinsically and what users need to do. Ideally, I think libp2p should search for multicodecs for registered topologies automatically.

The first 3 here I think are the higher priority in terms of creating a solid set of base connections.

All discovered peers should be dialed, so that we exchange the identify message with them. This enables us to better track what peers are more valuable to be connected to

I don't think this is necessary and it's prone to be very wasteful. If we are proactively searching for peers that will have meaning to us (DHT/rendezvous) we don't need to do this ambient poking of the network, and keep track of who we've dialed. Active searching and connecting is the approach we should take. With larger networks active searching will still work, whereas blind dialing to check capabilities starts to fail quickly.

New connections should be given a grace period before they are subject to trimming

Short ttl decay tags would be great for this.

Criteria to check, in order: number of protocols used, weight of protocol, timestamp of open connection

I'm concerned these might be bad indicators. A connection that belongs in my gossipsub mesh should probably be protected, regardless of the number of other protocols we use on that connection. We're not weighting the protocol itself as lots of peers run gossipsub, we're weighting a specific peer due to its importance in that system. If a subsystem is exceeding its agreed allocation of connections, then we would look at disconnecting peers from it that no other system is using.

Keep Alive

👍 In the majority of cases a ping on that connection should suffice, but we'll need to test this on the different transports. This is also really important for remote listening (webrtc-star, relay, etc).

Suggestion: Should we have a metric, such as, maximumToProtect

This could be added later as needed. If too many peers are being protected it's likely just either a bug in a subsystem or user abuse of something like peering. If subsystems register for connection pools, that could be treated as the max for that system.

Disconnect

There was some initial discussion at https://github.com/libp2p/go-libp2p/issues/238 for the polite disconnect protocol.

General Note It might be worth pushing the Trim Connection updates to after Decay and general tags are in place. It will be a lot more effective if we have meaningful tags in place before making changes there.

vasco-santos commented 4 years ago

Thanks for your thoughts ❤️

It might be worth pushing the Trim Connection updates to after Decay and general tags are in place. It will be a lot more effective if we have meaningful tags in place before making changes there.

Agreed, changed!

Also modified the initial post based on your thoughts. Still need to flesh out the Watermarks observation better.

vasco-santos commented 3 years ago

Here follow some thoughts on a WIP proposal for the Connection Manager Design. These notes focus mostly on the design to enable Proactive Dial and Better connection trimming. Connection tags and gating might have some intersection here, but they are mostly isolated work, at least in terms of API and Data structures as the other components will only be consumers.

cc @wemeetagain

Connection Manager + Registrar Design Proposal

Overview

Connection management can take place in a reactive or proactive fashion. This proposal will be focused on a hybrid approach where the ConnectionManager component will be responsible for a reactive maintenance of connections, according to the available configured pool size. The registrar component will receive topology registrations where each topology will handle the proactive connection management by trying to guarantee that the number of connections is within the configured thresholds. Moreover, once peer and connection scoring is in place, the mentioned components will likely collaborate to create scores and ask for connections/disconnections.

The proactive management of components will replace the current autoDial option of libp2p. The autoDial approach just tries to blindly dial any newly discovered peer unless the number of connections is outside the configured boundaries. This new approach will become more like a traffic shaper where the node will shape its network according to its needs.

For an efficient and easy to use connection management, libp2p will need:

Global connection pool
Declaration and management of usage quotas
Quota usage supervision and regulation, either
- proactively: by having consumers check out resources when needed
- reactively: by monitoring usage and taking compensating and rebalancing actions upon breach of quota.
Optimal connection control decision-taking
- should come next with scoring
Connection allocation observability

Other considerations:

Connections might be reused by multiple topologies at the same time
- these connections should not be counted twice and should be prioritized when trimming
On reconnect a backoff should exist
A connection should have a grace period
Other protected connections
- Discovery context ...
- PeerStore should store context of discovery in metadata?

Flows

First Start (with bootstrap discovery module)

When a libp2p node starts, it will need to bootstrap to the network and learn about peers that will enable it to fully operate (hopefully more distributed in the future). One of the common ways of doing this is via bootstrap nodes.

These bootstrap nodes are important during the initial lifecycle of the node, but once the node gets to know other peers it should disconnect from them, as the bootstrap nodes will have a lot of requests from other peers. However, they should be disconnected only when enough other peers are connected.

It is worth mentioning that the above might not be always the case. For instance, if a bootstrap node is a relay and the node binds to it for incoming connections, this connection must be protected.

Subsequent Starts (with populated PeerStore)

When a libp2p node restarts, it will likely have persisted a set of peers previously discovered. The persisted data will include the known protocols of a peer, as well as its metadata. While this information is not always correct, as peers might change the protocols they run or might go offline, it provides enough value to be the first criterion. If the node can get connected to enough peers for its requirements, it should not connect to the bootstrap nodes. Moreover, the node should look for peers running a relay and supporting HOP if it has autoRelay enabled.

Preemptive Disconnect

When a connection with a given node is not needed anymore (example: bootstrap node) or the maximum threshold is reached, a peer will be disconnected. In some cases, the disconnected peer might try to reconnect.

While we do not have a disconnect protocol, we should guarantee that reconnect attempts from these peers are blocked and that when peers try to reconnect they have an exponential backoff and perhaps a configurable maxReconnectAttempts.

Remote disconnect

For several different reasons, a remote peer might disconnect. If this connection was important, the peer should try to reconnect with an exponential backoff and perhaps a configurable maxReconnectAttempts.

The topology should be responsible for the re-connect
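A sketch of reconnecting with exponential backoff and a configurable `maxReconnectAttempts`. The `dial` callback, the `baseMs`/`maxMs` options and their default values are assumptions for illustration, not existing libp2p options.

```javascript
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

// Delay before retrying after the given (0-based) failed attempt, doubling each time up to maxMs
function backoffDelay (attempt, { baseMs = 1000, maxMs = 60000 } = {}) {
  return Math.min(maxMs, baseMs * 2 ** attempt)
}

// Call `dial` until it succeeds or maxReconnectAttempts is exhausted
async function reconnectWithBackoff (dial, options = {}) {
  const { maxReconnectAttempts = 5, sleep = defaultSleep } = options
  let lastError = new Error('maxReconnectAttempts must be at least 1')

  for (let attempt = 0; attempt < maxReconnectAttempts; attempt++) {
    try {
      return await dial()
    } catch (err) {
      lastError = err
      // Only wait if another attempt will follow
      if (attempt < maxReconnectAttempts - 1) await sleep(backoffDelay(attempt, options))
    }
  }

  throw lastError
}
```

A topology that considers a peer important could call `reconnectWithBackoff(() => libp2p.dial(peerId))` from its disconnect handler. Adding random jitter to the delay would help avoid many peers retrying at the same moment.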

Remote connect with max pool

If an inbound connection request is received and the current number of connections is already the MAX_VALUE, the inbound connection should be refused.
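The inbound rules from the flows above (refuse when the pool is full, and block reconnect attempts from peers we disconnected on purpose) could be combined in one small check. This is a sketch with hypothetical names; `blockMs` is an assumed fixed block duration, and timestamps are passed in explicitly.

```javascript
class InboundGate {
  constructor ({ maxConnections, blockMs }) {
    this.maxConnections = maxConnections
    this.blockMs = blockMs
    this.blockedUntil = new Map() // peer id string -> timestamp (ms) until which inbound is refused
  }

  // Record that we disconnected `peerId` ourselves at time `now`
  onPreemptiveDisconnect (peerId, now) {
    this.blockedUntil.set(peerId, now + this.blockMs)
  }

  // Decide whether an inbound connection from `peerId` should be accepted
  accept (peerId, currentConnections, now) {
    if (currentConnections >= this.maxConnections) return false

    const until = this.blockedUntil.get(peerId)
    if (until !== undefined) {
      if (now < until) return false
      this.blockedUntil.delete(peerId) // block expired
    }
    return true
  }
}
```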

Libp2p Node Connection Lifecycle overview

The lifecycle of a libp2p node would be the following:

Bootstrap to the network
Actively discover peers that run the protocols this node is running (via rendezvous or other means - Discovery API), as well as its closest peers
Actively establish a connection with the above peers
Continue until we have at least minPeers connections to those peers (temporary nodes like bootstrap nodes should not count)
Create n+1 overlay networks with those peers, depending on the needs and quantities of those protocols
Prioritize all overlay network connections
Switch to passive discovery

Please note that the first 2 steps can be skipped (or reduced) if the node had previously been running and already has peers stored in the PeerStore.

Implementation

Libp2p configuration

The global share of connections can be set in the libp2p connectionManager configuration. Libp2p should have sane defaults (which should evolve with the libp2p configuration effort, where we aim to provide ready to go libp2p configs for several scenarios/runtimes).

const Libp2p = require('libp2p')

const libp2p = await Libp2p.create({
  // ...
  connectionManager: {
    maxConnections: 60,
    minConnections: 0,
    // TODO: Consider a number of connections that can only be used for libp2p core operations, like connect to rendezvous points, star servers, relays, ...
    // ... per https://github.com/libp2p/js-libp2p/blob/master/doc/CONFIGURATION.md#configuring-connection-manager
  },
  config: {
    pubsub: {
      // ... https://github.com/libp2p/js-libp2p/blob/master/doc/CONFIGURATION.md#customizing-pubsub
      topology: {
        min: 10,
        max: 30
      }
    },
    // core topologies configuration
  }
})

Libp2p core connectivity, such as connections to rendezvous points and to other peers used for listening purposes, should be protected by the relevant components / subsystems (Relay Listener, Rendezvous client).

Libp2p Connection Manager

declare class ConnectionManager {
  // Open connections
  connections: Map<string, Connection[]>;
  tags: Map<string, string>;
  // Number of connection slots currently requested by consumers
  requestedConnections: number;

  constructor (options: { max: number, min: number });

  requestConnectionSlots (amount: number): void;

  protect (idStr: string): void;

  // TODO: think better about release resources, timings...
  requestBurstConnections (amount: number): boolean;
}

Connection Manager is responsible for:

throw error if too many initial connections are requested by their consumers
- Should keep track of current requestedConnections
reactive approach to manage connections at a higher level.
- track number of connections established every time a new connection is established
trim connections if maximum threshold was reached
- not trim protected connections
- priority to not trim overlay connections
- if no other connections to trim, request topology with more connections
(Future work) support topology requests to have some extra connections for a burst
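The first responsibility, rejecting slot requests that exceed the global pool, could be sketched like this. The class and the `releaseConnectionSlots` method are hypothetical; only `requestConnectionSlots` and `requestedConnections` come from the interface above.

```javascript
class SlotAccounting {
  constructor ({ max, min = 0 }) {
    this.max = max
    this.min = min
    this.requestedConnections = 0
  }

  // Reserve `amount` slots for a consumer, or throw if the pool would be oversubscribed
  requestConnectionSlots (amount) {
    if (!Number.isInteger(amount) || amount < 0) {
      throw new TypeError('amount must be a non-negative integer')
    }
    if (this.requestedConnections + amount > this.max) {
      throw new Error(`cannot reserve ${amount} slots: ${this.requestedConnections} of ${this.max} already requested`)
    }
    this.requestedConnections += amount
  }

  // Hypothetical counterpart, e.g. for when a topology is unregistered
  releaseConnectionSlots (amount) {
    this.requestedConnections = Math.max(0, this.requestedConnections - amount)
  }
}
```

With `maxConnections: 60`, a pubsub topology asking for 30 slots and another topology asking for 20 would both succeed, while a third request for 11 would throw.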

Libp2p Discovery

When discovering peers, the context that resulted in the peer being discovered might be important for scoring and for configuring libp2p topologies.

{
  peerId,
  multiaddrs,
  metadata: {
    context: Discovery.tag
    // Other important metadata
  }
}

This context will be useful for setting up decaying tags for bootstrap nodes for example.

Registrar

Registrar should mediate the interactions between the topologies and the connection manager.

In the beginning, it should request the connectionManager slots for the requirements of each topology (maximum and minimum).

It should tag connections used by the topologies to provide visibility to the connection manager for the reactive management of connections.

Libp2p Topologies

A topology will need to:

Keep track of peers known to run the protocol / have given metadata
Keep track of peers connected within its context
- onConnect should return information for this
(Future work) Provide sort/discover strategies
- We should provide defaults
- Look in the DHT/Relay for X

Multicodec Topologies

Libp2p protocols like Pubsub, DHT or application level protocols can create their own topology. When a topology is created, a min and max number of peers can be configured.

const MulticodecTopology = require('libp2p-interfaces/src/topology/multicodec-topology')

// ...

const topology = new MulticodecTopology({
  min: 10,
  max: 30,
  multicodecs: [this.protocol],
  handlers: {
    onConnect: this._onPeerConnected,
    onDisconnect: this._onPeerDisconnected
  }
})

this._registrarId = await this._libp2p.registrar.register(topology)

MetadataTopology

Libp2p will have to deal with less structured topologies, such as Bootstrap nodes. These modules should create topologies in their context and needed use case.

const MetadataTopology = require('libp2p-interfaces/src/topology/metadata-topology')

// ...

const topology = new MetadataTopology({
  min: 10,
  max: 30,
  metadata: [this.metadata],
  handlers: {
    onConnect: this._onPeerConnected,
    onDisconnect: this._onPeerDisconnected
  }
})

this._registrarId = await this._libp2p.registrar.register(topology)

// Unregister when not needed
this._libp2p.registrar.unregister(this._registrarId)

Modules like bootstrap should decide to register and unregister according to the PeerStore content and connection tags.

Flows

First Start

On startup, bootstrap metadata topology should kick in and connect to the bootstrap nodes. Once connections are established, these nodes should be protected while they are important. A decaying tag should be added to them.

Once tags are dropped and the system minimum number of connections is reached, these connections can start to be dropped.

Once all bootstrap connections are dropped the bootstrap metadata topology is unregistered. It can still probably listen for a connection manager event of low number of connections to act and restart?

Removing the kept space for these connections will allow other subsystems to burst with the released resources.

Subsequent Starts (with populated PeerStore)

On subsequent starts, the bootstrap should only kick in if not enough peers exist after a given period of time.

Discover strategies

TBD

Tags + Decaying Tags

TBD

Connection Gating

TBD

Alternative Designs

Do not create the Metadata Topology

Probably there is no need for a metadata topology abstraction layer at this point, and bootstrap can handle itself.

"Token based" connection manager

The connection manager would be responsible for distributing tokens for each topology. Dialer would be wrapped inside the connectionManager context and would require a token to be used.

Challenges:

Peers connected in "multiple contexts"
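For comparison, a token-based pool might look like the sketch below, with hypothetical names throughout. It deliberately ignores the challenge listed above: a connection shared by two topologies would consume two tokens here.

```javascript
class TokenPool {
  constructor (totalTokens) {
    this.available = totalTokens
    this.held = new Map() // topology id -> number of tokens currently held
  }

  // A topology must acquire a token before it is allowed to dial
  acquire (topologyId) {
    if (this.available === 0) return false
    this.available--
    this.held.set(topologyId, (this.held.get(topologyId) || 0) + 1)
    return true
  }

  // Return a token to the pool when the connection closes
  release (topologyId) {
    const held = this.held.get(topologyId) || 0
    if (held === 0) return false
    this.held.set(topologyId, held - 1)
    this.available++
    return true
  }
}
```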

Future Work

Connection/Peer scoring
Support churn in topologies
- Topologies can ask for some extra connections for a small interval. E.g. DHT Random Walk is going to start

References:

Topologies + Conn Manager
- https://github.com/libp2p/notes/issues/13#issue-463852516
Conn Manager v2
- https://github.com/libp2p/specs/pull/161/files
Tagging
- https://github.com/libp2p/js-libp2p/issues/369

SgtPooki commented 1 year ago

It doesn't seem like connection-manager / auto-dial has any backoff capabilities currently. I did a quick search and only found backoff functionality in pubsub & pubsub-gossipsub: https://github.com/search?q=repo%3Alibp2p%2Fjs-libp2p%20backoff&type=code

For browser libp2p functionality to work consistently without relying on a specific backend that supports our desired transports (see universal-connectivity needing a specific backend node) we need to optimize auto-dialing and connection attempts.

maschad commented 1 year ago

As discussed in the Open Maintainers call 29-08-23, the scope of this issue is very broad and the connection manager has changed substantially since this was created. There were some valuable suggestions which have been referenced in other issues, namely:

Closing this as this has been broken down into more granular issues.
