Closed vasco-santos closed 1 year ago
The initial post of the issue describes the main areas that need to be improved within the connection manager scope, as well as the order I think they should be tackled. Concrete solutions for some of the problems/features mentioned above still need to be polished.
During the implementation of each milestone, a written artefact should accompany the implementation, both to align on the proposed solution and for documentation purposes.
Considering the milestones table, I believe that Milestones 0-3 have a higher priority, and it would be great to have them for releasing auto-relay + rendezvous. The remaining milestones can come next.
cc @jacobheun
Q: Do we really need a low watermark? We could have a configurable number of connection slots that would be used internally and we should try to be connected to the biggest number of peers possible (below the high watermark - number of connections)
This is really there to prevent the connection manager from culling connections beyond that lower bound. With protected connections this is less important. Once we can tag important peers the proactive dial strategy will change; right now it's just a crude "priority" dial. It would be helpful to flesh out what these proactive dial strategies actually are, and document them for clarity. Things like:
n to them. If peers from the previous search are no longer our closest peers, we should untag those connections, or just let decaying tags handle this.
The first 3 here I think are the higher priority in terms of creating a solid set of base connections.
All discovered peers should be dialed, so that we exchange the identify message with them. This enables us to better track what peers are more valuable to be connected to
I don't think this is necessary and it's prone to be very wasteful. If we are proactively searching for peers that will have meaning to us (DHT/rendezvous) we don't need to do this ambient poking of the network, and keep track of who we've dialed. Active searching and connecting is the approach we should take. With larger networks active searching will still work, whereas blind dialing to check capabilities starts to fail quickly.
New connections should be given a grace period before they are subject to trimming
Short ttl decay tags would be great for this.
Criteria to check, in order: number of protocols used, weight of protocol, timestamp of open connection
I'm concerned these might be bad indicators. A connection that belongs in my gossipsub mesh should probably be protected, regardless of the number of other protocols we use on that connection. We're not weighting the protocol itself, as lots of peers run gossipsub; we're weighting a specific peer due to its importance in that system. If a subsystem is exceeding its agreed allocation of connections, then we would look at disconnecting peers from it that no other system is using.
Keep Alive
👍 In the majority of cases a ping on that connection should suffice, but we'll need to test this on the different transports. This is also really important for remote listening (webrtc-star, relay, etc).
Suggestion: Should we have a metric, such as, maximumToProtect
This could be added later as needed. If too many peers are being protected it's likely just either a bug in a subsystem or user abuse of something like peering. If subsystems register for connection pools, that could be treated as the max for that system.
Disconnect
There was some initial discussion at https://github.com/libp2p/go-libp2p/issues/238 for the polite disconnect protocol.
General Note It might be worth pushing the Trim Connection updates to after Decay and general tags are in place. It will be a lot more effective if we have meaningful tags in place before making changes there.
Thanks for your thoughts ❤️
It might be worth pushing the Trim Connection updates to after Decay and general tags are in place. It will be a lot more effective if we have meaningful tags in place before making changes there.
Agreed, changed!
Also modified the initial post based on your thoughts. I still need to flesh out the Watermarks observation better.
Here follow some thoughts on a WIP proposal for the Connection Manager Design. These notes focus mostly on the design to enable Proactive Dial and better connection trimming. Connection tags and gating might have some intersection here, but they are mostly isolated work, at least in terms of API and data structures, as the other components will only be consumers.
cc @wemeetagain
Connection management can take place in a reactive or proactive fashion. This proposal focuses on a hybrid approach where the ConnectionManager component is responsible for reactive maintenance of connections, according to the configured pool size. The registrar component will receive topology registrations, and each topology will handle the proactive connection management by trying to guarantee that its number of connections stays within the configured thresholds. Moreover, once peer and connection scoring is in place, the mentioned components will likely collaborate to create scores and ask for connections/disconnections.
The proactive management of components will replace the current autoDial option of libp2p. The autoDial approach just tried to blindly dial any newly discovered peer unless the number of connections was outside the configured boundaries. This new approach will become more like a traffic shaper where the node will shape its network according to its needs.
For an efficient and easy to use connection management, libp2p will need:
Other considerations:
When a libp2p node starts, it will need to bootstrap to the network and learn about peers that will enable it to fully operate (hopefully more distributed in the future). One of the common ways of doing this is via bootstrap nodes.
These bootstrap nodes are important during the initial lifecycle of the node, but once the node gets to know other peers it should disconnect from them, as the bootstrap nodes will have a lot of requests from other peers. However, they should be disconnected only when enough other peers are connected.
It is worth mentioning that the above might not be always the case. For instance, if a bootstrap node is a relay and the node binds to it for incoming connections, this connection must be protected.
When a libp2p node restarts, it will likely have persisted a set of peers previously discovered. The persisted data will include the known protocols of a peer, as well as its metadata. While this information is not always correct, as peers might change the protocols they run or might go offline, it provides enough value to be the first criterion. If the node can get connected to enough peers for its requirements, it should not connect to the bootstrap nodes. Moreover, the node should look for peers running a relay and supporting HOP if it has autoRelay enabled.
When a connection with a given node is not needed anymore (example: bootstrap node) or the maximum threshold is reached, a peer will be disconnected. In some cases, this peer might try to reconnect with the node.
While we do not have a disconnect protocol, we should guarantee that reconnect attempts from these peers are blocked and that when peers try to reconnect they have an exponential backoff and perhaps a configurable maxReconnectAttempts.
For several different reasons, a remote peer might disconnect. If this connection was important, the peer should try to reconnect with an exponential backoff and perhaps a configurable maxReconnectAttempts.
The topology should be responsible for the re-connect
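The reconnect behaviour described above could look roughly like the following. This is only a sketch under stated assumptions: `backoffDelay`, `reconnect`, `baseMs`, `maxMs` and the injectable `sleep` are hypothetical names, `dial` stands for whatever async function the topology would use to open a connection, and only `maxReconnectAttempts` comes from the discussion.

```javascript
// Hypothetical helper: delay before the next reconnect attempt (attempt is 0-based).
// The delay doubles on every attempt and is capped at maxMs.
function backoffDelay (attempt, { baseMs = 1000, maxMs = 60000 } = {}) {
  return Math.min(maxMs, baseMs * 2 ** attempt)
}

// Hypothetical reconnect loop. `dial` is any async function that resolves
// with a connection or rejects; it is NOT a specific libp2p API.
async function reconnect (dial, { maxReconnectAttempts = 5, baseMs = 1000, maxMs = 60000, sleep } = {}) {
  const wait = sleep || (ms => new Promise(resolve => setTimeout(resolve, ms)))
  let lastError = new Error('no reconnect attempts allowed')

  for (let attempt = 0; attempt < maxReconnectAttempts; attempt++) {
    try {
      return await dial()
    } catch (err) {
      lastError = err
      // Only wait if another attempt will follow
      if (attempt < maxReconnectAttempts - 1) {
        await wait(backoffDelay(attempt, { baseMs, maxMs }))
      }
    }
  }

  throw lastError
}
```

With the defaults assumed here, the waits between attempts would be 1s, 2s, 4s and 8s before giving up on the fifth failure.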
If an inbound connection request is received and the current number of connections is already the MAX_VALUE, the inbound connection should be refused.
The lifecycle of a libp2p node would be the following:
minPeers connections to those peers (temporary nodes like bootstrap nodes should not count)
Please note that the first 2 steps can be skipped (or reduced) if the node had previously been running and already has peers stored in the PeerStore.
The global share of connections can be set in the libp2p connectionManager configuration. Libp2p should have sane defaults (which should evolve with the libp2p configuration effort, where we aim to provide ready to go libp2p configs for several scenarios/runtimes).
```js
const Libp2p = require('libp2p')

const libp2p = await Libp2p.create({
  // ...
  connectionManager: {
    maxConnections: 60,
    minConnections: 0,
    // TODO: Consider a number of connections that can only be used for libp2p core operations, like connect to rendezvous points, star servers, relays, ...
    // ... per https://github.com/libp2p/js-libp2p/blob/master/doc/CONFIGURATION.md#configuring-connection-manager
  },
  config: {
    pubsub: {
      // ... https://github.com/libp2p/js-libp2p/blob/master/doc/CONFIGURATION.md#customizing-pubsub
      topology: {
        min: 10,
        max: 30
      }
    },
    // core topologies configuration
  }
})
```
Libp2p core connectivity, such as connections to rendezvous points and to other peers used for listening purposes, should be protected by the relevant components / subsystems (Relay Listener, Rendezvous client).
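```ts
// Proposed shape of the connection manager (declaration only, no implementation).
// `Connection` is the libp2p connection type.
declare class ConnectionManager {
  // Open connections, keyed by peer id string
  connections: Map<string, Connection[]>
  // Tags, keyed by peer id string
  tags: Map<string, string>
  // Number of slots requested so far by the topologies
  requestedConnections: number

  constructor (options: { max: number, min: number })

  requestConnectionSlots (amount: number): void
  protect (idStr: string): void
  // TODO: think better about release resources, timings...
  requestBurstConnections (amount: number): boolean
}
```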
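To make the slot and burst idea more concrete, here is a minimal sketch of how the accounting behind `requestConnectionSlots` and `requestBurstConnections` might work. The `SlotPool` class, `releaseConnectionSlots`, and the policy that burst requests fail softly while slot requests throw are all assumptions for illustration, not part of the proposal.

```javascript
// Hypothetical slot accounting for the proposed ConnectionManager.
class SlotPool {
  constructor ({ max }) {
    this.max = max
    this.reserved = 0 // slots promised to topologies
    this.burst = 0 // slots temporarily lent out on top of reservations
  }

  // Reserve `amount` slots for a topology; throws if the pool is exhausted.
  requestConnectionSlots (amount) {
    if (this.reserved + this.burst + amount > this.max) {
      throw new Error('not enough connection slots available')
    }
    this.reserved += amount
  }

  // Assumed counterpart, e.g. called when a topology is unregistered.
  releaseConnectionSlots (amount) {
    this.reserved = Math.max(0, this.reserved - amount)
  }

  // Borrow currently unused slots; returns false instead of throwing.
  requestBurstConnections (amount) {
    if (this.reserved + this.burst + amount > this.max) {
      return false
    }
    this.burst += amount
    return true
  }
}
```

For example, with `max: 60` and 30 slots reserved by pubsub, a first burst request for 20 would succeed and a second would be refused until some slots are released.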
Connection Manager is responsible for:
When discovering peers, the context that resulted in the peer being discovered might be important for scoring and for configuring libp2p topologies.
```js
{
  peerId,
  multiaddrs,
  metadata: {
    context: Discovery.tag
    // Other important metadata
  }
}
```
This context will be useful for setting up decaying tags for bootstrap nodes for example.
Registrar should mediate the interactions between the topologies and the connection manager.
In the beginning, it should request slots from the connectionManager for the requirements of each topology (maximum and minimum).
It should tag connections used by the topologies to provide visibility to the connection manager for the reactive management of connections.
A topology will need to:
onConnect should return information for this
Libp2p protocols like Pubsub, DHT or application level protocols can create their own topology. When a topology is created, a min and max number of peers can be configured.
```js
const MulticodecTopology = require('libp2p-interfaces/src/topology/multicodec-topology')

// ...
const topology = new MulticodecTopology({
  min: 10,
  max: 30,
  multicodecs: [this.protocol],
  handlers: {
    onConnect: this._onPeerConnected,
    onDisconnect: this._onPeerDisconnected
  }
})

this._registrarId = await this._libp2p.registrar.register(topology)
```
Libp2p will have to deal with less structured topologies, such as Bootstrap nodes. These modules should create topologies in their context and needed use case.
```js
const MetadataTopology = require('libp2p-interfaces/src/topology/metadata-topology')

// ...
const topology = new MetadataTopology({
  min: 10,
  max: 30,
  metadata: [this.metadata],
  handlers: {
    onConnect: this._onPeerConnected,
    onDisconnect: this._onPeerDisconnected
  }
})

this._registrarId = await this._libp2p.registrar.register(topology)

// Unregister when not needed
this._libp2p.registrar.unregister(this._registrarId)
```
Modules like bootstrap should decide to register and unregister according to the PeerStore content and connection tags.
On startup, bootstrap metadata topology should kick in and connect to the bootstrap nodes. Once connections are established, these nodes should be protected while they are important. A decaying tag should be added to them.
Once tags are dropped and the system minimum number of connections is reached, these connections can start to be dropped.
Once all bootstrap connections are dropped the bootstrap metadata topology is unregistered. It can still probably listen for a connection manager event of low number of connections to act and restart?
Removing the kept space for these connections will allow other subsystems to burst with the released resources.
On subsequent starts, the bootstrap should only kick in if not enough peers exist after a given period of time.
TBD
TBD
TBD
Probably there is no need for a metadata topology abstraction layer at this point, and bootstrap can handle itself.
The connection manager would be responsible for distributing tokens for each topology. Dialer would be wrapped inside the connectionManager context and would require a token to be used.
Challenges:
References:
related: https://github.com/libp2p/js-libp2p/issues/426 & https://github.com/ipfs/helia/issues/182 & https://filecoinproject.slack.com/archives/C03K82MU486/p1689794990432059
It doesn't seem like connection-manager / auto-dial has any backoff capabilities currently. I did a quick search and only found backoff functionality in pubsub & pubsub-gossipsub: https://github.com/search?q=repo%3Alibp2p%2Fjs-libp2p%20backoff&type=code
For browser libp2p functionality to work consistently without relying on a specific backend that supports our desired transports (see universal-connectivity needing a specific backend node) we need to optimize auto-dialing and connection attempts.
As discussed in the Open Maintainers call 29-08-23, the scope of this issue is very broad and the connection manager has changed substantially since this was created. There were some valuable suggestions which have been referenced in other issues, namely:
Closing this as this has been broken down into more granular issues.
Connection Manager Overhaul
This Issue is an EPIC to track the work related to the Connection Manager Overhaul. Each milestone context and initial thoughts are described next.
Background
As we land new features like auto-relay and rendezvous as part of improving connectivity and discoverability in libp2p (libp2p/js-libp2p#703), the connection manager overhaul becomes an important work stream to guarantee these protocols work as expected. In addition, this work will be important for some already implemented features/protocols like webrtc-star and bootstrap. Finally, this work is really important to enable the DHT work.
This overhaul should be an initial step towards the future ConnMgr v2.
Milestones Overview
0) Documentation - Baseline
1) Watermarks Observation - Proactive Dial
2) Keep Alive
3) Protect Connections - Connection Tags
4) Protect Connections - Decaying Tags
5) Watermarks Observation - Trimming
6) Connection Gater
7) Dial retry
8) Disconnect message
These milestones do not need to be worked on in the displayed sequence. For instance, Connection tags, Connection Gater and Keep Alive can be isolated and implemented.
Context
The Connection manager is responsible for managing all the connections a peer has over time. It allows users to enforce an upper bound on the total number of open connections. To avoid possible service disruptions, connections can be tagged with metadata and optionally "protected" to guarantee that essential connections are kept alive.
0) Documentation - Connection flows
Create a DISCOVERABILITY_AND_CONNECTIVITY.md document as a follow-up to the GETTING_STARTED document. After getting up to speed with how to configure and start libp2p in the getting started document, readers should move on to how to set up their peer/network according to their use case/environment, so that peers can be discovered and connections with them established.
This will be divided in two categories:
1) Watermarks observation
Proactive dial
The connection manager proactively dials known peers, in order to have a meaningful set of connections to enable a node to work as expected, according to each use case/environment.
We have been relying on the connection manager low watermark, so that the peer keeps a reasonable number of arbitrary connections. Once we introduce protected connections, as well as tagging important peers, the proactive dial strategy can be modified to keep trying to dial more meaningful peers.
Proactive dial strategies
The following dial strategies should exist:
n to them. If peers from the previous search are no longer our closest peers, we should untag those connections, or just let decaying tags handle this.
The above dial strategies should have sane defaults, but also support being overridden. We should have an interval to double check that we are connected to the most meaningful peers, as well as proactively dial on some events like peer discovery/disconnect.
TODO: different strategy for Startup/Persistence?
Subsystems should be able to ask the connection manager for a slice of the connection pool. A connection that belongs in my gossipsub mesh should probably be protected.
Trim Connections
The connection manager trims less useful connections to be below a high watermark number.
2) Keep Alive
Currently, if a connection does not have anything going on for a while, it will time out and close. Libp2p should guarantee that specific connections are kept alive. This is important for staying connected to peers that matter to us, at either the infrastructure or the application layer. Remote listening (webrtc-star, relay, etc) is really important in this context.
Keep Alive should be used for protected peers via the API (Milestone 3) and Peers provided in the configuration.
In most cases, a ping on the connection should be enough, but this needs to be tested for each transport.
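A periodic ping over the protected and configured peers could be sketched as follows. Everything here is an assumption for illustration: `startKeepAlive`, `getPeers` and the injected `ping` function are hypothetical, and the 30 second default interval is arbitrary. In a real node `ping` would wrap whatever ping mechanism the transport turns out to need.

```javascript
// Hypothetical keep-alive loop. `getPeers` returns the peers to keep alive
// (protected peers plus peers from the configuration) and `ping` is any
// function that pings one peer and may return a promise.
function startKeepAlive (getPeers, ping, intervalMs = 30000) {
  // Ping every peer once; a failing ping never stops the others
  const pingAll = () => Promise.allSettled(
    [...getPeers()].map(async peerId => ping(peerId))
  )

  const timer = setInterval(pingAll, intervalMs)
  // In Node.js, do not keep the process alive just for keep-alive pings
  if (typeof timer.unref === 'function') {
    timer.unref()
  }

  return {
    pingAll,
    stop: () => clearInterval(timer)
  }
}
```

Reading the peer list through a function rather than a fixed array means peers that get protected or unprotected later are picked up on the next interval.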
3) Protect important connections
ConnManager tracks connections to peers, and allows consumers to associate metadata with each peer. This enables connections to be trimmed based on implementation-defined metadata per peer.
To see: https://github.com/libp2p/js-libp2p/issues/369
Connection tags
API
(based on go interface: https://github.com/libp2p/go-libp2p-core/blob/master/connmgr/manager.go)
tagPeer (peerId: PeerId, tag: string, weight: number) : void
untagPeer (peerId: PeerId, tag: string) : void
getTagInfo (peerId: PeerId) : TagInfo
protect (peerId: PeerId, tag: string)
unProtect (peerId: PeerId, tag: string)
isProtected (peerId: PeerId, tag: string)
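As a rough illustration of the API above, the tags and protections could be kept in two in-memory maps. This is a sketch, not the js-libp2p implementation: the `PeerTags` class name, the `{ value, tags }` shape returned by `getTagInfo`, the boolean returned by `unProtect`, and the behaviour of `isProtected` when no tag is passed are all assumptions. Peer ids are simply stringified to serve as map keys.

```javascript
// Hypothetical in-memory store backing the tag/protect API.
class PeerTags {
  constructor () {
    this.tags = new Map() // peer id string -> Map<tag, weight>
    this.protections = new Map() // peer id string -> Set<tag>
  }

  tagPeer (peerId, tag, weight) {
    const id = String(peerId)
    if (!this.tags.has(id)) this.tags.set(id, new Map())
    this.tags.get(id).set(tag, weight)
  }

  untagPeer (peerId, tag) {
    const peerTags = this.tags.get(String(peerId))
    if (peerTags) peerTags.delete(tag)
  }

  // Assumed TagInfo shape: total weight plus the individual tags
  getTagInfo (peerId) {
    const peerTags = this.tags.get(String(peerId)) || new Map()
    let value = 0
    for (const weight of peerTags.values()) value += weight
    return { value, tags: Object.fromEntries(peerTags) }
  }

  protect (peerId, tag) {
    const id = String(peerId)
    if (!this.protections.has(id)) this.protections.set(id, new Set())
    this.protections.get(id).add(tag)
  }

  // Returns whether the peer is still protected by some other tag
  unProtect (peerId, tag) {
    const set = this.protections.get(String(peerId))
    if (!set) return false
    set.delete(tag)
    return set.size > 0
  }

  // Without a tag, reports whether any protection exists for the peer
  isProtected (peerId, tag) {
    const set = this.protections.get(String(peerId))
    if (!set) return false
    return tag === undefined ? set.size > 0 : set.has(tag)
  }
}
```

Keeping protections per tag lets several subsystems protect the same peer independently, so one subsystem releasing its protection does not expose a connection another still needs.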
Data structures
Integration with Trim connections
Connection tags will allow the trimming to become more intelligent at this stage. Peers should be iterated and the weight of the tags should be used as the first criterion.
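A minimal sketch of tag-weight-based trimming follows. The `selectPeersToTrim` function and the `{ id, value, protected, openedAt }` input shape are hypothetical, one entry per peer is a simplification, and the tie-break (closing the most recently opened connection first among equal weights) is an assumption, since the thread does not settle the direction of the timestamp criterion.

```javascript
// Hypothetical selection of peers to disconnect when above the high watermark.
// peers: [{ id, value, protected, openedAt }] where `value` is the summed tag weight.
function selectPeersToTrim (peers, highWatermark) {
  const excess = peers.length - highWatermark
  if (excess <= 0) {
    return []
  }

  return peers
    .filter(peer => !peer.protected) // protected peers are never trimmed
    .sort((a, b) =>
      a.value - b.value || // lowest tag weight first
      b.openedAt - a.openedAt // assumption: newest first on ties
    )
    .slice(0, excess)
    .map(peer => peer.id)
}
```

If most peers are protected, fewer than `excess` peers are returned, so the node may stay above the high watermark. That matches the earlier remark that too many protected peers likely indicates a bug or misuse.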
4) Decaying tags
Note: Inspired by go-libp2p https://github.com/libp2p/go-libp2p-core/blob/master/connmgr/decay.go
A decaying tag is one whose value automatically decays over time. The decay behaviour is encapsulated in a user-provided decaying function (DecayFn). The function is called on every tick (determined by the interval parameter), and returns either the new value of the tag, or whether it should be erased altogether.
We do not set values on a decaying function, but "bump" decaying tags by a delta value. This calls the BumpFn with the old value and the delta, to determine the new value.
While users should be able to provide their own functions, we should provide some preset functions to be used. Behaviours that are straightforward to implement include:
This is particularly important for scenarios like the Bootstrap discovery. When it starts, these connections are really important to get to know other peers. But as time passes and new connections exist, peers should disconnect from the bootstrap nodes.
API
setDecayingTag(tag: string, interval: time, decayFn: function, bumpFn: function)
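The decay and bump mechanics could be sketched like this. It is loosely modelled on the go-libp2p presets mentioned above, but the JavaScript names (`DecayingTag`, `decayFixed`, `bumpSumBounded`), the `{ value, remove }` return shape of the decay function, and the explicit `tick`/`start`/`stop` methods are assumptions made for this example.

```javascript
// Hypothetical decaying tag: one value per peer, decayed on every tick.
class DecayingTag {
  constructor ({ name, interval, decayFn, bumpFn }) {
    this.name = name
    this.interval = interval // tick period in ms
    this.decayFn = decayFn // (value) => ({ value, remove })
    this.bumpFn = bumpFn // (oldValue, delta) => newValue
    this.values = new Map() // peer id string -> current value
    this.timer = null
  }

  // Values are never set directly, only bumped by a delta
  bump (peerId, delta) {
    const old = this.values.get(peerId) || 0
    this.values.set(peerId, this.bumpFn(old, delta))
  }

  // Apply the decay function once to every tracked peer
  tick () {
    for (const [peerId, current] of this.values) {
      const { value, remove } = this.decayFn(current)
      if (remove) this.values.delete(peerId)
      else this.values.set(peerId, value)
    }
  }

  start () {
    this.timer = setInterval(() => this.tick(), this.interval)
    if (typeof this.timer.unref === 'function') this.timer.unref()
  }

  stop () {
    clearInterval(this.timer)
  }
}

// Preset: subtract a fixed amount per tick, erase the tag at zero or below
const decayFixed = minuend => value => {
  const next = value - minuend
  return { value: next, remove: next <= 0 }
}

// Preset: add the delta, keeping the result within [min, max]
const bumpSumBounded = (min, max) => (value, delta) =>
  Math.min(max, Math.max(min, value + delta))
```

For the bootstrap scenario, a tag created with `decayFixed(5)` and bumped to 20 on connect would reach zero after four ticks, at which point the bootstrap connection stops carrying any weight and becomes eligible for trimming.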
5) Connection Gater
TODO: https://github.com/libp2p/go-libp2p-core/blob/master/connmgr/gater.go
Related: #175
6) Connection Retry
Retry a dial if it fails on a first attempt.
7) Disconnect
Sometimes there will be flows where peer A wants to disconnect from peer B because it has a lot of connections, all of them more important than the connection with peer B. However, peer B wants to stay connected to peer A. A message should be exchanged so that peer B understands that it should not retry the connection (for a given time?), possibly followed by a peer exchange. This needs to be spec'ed. Initial discussion at https://github.com/libp2p/go-libp2p/issues/238
Notes
References