Improved Connection Manager

Stebalien commented 6 years ago

The current connection manager is keeping our nodes from running out of memory but is breaking everything else pretty badly. Unfortunately, it's rather tricky to correctly use the current interface.

I'd like to do the following:

Introduce some constants to make relative weights clear. That is, predefine const IMPORTANT = 100 etc.
Introduce stream tracking. By default, any open stream on a connection would keep the connection open. However, we'd provide a way to deactivate/activate streams. This way, the DHT can demote its streams and user applications won't get their connections cut when they're actually using them. Once we get multistream 2.0, we can probably move to a "open a stream if you're going to use it and close it when you're done with it" model but we're not there yet.
Introduce some concept of decaying weight. Each time we talk to the peer, we'd "refresh" the weight. Bitswap would likely do this when sending blocks to a peer.

Anyone have bandwidth for this? It'll take a bit of work to get it right but it should help immensely.

raulk commented 6 years ago

The connection manager needs a lot of love. I suspect it’s a strong culprit in some of the erratic behaviour we see. Indeed we need better semantics and feedback loops between the connection manager and the streams.

This is a P1 for me this Q4, and I’ve been evaluating different approaches. My conclusions so far:

Streams should be able to lock the connection for a specific time window (with maximum bounds). They’d register a callback function that gets notified when the lock elapses and the conn manager is about to shut the connection. They can extend the lock (with some limit), or take a compensating action, e.g. initiate a connection with another bitswap peer.
The scoring/weighting system is too simplistic. Instead of the current read-write tag system, my current thinking involves protocols being able to attach private structs that implement a Score() int64 interface. The structs can track latencies, error rates, usefulness, etc. and other protocol-specific stuff. The Score() function returns a score 0-1000 applying whatever formula makes sense for the protocol. When the connection manager starts a round of pruning, it calls Score() on all attached structs and picks the lowest scoring ones. This pull mechanism is more efficient, cleaner and contextual than the current one.

I’ll try to put together an interface with these ideas and yours in the next day, if that works.

P.S.: Written on mobile, so excuse any sloppiness.

raulk commented 6 years ago

The concept of decaying would be part of point 2. It’s going to look different for each protocol, so I’d stay away of modelling this on the connection manager, and instead lean towards protocols tracking peer usefulness and liveliness in the scoring struct, and using that value in the scoring function.

Stebalien commented 6 years ago

This is a P1 for me this Q4

Awesome!

Streams should be able to lock the connection for a specific time window (with maximum bounds). They’d register a callback function that gets notified when the lock elapses and the conn manager is about to shut the connection. They can extend the lock (with some limit), or take a compensating action, e.g. initiate a connection with another bitswap peer.

Really, bitswap and friends should be setting write deadlines. What was your use-case for the time lock?

A "may I delete this" callback would be awesome for the DHT. That is, the DHT will be able to stop the connection manager from completely emptying a routing bucket.

Instead of the current read-write tag system, my current thinking involves protocols being able to attach private structs that implement a Score() int64 interface.

To connections or streams?

So, I started this off the wrong way. Let's talk use-cases:

DHT

Currently, the DHT leaves streams open (because negotiating a stream is expensive compared to a single request). With multistream 2.0 (or any sane multistream replacement), this should be pretty much free.

So, assuming we open a new stream for every request:

An open stream means we care about the connection.
We need some way to mark peers in our routing table.
Ideally, we'd have some way to prevent the routing table from being thinned too much (this is where I think the idea of a callback comes in).

Ping

Ping only cares about a connection when a stream is open.

Identify

Really, identify doesn't care at all. Identify would want a way to create a stream and immediately release the lock.

Relay

Relay start/stop will want to lock the connection open. However, relay hop will need some metric (based on the number of connections being relayed). Ideally, as you say, this would be computed on-demand.

Bitswap

So, we this is a case where we might want to open a stream to track the session. On the other hand, I believe the original vision for bitswap had it keep sessions open even through disconnects...

Otherwise,

An open stream means keep the connection open (well, not now because we leave them open but in an ideal world where we didn't).
After receiving a request, bitswap needs some way to temporarily "lock" the connection (if we actually have the block in question). Honestly, this might best be served with an actual lock handle (something that can be closed).
After fulfilling a request, bitswap would, ideally, be able to mark the connection as "likely to be used in the near future". This is where my decay idea came in. The more time passes, the less useful the connection (well, peer, not connection) is.

ipfs p2p

An open stream locks the connection open.
Really, we want to set the priority to max.

pubsub

This depends on the protocol but, e.g., in gossipsub, we want to lock open the connections to our grafted peers and try to keep open connections to gossip peers.

From this, I agree with your "associate a custom struct" approach. Really, I'd like to be able to associate a struct with a peer, a connection, or a stream.

Actually, the custom struct approach is probably sufficient. That is, if we provide some way to access/modify these scorers and provide some sentinel priority that locks the connection open, we should be able to implement everything without having to put too much logic into the connection manager itself.

For example, take the following reflection happy version:

type StreamScorer interface {
    ScoreStream(c Stream) int
}

type StreamScorer interface {
    ScoreConn(c Conn) int
}

type PeerScorer interface {
    ScoreConn(p peer.ID) int
}

type ConnectionManager interface {
    Scorer

    // ManageConn takes a connection to manage and a callback to update
    // a scorer associated with the connection.
    //
    // fn is a *function* of the form `func(*MyScorer) { ... }`
    //
    // While called, it holds an exclusive lock on the scorer and can
    // manipulate it freely.
    //
    // If a scorer of the appropriate type has not yet been associated with
    // the connection, a new one with a zero value will be created.
    ManageConn(c Conn, fn interface{})
    ManageStream(s Stream, fn interface{})
    ManagePeer(p peer.ID, fn interface{})
}

Internally, this would keep a map of ScorerType -> Scorer per connection, stream, peer. Technically, we could go full recursion and associate a scorer-scorer with the network, have it internally track per-conn scorers that track per stream scorers all the way down... but let's not.

Whatever, that's probably going too far off into reflection land. However, this does show that the you're right, the pull approach is probably the right one.

vyzo commented 6 years ago

We also have to consider the kind of complexity we are introducing: time locking. callbacks, oh my. Let's not make it impossible to program with sanely.

Can't we make it work with just weights? These are nice and simple.

Stebalien commented 6 years ago

The problem with just weights is that we have to update them eagerly. This is currently a real pain to manage properly.

I agree we want to keep it as simple as possible (at the core at least). Registering stream "scorers" should keep it reasonably simple while making it flexible.

Note: There are definitely simpler mechanisms than my (insane) reflection proposal. A simple alternative is to:

Register "scorers" that apply to all streams.
Provide a way to associate metadata with streams, connections, peers.

That is,

type ConnManager interface {
   RegisterStreamScorer(StreamScorer)
   // ...
}

type Stream interface {
    // ...
    SetMeta(key, value)
}

Really, we don't even have to provide a way to associate metadata with streams. It's just easier than having everyone independently track streams using open/close events.

Stebalien commented 6 years ago

The problem with just weights is that we have to update them eagerly. This is currently a real pain to manage properly.

To make this less of a blank assertion, I ran into two cases where the current connection manager interface frustrated me to the point where I just gave up:

In the DHT, I only marginally care about peers in the routing table until we get to some critical lower-bound. At that point, I really start to care about these nodes. A pull system makes it possible to re-adjust weights on the fly based on how many connections have been cut.
In bitswap, my expectation that a peer will want something from me decays over time after the last block I sent them. We could implement some eager service in bitswap to update these decaying weights but that's significantly more complicated than just giving the connection manager some function/scorer that calculates these on demand.

@vyzo but really, thank you for pushing for the simple solution and please continue to do that whenever I try to float off into the meta-programming abstraction ether.

thomas92911 commented 6 years ago

Suggestion, Connections should be divided into two categories

One type is the connection managed by the existing conn-manager.

The other type is a user-defined connection, which should have the following properties.

Can stay connected for a long time
Automatically reconnect after disconnected
Persistent storage connection information, connect at startup

Business applications typically have their own stable nodes, so it's best to have a way to maintain a stable connection between these nodes, and existing connection manager-managed connections can help these fixed nodes provide services.

Stebalien commented 5 years ago

@thomas92911 that's the idea behind tagging. You can tag a connection with a really high weight to keep it (although we should probably add a "don't close ever" tag).

thomas92911 commented 5 years ago

@thomas92911 that's the idea behind tagging. You can tag a connection with a really high weight to keep it (although we should probably add a "don't close ever" tag).

that's great.

anacrolix commented 5 years ago

Consider this algorithm. It reduces the problem to the desired goal: What connection to evict when we have to evict. I still think that "scoring" connections is adding extra state and concepts that are superfluous to this goal. I made the algorithm a gist in case people want to pick out issues or concerns. Some advantages of a stateless algorithm include: significantly less complexity, and significantly more testability. It's also much easier to manipulate the parameters that make the decisions.

libp2p / go-libp2p

Improved Connection Manager #475