POC started here: https://github.com/skip-mev/cometbft/tree/ws/val-p2p
It turns out that creating a separate p2p package isn't necessary. For now, this branch contains a new reactor, inspired by the PEX reactor, but instead of doing PEX it floods peer addresses through the network, so that every newly added peer is broadcast to all existing peers. This should create a complete network.
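Roughly, the flooding logic looks like the following minimal sketch (the `Peer` and `Switch` types here are simplified stand-ins for illustration, not the actual cometbft p2p API, and the reactor in the branch differs in detail):

```go
package completepeering

// NOTE: Peer and Switch are simplified stand-ins; the real reactor is built
// on cometbft's p2p.Reactor interface and Switch, whose exact signatures
// differ between CometBFT versions.
type Peer interface {
	ID() string
	Addr() string             // dialable address advertised by the peer
	Send(addrs []string) bool // send a batch of peer addresses
}

type Switch interface {
	Peers() []Peer
	DialPeerAddr(addr string) error
}

// CompletePeeringReactor floods every newly added peer's address to all
// existing peers, so that every node transitively learns about every other
// node and can dial it, which should yield a complete graph.
type CompletePeeringReactor struct {
	sw    Switch
	known map[string]bool // addresses we have already seen or dialed
}

// AddPeer is called when a new peer connection is established.
func (r *CompletePeeringReactor) AddPeer(p Peer) {
	r.known[p.Addr()] = true
	for _, existing := range r.sw.Peers() {
		if existing.ID() != p.ID() {
			existing.Send([]string{p.Addr()}) // flood the new address
		}
	}
}

// Receive handles addresses flooded by other peers, dialing any we do not
// already know about.
func (r *CompletePeeringReactor) Receive(from Peer, addrs []string) {
	for _, addr := range addrs {
		if !r.known[addr] {
			r.known[addr] = true
			_ = r.sw.DialPeerAddr(addr) // best-effort; error handling elided
		}
	}
}
```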
For now, this reactor is just added to the normal switch. Since we want to isolate this functionality to consensus messages, my next step is to see if we can create a separate switch that uses just the consensus and complete_peering reactors to create a complete network that only does consensus, while the existing switch remains incomplete and does all of the other messages.
Using a separate switch doesn't seem to work well: the two switches interfere and try to make duplicate connections. Removing the duplicate TCP connection filter doesn't help.
So now I'm thinking about how to make this work on the p2p code level.
What we want is for the network's subgraph of validators to be complete (i.e. all validators are connected to each other; they may optionally also be connected to full nodes, with no requirements on those connections).
This means that, from a validator's perspective, it needs a mechanism to discover and connect to all other validators without simply connecting to every node in the network. So we need to update the node connection process so that nodes identify themselves as validators when they connect to another node. If the receiving node is a validator, it should use the CompletePeering reactor to broadcast the new validator's information to the other validators, who will then connect to it.
This means
The result of this will be
Items 1, 2, and 3 above are implemented now. I added some full nodes to the e2e test, and it appears that the validators are all connected to each other, and the full nodes are not gossiped by the complete peering reactor.
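Continuing the simplified sketch above, the gate that keeps full nodes out of the validator gossip might look roughly like this (the `IsValidator` field is a hypothetical addition to the node info exchanged on connect, not an existing cometbft field):

```go
// Hypothetical shape of the node info exchanged during the p2p handshake,
// extended with a validator flag (illustrative only).
type NodeInfo struct {
	ID          string
	ListenAddr  string
	IsValidator bool // set by validators so peers can treat them differently
}

// shouldGossipValidator decides whether a newly connected peer should be
// announced on the validator overlay: only if we are a validator ourselves
// and the peer advertised itself as one. In AddPeer, the flood above would
// be guarded by this check, so full nodes are never gossiped or dialed.
func shouldGossipValidator(weAreValidator bool, info NodeInfo) bool {
	return weAreValidator && info.IsValidator
}
```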
One other thing we should do is to make sure there is some kind of persistence. If the connection between one pair of validators breaks, there isn't actually a mechanism to have them attempt to reconnect. I'm not sure if that's handled at a lower p2p level or not.
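If it turns out this is not handled at a lower level, one simple option would be a supervised redial with backoff for validator peers, e.g. (again only a sketch on the simplified types above):

```go
import "time"

// redialValidator keeps trying to re-establish a broken validator connection
// with exponential backoff (capped), in case the lower p2p layer does not
// already redial such peers on its own.
func (r *CompletePeeringReactor) redialValidator(addr string) {
	backoff := time.Second
	for attempt := 0; attempt < 20; attempt++ {
		if err := r.sw.DialPeerAddr(addr); err == nil {
			return // connection re-established
		}
		time.Sleep(backoff)
		if backoff < time.Minute {
			backoff *= 2
		}
	}
}
```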
We discussed today that we don't necessarily want the validator subgraph to be complete. For large networks, a complete graph would mean a very large peer set, which would be prohibitively expensive in terms of network costs.
It would be nice instead to have a sparser overlay (e.g. a spanning subgraph) where the distance between any two validators has some known upper bound. Finding a p2p algorithm to do this might be involved and require some research.
Perhaps a simpler direction, for now, is to just limit the number of validator-network peers. Note that the POC already respects the MaxNumInboundPeers constraint, so it will not accept inbound connections, even from validators, once it has reached that limit. However, this means that non-validator nodes compete with validators for a node's limited peer capacity. We probably want a lower bound on validator peers for validators (i.e. reserve some peer slots for validators, or evict non-validator peers when validators make inbound connections), and also an upper bound (stop connecting to new validators after we've reached a certain number).
This last part may require validators to tell each other how many validator peers they have, and then, when rejecting an inbound connection, to suggest an alternative validator that is still below the upper bound on validator peers. I'm unsure whether this part is necessary or sufficient, so I will think about it more.
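As a concrete illustration of the reserved-slot idea (field names and the eviction alternative are hypothetical, not existing config options):

```go
// Illustrative admission policy for inbound connections: a number of peer
// slots is reserved for validators, and there is a separate cap on how many
// validator peers we keep.
type PeerLimits struct {
	MaxPeers               int // overall inbound cap (cf. MaxNumInboundPeers)
	ReservedValidatorSlots int // slots that full nodes may never occupy
	MaxValidatorPeers      int // upper bound on validator peers
}

func (l PeerLimits) acceptInbound(curTotal, curValidators int, isValidator bool) bool {
	if isValidator {
		// Accept validators until the validator-specific cap is hit; an
		// alternative is to evict a full-node peer here instead of refusing.
		return curValidators < l.MaxValidatorPeers && curTotal < l.MaxPeers
	}
	// Full nodes may not eat into the slots reserved for validators.
	return curTotal < l.MaxPeers-l.ReservedValidatorSlots
}
```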
The result of this is that each validator is guaranteed to be connected to some subset of the other validators. This yields an overlay graph over the validators, but it does not give any obvious guarantees about the diameter of that graph (or even, as noted below, its connectivity). We can probably do some analysis on the bound on the number of validator peers relative to the total number of validators in order to determine the diameter. For example, if every validator has at least (#validators/2)+1 validator peers, then the graph is guaranteed to be connected with a diameter of at most 2.
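A quick sketch of why that example works, with $n$ the number of validators: suppose every validator has at least $\lfloor n/2 \rfloor + 1$ validator peers, and take any two non-adjacent validators $u$ and $v$. Both neighbourhoods lie in $V \setminus \{u, v\}$, which has $n - 2$ elements, while $|N(u)| + |N(v)| \ge 2(\lfloor n/2 \rfloor + 1) \ge n + 1 > n - 2$, so $u$ and $v$ must share a common neighbour. Every pair of validators is therefore at distance at most 2, and in particular the graph is connected.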
(#validators/2)+1 is probably not an acceptable lower bound on validator peers though, so we need some more intelligent algorithms for connecting validators. Worth looking into: DHTs, random peer sampling, greedy diameter reduction, small-world networks, probabilistic routing.
We need to take extra care so that the validators do not end up split into multiple components that are not connected to each other by any validator-to-validator link, and so that newly connecting validators do not get bounced by validators that are at their limits and left in the dark.
These POC improvements below also address some of the issues above:
I think we may also have a startup problem with the current POC: suppose a validator is only connected to full nodes; then it will not be able to join the validator subgraph until another validator connects to it directly, since none of the full nodes participate in the validator complete-graph gossip.
Also, the POC currently uses a key check on startup to determine whether the node is a validator. This excludes some setups where validators are not actually exposed directly to the p2p network (e.g. sentry-node architectures).
ADR in progress here: https://github.com/cometbft/cometbft/pull/4462/files
## Protocol Change Proposal

### Summary
Our hypothesis is that one of the bottlenecks in CometBFT networks is the network topology. Specifically, validators need to exchange consensus messages with each other, but the topology does not guarantee that they are directly connected. Consensus messages may therefore travel through intermediate nodes before reaching a validator, increasing the latency to reach the message saturation required to proceed to the next consensus step. For many networks, this may be a bottleneck for lowering block times.
The proposal is to introduce a new validator-only network in which all validators are directly connected to each other and can therefore exchange consensus messages directly, taking the theoretical minimum time the network allows to reach consensus.
### Problem Definition
The existing CometBFT p2p network is agnostic to the distinction between full nodes and validators. Peers are added via config files and PEX, but via PEX a validator is just as likely to connect to a full node as to another validator. In most networks, validators are therefore most likely not directly connected to all other validators, so messages between validators must take multiple hops.
Some validators are therefore multiple hops away from messages they need to reach consensus: within the set of messages required to proceed past a given consensus step, some may pass through intermediate nodes on the way. This means the time it takes such a validator to proceed in consensus is longer than the network's theoretical lower bound, which would be achieved if every required message were only one hop away.
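As a rough model (ignoring processing time and assuming a uniform per-link latency $\ell$): if the slowest of the messages a validator needs for a given step arrives over $h$ hops, the validator can proceed no earlier than about $h \cdot \ell$ after that message was sent, versus $\ell$ when every required message is one hop away. For example, with $\ell = 50\,\text{ms}$ and $h = 3$, that step alone costs roughly $150\,\text{ms}$ instead of $50\,\text{ms}$.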
Our goal is to introduce a way for each validator to be only 1-hop away from every message it needs in order to reach consensus.
### Proposal
For our first attempt to solve this, we will introduce a new p2p network, separate from the existing one, that validators join on startup. It will be optionally authenticated (i.e., when authentication is enabled, validators must prove their identity in order to participate). This network should be complete, meaning every node has a direct connection to every other node, and it should only be used to communicate consensus messages.
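At a high level, the wiring on a validator might look roughly like the following sketch (simplified stand-in types rather than the real cometbft Switch and Reactor, whose signatures vary between versions; the authentication step is still to be specified):

```go
// Simplified stand-ins for this sketch; not the actual cometbft types.
type Reactor interface{}

type ConsensusSwitch interface {
	AddReactor(name string, r Reactor)
	Start() error
}

// startValidatorNetwork wires up a second, validator-only switch that carries
// only the consensus and complete-peering reactors. The existing switch keeps
// serving mempool, block sync, evidence, etc. for the general network.
func startValidatorNetwork(isValidator bool, valSwitch ConsensusSwitch,
	consensus, completePeering Reactor) error {

	if !isValidator {
		return nil // full nodes never join the validator-only network
	}
	valSwitch.AddReactor("CONSENSUS", consensus)
	valSwitch.AddReactor("COMPLETE_PEERING", completePeering)
	// When authentication is enabled, peers would have to prove ownership of
	// a validator key before being admitted to this switch.
	return valSwitch.Start()
}
```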
The steps to implement this are roughly: