tintinweb opened this issue 4 years ago (status: Open)
@tintinweb it looks like you got confused by the code.
The core idea of peer_pool.nim is to provide peers sorted by the Peer's score, so the peers with the highest score are acquired first.
Nim's HeapQueue depends entirely on how the < operation is defined for the type stored in it. PeerIndex values are what get stored in the HeapQueue, and the < operation for PeerIndex is defined here:
https://github.com/status-im/nim-beacon-chain/blob/master/beacon_chain/peer_pool.nim#L57-L60
proc `<`*(a, b: PeerIndex): bool =
  ## PeerIndex ``a`` holds reference to ``cmp()`` procedure which has captured
  ## PeerPool instance.
  a.cmp(b, a)
Every PeerIndex holds a reference to this cmp procedure:
https://github.com/status-im/nim-beacon-chain/blob/master/beacon_chain/peer_pool.nim#L29
PeerIndex = object
  data: int
  cmp: proc(a, b: PeerIndex): bool {.closure, gcsafe.}
When a PeerIndex object is created, it is assigned this custom cmp procedure:
https://github.com/status-im/nim-beacon-chain/blob/master/beacon_chain/peer_pool.nim#L333
https://github.com/status-im/nim-beacon-chain/blob/master/beacon_chain/peer_pool.nim#L625
https://github.com/status-im/nim-beacon-chain/blob/master/beacon_chain/peer_pool.nim#L641
https://github.com/status-im/nim-beacon-chain/blob/master/beacon_chain/peer_pool.nim#L657
So every PeerIndex object holds a reference to this comparison procedure, peerCmp:
https://github.com/status-im/nim-beacon-chain/blob/master/beacon_chain/peer_pool.nim#L186-L191
proc peerCmp(a, b: PeerIndex): bool {.closure, gcsafe.} =
  let p1 = res.storage[a.data].data
  let p2 = res.storage[b.data].data
  p1 < p2
res.cmp = peerCmp
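For illustration, here is a minimal, self-contained sketch of the same pattern (not the PeerPool code; Index and buildIndices are names I made up): lightweight index objects carry a closure that has captured the backing storage, so < on two indices can compare the values they refer to. For simplicity the arguments are passed in their natural order here, unlike the real proc above.
type
  Index = object
    data: int
    cmp: proc(a, b: Index): bool {.closure, gcsafe.}

proc `<`(a, b: Index): bool = a.cmp(a, b)

proc buildIndices(values: seq[int]): seq[Index] =
  var storage = values                     # stands in for PeerPool.storage
  let cmpProc = proc(a, b: Index): bool {.closure, gcsafe.} =
    storage[a.data] < storage[b.data]      # compare the referenced values
  for i in 0 ..< storage.len:
    result.add(Index(data: i, cmp: cmpProc))

let idx = buildIndices(@[30, 10, 20])
doAssert idx[1] < idx[0]                   # compares 10 < 30 via the closure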
As you can see, peerCmp actually compares the data field, which is of the generic type A; in our case it is instantiated with the type Peer, which is defined here: https://github.com/status-im/nim-beacon-chain/blob/master/beacon_chain/eth2_network.nim#L95-L105
Peer* = ref object
  network*: Eth2Node
  info*: PeerInfo
  discoveryId*: Eth2DiscoveryId
  connectionState*: ConnectionState
  protocolStates*: seq[RootRef]
  maxInactivityAllowed*: Duration
  netThroughput: AverageThroughput
  score*: int
  connections*: int
  disconnectedFut: Future[void]
And the appropriate comparison procedure for Peer is defined here: https://github.com/status-im/nim-beacon-chain/blob/master/beacon_chain/eth2_network.nim#L357-L365
proc `<`*(a, b: Peer): bool =
  ## Comparison function, which first checks peer's scores, and if the peers'
  ## score is equal it compares peers' network throughput.
  if a.score < b.score:
    true
  elif a.score == b.score:
    (a.netThroughput.average < b.netThroughput.average)
  else:
    false
So the HeapQueue sorts PeerIndex entries according to the score of the Peer each PeerIndex represents in the PeerPool's storage.
One more non-deterministic component was added while the code was under audit: we now also measure the network throughput achieved with each particular peer, and it is part of the peer comparison function as well.
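To make the mechanics concrete, here is a minimal, self-contained sketch (not the PeerPool code itself) showing how a custom < comparison makes Nim's min-heap HeapQueue hand out the highest-scored item first; the Item type is mine, and the real PeerIndex < appears to achieve the same reversal by swapping the arguments it passes to the captured cmp closure.
import std/heapqueue

type Item = object
  score: int

# Reversing the comparison makes the min-heap hand out the highest-scored
# item first.
proc `<`(a, b: Item): bool = a.score > b.score

var q = initHeapQueue[Item]()
for s in [10, 50, 30]:
  q.push(Item(score: s))

doAssert q.pop().score == 50   # highest score is acquired first
doAssert q.pop().score == 30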
You're right, thanks for the clarification @cheatfate 🙌 .
Would be great to get your thoughts on the other recommendations as well. The main idea is to harden the peer selection strategy and make it even less deterministic (the throughput component may already go some way towards that, even though it can potentially be influenced by a peer), so that high-score (malicious) nodes do not get a chance to dominate the selection, e.g. by manipulating their score with benign, valid, reputation-increasing requests, or by being only just cooperative. Adding a random sub-selection or a peer replacement strategy (even for peers whose score is good) could make this more resistant to attacks; a rough sketch of the random sub-selection idea follows the list below.
A peer should not be in control of the priority the node gives it in peer selection.
- Randomly select both from the initial peer list (which may be considered more trusted and contain static peers provided by the user) and from peers chosen based on their reputation.
- Continue searching for new peers every now and then, even when the pool is full.
- Implement a peer replacement strategy: select peers to be removed from the pool and replace them with fresh peers, to avoid getting stuck in a subgraph of the p2p network (e.g. remove lower-reputation and stale peers, or peers with a lower head_count (within range), first). At the moment peers only get removed when the connection is closed.
- Remove the means to control which queue (incoming/outgoing) peers are selected from, and instead return a random pick from the complete pool.
- Reserve part of the peer pool for peers that are at the same or a higher block height, to ensure that the node can get in sync with the network (scenario: the peer pool is full of peers that are behind; our own node will provide them with blocks but cannot catch up and get in sync on its own).
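As a rough, self-contained illustration of the random sub-selection idea above (the names pickFromTop and topN are hypothetical, not part of the codebase):
import std/[random, algorithm]

type Peer = object
  score: int

# Hedged sketch: instead of always taking the single best-scored peer,
# pick uniformly at random among the top N candidates so one high-scoring
# (possibly malicious) peer cannot monopolise selection.
proc pickFromTop(peers: seq[Peer], topN: Positive): Peer =
  doAssert peers.len > 0
  var ordered = peers
  ordered.sort(proc (a, b: Peer): int = cmp(b.score, a.score))  # descending
  ordered[rand(min(topN, ordered.len) - 1)]

randomize()
let candidates = @[Peer(score: 90), Peer(score: 80), Peer(score: 10)]
echo pickFromTop(candidates, 2).score   # prints 90 or 80, never 10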
First of all, I would like to say that during the audit we changed the synchronization strategy. The original version worked on the principle of using a single peer for as long as possible; the new version changes peers at every iteration (where an iteration is: request blocks from a peer, verify them, and store them).
Initial syncing strategy (the one that was audited):
let peer = peerPool.acquire()
while notInSync:
  let blocks = peer.getBeaconBlocksByRange()
  if not verify(blocks):
    peer.penalize()
  else:
    peer.reward()
  if peer.score < lowValue:
    break
peerPool.release(peer)
Current syncing strategy:
while notInSync:
  let peer = peerPool.acquire()
  let blocks = peer.getBeaconBlocksByRange()
  if not verify(blocks):
    peer.penalize()
  else:
    peer.reward()
  peerPool.release(peer)
Because of this change, the PeerPool gets sorted by peers' scores much faster.
First of all, peers in the network cannot be "more" or "less" trusted; every peer is "untrusted" until it starts responding, and as soon as a peer responds with non-empty responses we reward or penalize its score.
We shouldn't have favorites in a distributed network of peers, because while we are not in sync with the network we cannot check whether the favorites (an initial/static peer list) are following the main fork rather than a side fork.
So using a constantly changing source of peers gives us a better chance to sync faster.
We search for new peers all the time, even if the PeerPool is already full of established connections.
https://github.com/status-im/nim-eth/blob/master/eth/p2p/discoveryv5/protocol.nim#L731-L754
Our own discovery loop just obtains the current random snapshot of peers from the discovery service:
https://github.com/status-im/nimbus-eth2/blob/master/beacon_chain/eth2_network.nim#L804
Peers get removed from the PeerPool in two cases: 1) when the remote peer drops the connection (the connection is closed), and 2) when the remote peer's score reaches the lowest possible value (the peer gets kicked).
A similar approach is now used in our syncing strategy: we use different peers for every iteration of syncing. The number of workers used for syncing will always be lower than the number of peers in the pool, so there will always be some fresh peers to choose from. But replacing well-scored peers with "unknown" peers does not look like a good move.
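As a toy illustration of that invariant (syncWorkerCount and its formula are illustrative only, not the project's code):
# Hedged sketch: keep the number of sync workers strictly below the pool
# size so at least one "fresh" peer is always left to choose from.
proc syncWorkerCount(poolLen, maxWorkers: int): int =
  max(0, min(maxWorkers, poolLen - 1))

doAssert syncWorkerCount(10, 8) == 8
doAssert syncWorkerCount(3, 8) == 2   # always leaves at least one peer free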
When we handshake with a peer we exchange status information with it, and at this moment we obtain the peer's latest head information. Right now, if a peer is not part of the syncing process, its latest head information does not get refreshed (but we are going to change that soon), so there is some time lag between the peer's real status and the status we know about.
But we already handle peers with a low latest_head properly: https://github.com/status-im/nimbus-eth2/blob/devel/beacon_chain/sync_manager.nim#L759. So if a peer can't fix this lag it will soon be kicked out of the PeerPool.
I understand why we should have more outgoing connections than incoming ones. But if an incoming peer has a higher score than an outgoing peer (meaning the incoming peer's responses are considered more useful), why should we choose the outgoing peer whose score is lower?
Also, if an incoming peer responds to us with incorrect blocks, it will soon be removed from the PeerPool and the connection will be dropped.
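A minimal, self-contained sketch of that point (not the actual PeerPool code; acquireBest is a made-up name): compare the best peer of each queue by score and take whichever is higher, regardless of connection direction.
import std/heapqueue

type Peer = object
  score: int

proc `<`(a, b: Peer): bool = a.score > b.score   # best-scored peer sits at index 0

# Hedged sketch: take whichever queue currently offers the higher-scored
# peer, instead of preferring one connection direction over the other.
proc acquireBest(incQueue, outQueue: var HeapQueue[Peer]): Peer =
  if incQueue.len == 0:
    outQueue.pop()
  elif outQueue.len == 0:
    incQueue.pop()
  elif outQueue[0].score > incQueue[0].score:
    outQueue.pop()
  else:
    incQueue.pop()

var incQ = initHeapQueue[Peer]()
var outQ = initHeapQueue[Peer]()
incQ.push(Peer(score: 50))
outQ.push(Peer(score: 20))
doAssert acquireBest(incQ, outQ).score == 50   # the incoming peer wins on score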
Usually, until you start requesting blocks from a peer, you can't be sure that this peer is able to satisfy your requests. Also, malicious peers can easily falsify the status response, and the biggest problem is that we cannot verify a status response until we are in sync with the network.
So we could maintain such a reserved part of the pool only under the assumption that every peer which has merely responded with a status response is a "good" peer.
Description
peer_pool.GetItem appears to implement a peer selection strategy where peers that are added to the pool first have a higher priority (lower PeerIndex). The pool is partitioned into an incoming and an outgoing peer queue. The queue itself is a HeapQueue sorted by PeerIndex, with the lowest PeerIndex being item 0 - or the default (see References). Peers with a lower peer index will be utilized more than peers with a higher index. The selection is somewhat deterministic if the peers are known, and it can be influenced by an external peer.
A peer's score is not part of the peer selection criteria, which might result in a less trusted peer still being selected more often than a more trusted or fresh peer. The peer's sync status is also not part of the peer selection strategy, which may lead to increased round trips when asking for specific blocks.
The peer selection filter may not be useful at all: why choose an incoming over an outgoing peer when the selection could instead be based on random sampling from a pool of trusted, synced, non-stale nodes?
Exploit Scenario
A malicious peer might force itself into a higher-priority position (one of the first nodes added to the pool) and thereby influence how often the node interacts with it, taking advantage of this in an effort to isolate the node. In the - unlikely - worst case, a malicious incoming peer might become the default node returned by the peer selection even though it is less reliable or trusted (reputation).
Mitigation Recommendation
It is recommended to implement a less deterministic component in the overall peer selection strategy.
Note: Consider falling back to peer discovery mode instead of shutting down when no peer can be acquired. doAssert may be problematic for recoverable states, as it might be exploited to boot the node off the network (e.g. someone flooding the peer pool just to empty it when peer[0] is already acquired).
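A rough sketch of that note (tryAcquire is a hypothetical name, not the project's API): report an empty pool to the caller as a recoverable condition instead of asserting, so the node can fall back to discovery rather than shut down.
import std/[options, heapqueue]

type Peer = object
  score: int

proc `<`(a, b: Peer): bool = a.score > b.score   # highest score pops first

# Hedged sketch: an empty pool yields none(Peer) instead of hitting a fatal
# doAssert, leaving the fallback decision (e.g. resume discovery) to the caller.
proc tryAcquire(pool: var HeapQueue[Peer]): Option[Peer] =
  if pool.len == 0:
    none(Peer)
  else:
    some(pool.pop())

var pool = initHeapQueue[Peer]()
doAssert pool.tryAcquire().isNone    # caller can now fall back to peer discovery
pool.push(Peer(score: 42))
doAssert pool.tryAcquire().get().score == 42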
References
- GetItem implements the peer selection strategy. The strategy appears to be based on the PeerIndex, which represents when a peer was first added to the pool.
- 0 is returned … startListening and dialing the first peer (i.e. the discovery filling connQueue and connectWorker adding the peer connection to the pool).
- doAssert(PeerFlags.Acquired notin item[].flags) forcing the node to shut down.
- doAssert((len(pool.outQueue) > 0) or (len(pool.incQueue) > 0)) does not sufficiently verify that the respective queue is non-empty. https://github.com/status-im/nim-beacon-chain/blob/79ff4f7c41eeec0d94fc9b357bb128ee9e3eaf1b/beacon_chain/peer_pool.nim#L118-L145
- addPeer adds a new peer to the pool and is usually called by addIncomingPeer or addOutgoingPeer. https://github.com/status-im/nim-beacon-chain/blob/79ff4f7c41eeec0d94fc9b357bb128ee9e3eaf1b/beacon_chain/peer_pool.nim#L284-L296