status-im / nimbus-eth2

Nim implementation of the Ethereum Beacon Chain
https://nimbus.guide

[SEC] PeerPool - add non-deterministic component to peer selection #1525

Open tintinweb opened 4 years ago

tintinweb commented 4 years ago

Description

peer_pool.GetItem appears to implement a peer selection strategy in which peers added to the pool first have a higher priority (a lower PeerIndex). The pool is partitioned into an incoming and an outgoing peer queue. The queue itself is a HeapQueue sorted by PeerIndex, with the lowest PeerIndex being item 0 and therefore the default selection (see References).

Peers with a lower peer index will be utilized more than peers with a higher index. The selection is therefore somewhat deterministic if the peers are known, and it can be influenced by an external peer.

A peer's score is not part of the peer selection criteria, which might result in a less trusted peer still being selected more often than a more trusted or fresh peer. The peer's sync status is also not part of the peer selection strategy, which may lead to increased round trips when asking for specific blocks.

The peer selection filter (incoming vs. outgoing) may not be useful at all: it is unclear when to choose an incoming over an outgoing peer, and the selection criteria might better be based on a random sample from a pool of trusted, synced, non-stale nodes.

Exploit Scenario

A malicious peer might force itself into a higher-priority position (as one of the first nodes added to the pool) and therefore influence how often the node interacts with it, taking advantage of this in an effort to isolate the node. In the (unlikely) worst case, a malicious incoming peer might become the default node returned by the peer selection even though it is less reliable or trusted (reputation).

Mitigation Recommendation

It is recommended to introduce a less deterministic component into the overall peer selection strategy.

Note: Consider falling back to peer discovery mode instead of shutting down when no peer can be acquired. doAssert may be problematic for recoverable states as this might be exploited to boot the node off the network (e.g. someone flooding the peer pool just to empty it when peer[0] is already acquired).
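
For illustration only, here is a minimal sketch of the doAssert point (DemoPool and tryAcquire are hypothetical names, not the actual PeerPool API): an empty pool is reported back to the caller as a recoverable condition, so it can fall back to discovery instead of asserting and taking the node down.

  import std/[deques, options]

  type
    DemoPeer = object
      id: string

    DemoPool = object
      peers: Deque[DemoPeer]

  # Hypothetical sketch: return `none` on an empty pool so the caller can
  # re-enter peer discovery, rather than hitting a `doAssert` that shuts the
  # node down while the pool is merely (perhaps deliberately) drained.
  proc tryAcquire(pool: var DemoPool): Option[DemoPeer] =
    if pool.peers.len == 0:
      none(DemoPeer)
    else:
      some(pool.peers.popFirst())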

References

https://github.com/status-im/nim-beacon-chain/blob/79ff4f7c41eeec0d94fc9b357bb128ee9e3eaf1b/beacon_chain/peer_pool.nim#L118-L145

https://github.com/status-im/nim-beacon-chain/blob/79ff4f7c41eeec0d94fc9b357bb128ee9e3eaf1b/beacon_chain/peer_pool.nim#L284-L296

cheatfate commented 4 years ago

@tintinweb it looks like you got confused by the code. The core idea of peer_pool.nim is to provide peers sorted by the Peer's score, so the peers with the highest score are acquired first. Nim's HeapQueue depends entirely on how the < operation is defined for the type stored in it. PeerIndex values are what is stored in the HeapQueue, and the < operation for PeerIndex is defined here: https://github.com/status-im/nim-beacon-chain/blob/master/beacon_chain/peer_pool.nim#L57-L60

proc `<`*(a, b: PeerIndex): bool =
  ## PeerIndex ``a`` holds reference to ``cmp()`` procedure which has captured
  ## PeerPool instance.
  a.cmp(b, a)

Every PeerIndex holds a reference to this cmp procedure: https://github.com/status-im/nim-beacon-chain/blob/master/beacon_chain/peer_pool.nim#L29

  PeerIndex = object
    data: int
    cmp: proc(a, b: PeerIndex): bool {.closure, gcsafe.}

When a PeerIndex object is created, it is assigned this custom cmp procedure: https://github.com/status-im/nim-beacon-chain/blob/master/beacon_chain/peer_pool.nim#L333 https://github.com/status-im/nim-beacon-chain/blob/master/beacon_chain/peer_pool.nim#L625 https://github.com/status-im/nim-beacon-chain/blob/master/beacon_chain/peer_pool.nim#L641 https://github.com/status-im/nim-beacon-chain/blob/master/beacon_chain/peer_pool.nim#L657

So every PeerIndex object holds a reference to the comparison procedure peerCmp: https://github.com/status-im/nim-beacon-chain/blob/master/beacon_chain/peer_pool.nim#L186-L191

  proc peerCmp(a, b: PeerIndex): bool {.closure, gcsafe.} =
    let p1 = res.storage[a.data].data
    let p2 = res.storage[b.data].data
    p1 < p2
  res.cmp = peerCmp

As you can see, peerCmp actually compares the stored items' data field, which is of the generic type A; in our case it is instantiated with the Peer type, defined here: https://github.com/status-im/nim-beacon-chain/blob/master/beacon_chain/eth2_network.nim#L95-L105

  Peer* = ref object
    network*: Eth2Node
    info*: PeerInfo
    discoveryId*: Eth2DiscoveryId
    connectionState*: ConnectionState
    protocolStates*: seq[RootRef]
    maxInactivityAllowed*: Duration
    netThroughput: AverageThroughput
    score*: int
    connections*: int
    disconnectedFut: Future[void]

And the corresponding comparison procedure for Peer is defined here: https://github.com/status-im/nim-beacon-chain/blob/master/beacon_chain/eth2_network.nim#L357-L365

proc `<`*(a, b: Peer): bool =
  ## Comparison function, which first checks peer's scores, and if the peers'
  ## score is equal it compares peers' network throughput.
  if a.score < b.score:
    true
  elif a.score == b.score:
    (a.netThroughput.average < b.netThroughput.average)
  else:
    false

So the HeapQueue orders the PeerIndex entries according to the score of the Peer that each index refers to in the PeerPool's storage.
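
For reference, here is a minimal self-contained sketch of that mechanism (DemoPeer is a hypothetical stand-in for the real Peer/PeerIndex machinery): because the comparison is reversed, Nim's min-heap based HeapQueue pops the highest-scored item first.

  import std/heapqueue

  type
    DemoPeer = object
      score: int

  # Reversed comparison: a higher-scored peer compares as "smaller", so the
  # min-heap underlying HeapQueue pops the best-scored peer first, which is
  # the same effect peer_pool.nim achieves via `a.cmp(b, a)` over PeerIndex.
  proc `<`(a, b: DemoPeer): bool =
    b.score < a.score

  var heap = initHeapQueue[DemoPeer]()
  heap.push(DemoPeer(score: 10))
  heap.push(DemoPeer(score: 250))
  heap.push(DemoPeer(score: 50))

  doAssert heap.pop().score == 250  # highest score acquired first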

cheatfate commented 4 years ago

One more non-deterministic component was added while the code was in audit: we now measure the network throughput of each particular peer, and it is also part of the peer comparison function.

tintinweb commented 4 years ago

You're right, thanks for the clarification @cheatfate 🙌 .

Would be great to get your thoughts on the other recommendations as well. The main idea would be to harden the peer selection strategy and make it even less deterministic (the throughput metric may already help here, even though it can potentially be influenced by a peer), so that high-score (malicious) nodes don't get a chance to dominate the selection (e.g. by manipulating their score with benign, valid, reputation-increasing requests, or by being only just cooperative enough). Adding a random sub-selection or a peer replacement strategy (even when scores are good) could make this more resistant to attacks; see the sketch after the list below.

  • A peer should not be able to control the priority it is given in the peer selection.
  • Randomly select from the initial peer list (which may be considered more trusted and contain static peers provided by the user) and from peers based on their reputation.
  • Continue searching for new peers every now and then, even when the pool is full.
  • Implement a peer replacement strategy by selecting peers to be removed from the pool and replacing them with fresh peers, to avoid being stuck in a subgraph of the p2p network (e.g. remove lower-reputation and stale peers, or peers with a lower head_count (within range), first). At the moment peers only get removed when the connection is closed.
  • Remove the means to control which queue peers are selected from (incoming/outgoing) and return a random pick from the complete pool.
  • Reserve part of the peer pool for peers that are at the same or a higher block height, to ensure that the node can get in sync with the network (scenario: the peer pool is full of peers that are behind; our own node will provide them with blocks but cannot catch up to get in sync on its own).
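
As a sketch of the random sub-selection idea (acquireRandomTopK and DemoPeer are hypothetical names, not part of the nimbus-eth2 API): keep the score ordering, but sample uniformly among the top-k candidates so that a single high-scoring peer cannot monopolize selection.

  import std/[algorithm, random]

  type
    DemoPeer = object
      id: string
      score: int

  # Hypothetical helper: rank peers by score, then pick uniformly at random
  # from the best `k`, so a high-scoring (possibly malicious) peer cannot
  # guarantee that it is always the one selected.
  proc acquireRandomTopK(peers: seq[DemoPeer], k: int): DemoPeer =
    doAssert peers.len > 0
    let ranked = peers.sorted(proc(x, y: DemoPeer): int = cmp(y.score, x.score))
    let window = ranked[0 ..< min(k, ranked.len)]  # top-k candidates
    window[rand(window.high)]                      # uniform pick among them

  when isMainModule:
    randomize()
    let pool = @[DemoPeer(id: "a", score: 900),
                 DemoPeer(id: "b", score: 850),
                 DemoPeer(id: "c", score: 100)]
    echo acquireRandomTopK(pool, 2).id  # prints "a" or "b", never "c"
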
cheatfate commented 4 years ago

First of all, I would like to say that during the audit we changed the synchronization strategy. The original version worked on the principle of using a single peer for as long as possible; the new version changes peers at every iteration (where an iteration is: request blocks from a peer, verify and store the blocks).

Initial syncing strategy (which was audited)

  let peer = peerPool.acquire()
  while notInSync:
    let blocks = peer.getBeaconBlocksByRange()
    if not verify(blocks):
      peer.penalize()
    else:
      peer.reward()
    if peer.score < lowValue:
      break
  peerPool.release(peer)

Current syncing strategy.

  while notInSync:
    let peer = peerPool.acquire()
    let blocks = peer.getBeaconBlocksByRange()
    if not verify(blocks):
      peer.penalize()
    else:
      peer.reward()
    peerPool.release(peer)

Because of this change, the PeerPool becomes sorted by peer score much faster.

Randomly select from the initial peer list (which may be considered more trusted and contain static peers provided by the user) and from peers based on their reputation.

First of all, peers in the network cannot be "more" or "less" trusted; every peer is "untrusted" until it starts responding, and as soon as a peer responds with non-empty responses we reward or penalize its score.

We shouldn't have favorites in a distributed network of peers, because while we are not in sync with the network we cannot check whether the favorites (the initial/static peer list) are following the main fork rather than a side fork.

So using a constantly changing source of peers gives us a better chance to sync faster.

Continue searching for new peers every now and then, even when the pool is full.

We search for new peers all the time, even when the PeerPool is full of already established connections: https://github.com/status-im/nim-eth/blob/master/eth/p2p/discoveryv5/protocol.nim#L731-L754 Our own discovery loop just obtains a current random snapshot of peers from the discovery service: https://github.com/status-im/nimbus-eth2/blob/master/beacon_chain/eth2_network.nim#L804

Implement a peer replacement strategy by selecting peers to be removed from the pool and replacing them with fresh peers, to avoid being stuck in a subgraph of the p2p network (e.g. remove lower-reputation and stale peers, or peers with a lower head_count (within range), first). At the moment peers only get removed when the connection is closed.

First of all, peers are removed from the PeerPool in two cases: 1) when the remote peer drops the connection (the connection is closed); 2) when the remote peer's score reaches the lowest possible value (the peer gets kicked).

A similar approach is now used in our syncing strategy: we use different peers for every iteration of syncing. The number of workers used for syncing will always be lower than the number of peers in the pool, so there will always be some number of fresh peers to choose from. But replacing well-scored peers with "unknown" peers doesn't look like a good move.

When we handshake with a peer we exchange status information, and at this moment we obtain the peer's latest head. Right now, if a peer is not part of the syncing process its latest head information does not get refreshed (but we are going to change that soon), so there is some time lag between the peer's real status and the status we know.

But we already properly handle peers with a low latest_head: https://github.com/status-im/nimbus-eth2/blob/devel/beacon_chain/sync_manager.nim#L759. So if a peer cannot close this lag, it will soon be kicked out of the PeerPool.

cheatfate commented 4 years ago

Remove the means to control which queue peers are selected from (incoming/outgoing) and return a random pick from the complete pool.

I understand why we should have more outgoing connections than incoming ones. But if an incoming peer has a higher score than an outgoing peer (meaning the incoming peer's responses have proven more useful), why should we choose the outgoing peer whose score is lower?

Also, if an incoming peer responds with incorrect blocks it will soon be removed from the PeerPool and its connection will be dropped.

Reserve part of the peer pool for peers that are at the same or a higher block height, to ensure that the node can get in sync with the network (scenario: the peer pool is full of peers that are behind; our own node will provide them with blocks but cannot catch up to get in sync on its own).

Usually, until you start requesting blocks from a peer you cannot be sure that the peer is able to satisfy your requests. Also, malicious peers can easily falsify their status response, and the biggest problem is that we cannot verify a status response until we are in sync with the network ourselves.

So we could maintain such a reserved part of the pool only under the assumption that all peers which have merely responded with a status response are "good" peers.