guillaumemichel commented 1 year ago

ETA: 2023Q1

TL;DR

What we have today

There is currently a function verifying that the peer that is about to be added to the Routing Table (RT) advertises the right DHT protocol. However, it doesn't verify whether the peer actually reply to DHT requests before adding it to the RT.

What is the problem

Misconfigured or misbehaving peers could advertise that they speak the right DHT protocol, but not answer to DHT requests as expected. These nodes may end up in the RT of well behaving peers, and upon request of a key that is close to a misbehaving peer, the well behaving peer will return the peerid of the misbehaving peer. The initial requester will then query the unresponsive peer and time out. The propagation of unresponsive peers slows down the lookup process.

How to fix it

Before adding a remote peer to its RT, a node should verify that this remote peer is able to answer DHT queries correctly. It can verify this by making a DHT query to this peer (if not done before).

What is the expected impact

Nodes that have applied the patch no longer have unresponsive DHT servers in their RT, and they don't spread them anymore. However they can still be victim of unresponsive DHT server propagation from outdated peers. The DHT is expected to get faster as more peers apply the patch.

Checklist

[x] Consensus on the changes to perform
[x] Implement changes
[x] Document the change at the relevant location
[x] Merge the change

Adding peers to the RT

New peers are added to the DHT using the TryAddPeer function only. Note that the queryPeer bool variable is only useful for setting LastUsefulAt, not for knowing whether the peer is a valid DHT server.

The TryAddPeer function is only called from the rtPeerLoop function.

https://github.com/libp2p/go-libp2p-kad-dht/blob/2b85cfc0b1e5372bb36cb3ed4c82321054b1f055/dht.go#L590-L597

The only writer to the dht.addPeerToRTChan chan is the peerFound function.

https://github.com/libp2p/go-libp2p-kad-dht/blob/2b85cfc0b1e5372bb36cb3ed4c82321054b1f055/dht.go#L625-L659

The only check performed by peerFound before adding the peer to the RT is calling validRTPeer. This function only verify whether the remote peer advertises that it speaks the DHT protocol. But it doesn't verify that the remote peer is actually responsive to DHT requests. This check isn't enough.

https://github.com/libp2p/go-libp2p-kad-dht/blob/2b85cfc0b1e5372bb36cb3ed4c82321054b1f055/subscriber_notifee.go#L145-L155

We will list the callers of peerFound and explain how they add peers to the RT.

`New(...) (*IpfsDHT, error)`

https://github.com/libp2p/go-libp2p-kad-dht/blob/2b85cfc0b1e5372bb36cb3ed4c82321054b1f055/dht.go#L234-L239

When creating the DHT, the node tries to add all connected peers to its RT. Because of validRTPeer, only the peers advertising the DHT protocol will get added.

In this case, it is important to make a DHT query (for a any key) to all remote peers that are about to be added to the RT, and only add them to the RT if they answer without error to the DHT query.

`fixLowPeers`

https://github.com/libp2p/go-libp2p-kad-dht/blob/2b85cfc0b1e5372bb36cb3ed4c82321054b1f055/dht.go#L475-L480

Similar to the case above.

`handleNewMessage`

https://github.com/libp2p/go-libp2p-kad-dht/blob/2b85cfc0b1e5372bb36cb3ed4c82321054b1f055/dht_net.go#L113-L114

If a peer sent us a DHT request (and advertises being a DHT server), add it to the RT.

Before adding these peers to the RT, make sure that they answer correctly to DHT queries

`queryPeer`

https://github.com/libp2p/go-libp2p-kad-dht/blob/2b85cfc0b1e5372bb36cb3ed4c82321054b1f055/query.go#L407-L420

Upon successful DHT request, the remote peer is added to the RT.

No further action is required, as the node has proven to answer DHT queries

DHT requests upon RTRefresh

In addition to the proposed fix, it is possible to periodically make DHT request to all nodes in the RT to make sure they are still responsive to DHT queries. This operation however is more expensive, as 1 additional query is sent every 10 minutes for each RT member.

https://github.com/libp2p/go-libp2p-kad-dht/pull/810

A probabilistic approach can decrease the network load of additional DHT requests. At every refresh, the node only sends DHT requests to a fraction of the nodes inside a bucket, for all buckets. If it detects some unresponsive nodes, the fraction of peers queried in each bucket is increased for the next refresh. If it doesn't detect any unresponsive nodes for a while, the fraction of peers queried decreases at the next refresh, but never goes below a magic threshold.

This technique allows a low overhead if the network is behaving correctly. And if a significant share of the network acts in an adversarial way, the unresponsive DHT nodes are detected and removed from the RT, at the price of a higher network load.

IMO periodically verifying if nodes in the RT still answer to DHT queries isn't needed (at least for now). Preventing unresponsive nodes from being added to the RT in the first place should be sufficient.

yiannisbot commented 1 year ago

We're mostly interested in large-scale incidents here and not the random node that happens to not be responsive. From this point of view, statistics can go a long way :) I'd say starting with a sample of 10%-20% of nodes in the routing table would do the trick, IMO.

dennis-tra commented 1 year ago

Actually, wouldn't it suffice to do a query similar to what we added in #810 inside the peerFound function if the queryPeer flag is false? This would definitely have caught the unresponsive nodes we observed recently. I'm not sure if this would break any separation of concerns inside the code base - I'd need another look through the code. Just wanted to leave it here as food for thought.

guillaumemichel commented 1 year ago

@dennis-tra in the case where a remote peer sends a DHT query, we don't know whether it can actually answer DHT queries (it must advertise the DHT protocol though), and the flag is still set to true.

https://github.com/libp2p/go-libp2p-kad-dht/blob/2b85cfc0b1e5372bb36cb3ed4c82321054b1f055/dht_net.go#L113-L114

I think that the easiest way to fix it is to add a flag to the peerFound function, and propagate it to the validRTPeer where the same check as in https://github.com/libp2p/go-libp2p-kad-dht/issues/810 is performed. I'll work on it :)

dennis-tra commented 1 year ago

in the case where a remote peer sends a DHT query, we don't know whether it can actually answer DHT queries (it must advertise the DHT protocol though), and the flag is still set to true.

Sure :+1: I just thought that this would have still caught the unresponsive peers. The peers were unresponsive because they couldn't open a new stream for the DHT protocol. If we received a query this would have proven that they still can open a stream. However, as you said, this doesn't prove that they respond to queries.

Maybe there is a solution without passing an additional boolean flag through the stack? I'm asking because this sounds like the function could be split. Anyways, I'm sure you'll see how it's done better :)

BigLep commented 1 year ago

FYI @guillaumemichel that @Jorropo will be your Kubo-maintainer point of contact for supporting you to get these landed.

yiannisbot commented 1 year ago

@Jorropo @BigLep do we have an ETA for reviewing this and getting it merged? It's one of the few remaining action items from the incidents earlier in the year and one of the milestones we've committed to ship by end of 2023Q1.

BigLep commented 1 year ago

@yiannisbot : this isn't making it in for 0.20 :(. It will happen after IPFS Thing for 0.21 (release in later may).

guillaumemichel commented 1 year ago

https://github.com/libp2p/go-libp2p-kad-dht/pull/820 has been merged

libp2p / go-libp2p-kad-dht

Don't add unresponsive DHT servers to the RT #811

TL;DR

What we have today

What is the problem

How to fix it

What is the expected impact

Checklist

Adding peers to the RT

`New(...) (*IpfsDHT, error)`

`fixLowPeers`

`handleNewMessage`

`queryPeer`

DHT requests upon RTRefresh