Closed guillaumemichel closed 1 year ago
We're mostly interested in large-scale incidents here and not the random node that happens to not be responsive. From this point of view, statistics can go a long way :) I'd say starting with a sample of 10%-20% of nodes in the routing table would do the trick, IMO.
Actually, wouldn't it suffice to do a query similar to what we added in #810 inside the peerFound
function if the queryPeer
flag is false
? This would definitely have caught the unresponsive nodes we observed recently. I'm not sure if this would break any separation of concerns inside the code base - I'd need another look through the code. Just wanted to leave it here as food for thought.
@dennis-tra in the case where a remote peer sends a DHT query, we don't know whether it can actually answer DHT queries (it must advertise the DHT protocol though), and the flag is still set to true.
I think that the easiest way to fix it is to add a flag to the peerFound
function, and propagate it to the validRTPeer
where the same check as in https://github.com/libp2p/go-libp2p-kad-dht/issues/810 is performed. I'll work on it :)
in the case where a remote peer sends a DHT query, we don't know whether it can actually answer DHT queries (it must advertise the DHT protocol though), and the flag is still set to true.
Sure :+1: I just thought that this would have still caught the unresponsive peers. The peers were unresponsive because they couldn't open a new stream for the DHT protocol. If we received a query this would have proven that they still can open a stream. However, as you said, this doesn't prove that they respond to queries.
Maybe there is a solution without passing an additional boolean flag through the stack? I'm asking because this sounds like the function could be split. Anyways, I'm sure you'll see how it's done better :)
FYI @guillaumemichel that @Jorropo will be your Kubo-maintainer point of contact for supporting you to get these landed.
@Jorropo @BigLep do we have an ETA for reviewing this and getting it merged? It's one of the few remaining action items from the incidents earlier in the year and one of the milestones we've committed to ship by end of 2023Q1.
@yiannisbot : this isn't making it in for 0.20 :(. It will happen after IPFS Thing for 0.21 (release in later may).
https://github.com/libp2p/go-libp2p-kad-dht/pull/820 has been merged
ETA: 2023Q1
TL;DR
What we have today
There is currently a function verifying that the peer that is about to be added to the Routing Table (RT) advertises the right DHT protocol. However, it doesn't verify whether the peer actually reply to DHT requests before adding it to the RT.
What is the problem
Misconfigured or misbehaving peers could advertise that they speak the right DHT protocol, but not answer to DHT requests as expected. These nodes may end up in the RT of well behaving peers, and upon request of a key that is close to a misbehaving peer, the well behaving peer will return the
peerid
of the misbehaving peer. The initial requester will then query the unresponsive peer and time out. The propagation of unresponsive peers slows down the lookup process.How to fix it
Before adding a remote peer to its RT, a node should verify that this remote peer is able to answer DHT queries correctly. It can verify this by making a DHT query to this peer (if not done before).
What is the expected impact
Nodes that have applied the patch no longer have unresponsive DHT servers in their RT, and they don't spread them anymore. However they can still be victim of unresponsive DHT server propagation from outdated peers. The DHT is expected to get faster as more peers apply the patch.
Checklist
Adding peers to the RT
New peers are added to the DHT using the
TryAddPeer
function only. Note that thequeryPeer
bool
variable is only useful for settingLastUsefulAt
, not for knowing whether the peer is a valid DHT server.The
TryAddPeer
function is only called from thertPeerLoop
function.https://github.com/libp2p/go-libp2p-kad-dht/blob/2b85cfc0b1e5372bb36cb3ed4c82321054b1f055/dht.go#L590-L597
The only writer to the
dht.addPeerToRTChan
chan
is thepeerFound
function.https://github.com/libp2p/go-libp2p-kad-dht/blob/2b85cfc0b1e5372bb36cb3ed4c82321054b1f055/dht.go#L625-L659
The only check performed by
peerFound
before adding the peer to the RT is callingvalidRTPeer
. This function only verify whether the remote peer advertises that it speaks the DHT protocol. But it doesn't verify that the remote peer is actually responsive to DHT requests. This check isn't enough.https://github.com/libp2p/go-libp2p-kad-dht/blob/2b85cfc0b1e5372bb36cb3ed4c82321054b1f055/subscriber_notifee.go#L145-L155
We will list the callers of
peerFound
and explain how they add peers to the RT.New(...) (*IpfsDHT, error)
https://github.com/libp2p/go-libp2p-kad-dht/blob/2b85cfc0b1e5372bb36cb3ed4c82321054b1f055/dht.go#L234-L239
When creating the DHT, the node tries to add all connected peers to its RT. Because of
validRTPeer
, only the peers advertising the DHT protocol will get added.In this case, it is important to make a DHT query (for a any key) to all remote peers that are about to be added to the RT, and only add them to the RT if they answer without error to the DHT query.
fixLowPeers
https://github.com/libp2p/go-libp2p-kad-dht/blob/2b85cfc0b1e5372bb36cb3ed4c82321054b1f055/dht.go#L475-L480
Similar to the case above.
handleNewMessage
https://github.com/libp2p/go-libp2p-kad-dht/blob/2b85cfc0b1e5372bb36cb3ed4c82321054b1f055/dht_net.go#L113-L114
If a peer sent us a DHT request (and advertises being a DHT server), add it to the RT.
Before adding these peers to the RT, make sure that they answer correctly to DHT queries
queryPeer
https://github.com/libp2p/go-libp2p-kad-dht/blob/2b85cfc0b1e5372bb36cb3ed4c82321054b1f055/query.go#L407-L420
Upon successful DHT request, the remote peer is added to the RT.
No further action is required, as the node has proven to answer DHT queries
DHT requests upon RTRefresh
In addition to the proposed fix, it is possible to periodically make DHT request to all nodes in the RT to make sure they are still responsive to DHT queries. This operation however is more expensive, as 1 additional query is sent every 10 minutes for each RT member.
https://github.com/libp2p/go-libp2p-kad-dht/pull/810
A probabilistic approach can decrease the network load of additional DHT requests. At every refresh, the node only sends DHT requests to a fraction of the nodes inside a bucket, for all buckets. If it detects some unresponsive nodes, the fraction of peers queried in each bucket is increased for the next refresh. If it doesn't detect any unresponsive nodes for a while, the fraction of peers queried decreases at the next refresh, but never goes below a magic threshold.
This technique allows a low overhead if the network is behaving correctly. And if a significant share of the network acts in an adversarial way, the unresponsive DHT nodes are detected and removed from the RT, at the price of a higher network load.
IMO periodically verifying if nodes in the RT still answer to DHT queries isn't needed (at least for now). Preventing unresponsive nodes from being added to the RT in the first place should be sufficient.