status-im / nimbus-eth1

Nimbus: an Ethereum Execution Client for Resource-Restricted Devices
https://status-im.github.io/nimbus-eth1/
Apache License 2.0

Use data radius to prioritize requests for recursiveFindContent #2509

Closed: kdeme closed this issue 1 week ago

kdeme commented 3 months ago

When starting a recursiveFindContent request, the lookup first selects the closest nodes from the local routing table and then continuously tries to contact ever closer nodes.

However, the data radius of the nodes also plays a role here. We could prioritize nodes whose data radius actually covers the target (if we know the data radius of the node). It remains to be seen whether this added complexity is worth it in terms of the number of nodes we need to contact on average before we find the content.
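
The "covers the target" check is just an XOR-distance comparison against the node's advertised data radius. A minimal sketch with toy uint64 ids (the real node and content ids are 256-bit, and the radius would come from the node's locally cached value):

```nim
# Toy illustration, not the Fluffy code: a node is expected to hold content
# whose XOR distance to its own id is at most its advertised data radius.
proc inRange(nodeId, radius, contentId: uint64): bool =
  (nodeId xor contentId) <= radius

when isMainModule:
  let contentId = 0xDEADBEEF'u64
  # A node close to the content with a large enough radius covers it ...
  doAssert inRange(0xDEADBE00'u64, 0x0000FFFF'u64, contentId)
  # ... while a distant node with a small radius does not.
  doAssert not inRange(0x12345678'u64, 0x000000FF'u64, contentId)
```

Prioritizing such nodes in the lookup would then be a matter of ordering the candidate list by this predicate before querying.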

bhartnett commented 1 week ago

I've discovered a scenario on the portal mainnet state network where Fluffy is unable to find the following content:

{
    "jsonrpc": "2.0",
    "method": "portal_stateGetContent",
    "id": 200,
    "params": ["0x20240000004e1f64763876ab27c76284439deba305344fa840821995977883f89de42f29431c62"]
}

When testing with Trin I can get the content back very quickly.

It looks like the content lookup is failing to return nodes that respond and fall within the radius. I believe we should improve the lookup to only search for nodes that have been seen and that have a radius that covers the content being searched.

kdeme commented 1 week ago

> I believe we should improve the lookup to only search for nodes that have been seen and that have a radius that covers the content being searched.

Not sure exactly what you mean here, but regarding the part about using only "seen" nodes: you can already apply this to the starting node list today, as it is a flag. This might indeed be a good change, so that we start the lookup from a list of nodes that are known to exist. But it remains to be tested whether this is really the issue.

However, you cannot apply this to the nodes returned later in the lookup, and you also cannot contact only the nodes whose radius covers the content. Not sure if this is exactly what you meant with your statement above, but if you do this, a search can die out fairly quickly if your local node is located far away (in XOR distance) from the content id.

kdeme commented 1 week ago

FYI: it is definitely not a consistent failure and is likely very dependent on NodeId, routing table fill, and geographic location. I have tested it myself 10 times, each time on a fresh node with a new node id, running for literally 10 seconds, and each time the request worked for me.

bhartnett commented 1 week ago

> I believe we should improve the lookup to only search for nodes that have been seen and that have a radius that covers the content being searched.

> Not sure exactly what you mean here, but regarding the part about using only "seen" nodes: you can already apply this to the starting node list today, as it is a flag. This might indeed be a good change, so that we start the lookup from a list of nodes that are known to exist. But it remains to be tested whether this is really the issue.

Yes, I found that flag. I was thinking of filtering the nodes by requiring seen == true and then only querying nodes in the radius cache that are in range of the content.

> However, you cannot apply this to the nodes returned later in the lookup, and you also cannot contact only the nodes whose radius covers the content. Not sure if this is exactly what you meant with your statement above, but if you do this, a search can die out fairly quickly if your local node is located far away (in XOR distance) from the content id.

Interesting. Yes, I actually did something like this. I tested it and it did help me find the content successfully, but I'm not planning to go ahead with the change, at least not in that form. A better change would probably be to sort the initial list of nodes so that we first query nodes whose radius covers the content.

bhartnett commented 1 week ago

> FYI: it is definitely not a consistent failure and is likely very dependent on NodeId, routing table fill, and geographic location. I have tested it myself 10 times, each time on a fresh node with a new node id, running for literally 10 seconds, and each time the request worked for me.

Good to know. I was able to get the data back using Trin on the same network, so it is odd. I wonder if my Fluffy ENR is just really far from the content.

bhartnett commented 1 week ago

@kdeme What did you mean by this? 'We could prioritize nodes whose data radius actually covers the target (if we know the data radius of the node)'

Just curious to understand further what your suggestion was for this task.

kdeme commented 1 week ago

> Just curious to understand further what your suggestion was for this task.

The target = the requested content. So the idea is to prioritize requests to nodes that are likely to have the content, based on their radius.

> Yes, I found that flag. I was thinking of filtering the nodes by requiring seen == true and then only querying nodes in the radius cache that are in range of the content.

I don't think it is a good idea to query only those nodes. I think they should be prioritized, which is what this issue is about, but if they return useless results, other nodes should still be queried, as they can still provide useful peers.

Edit: to comment on this further: it is an optimization to avoid additional hops in getting to the right peers. It should not be needed to "fix" a current content request failure.

bhartnett commented 1 week ago

> Yes, I found that flag. I was thinking of filtering the nodes by requiring seen == true and then only querying nodes in the radius cache that are in range of the content.

> I don't think it is a good idea to query only those nodes. I think they should be prioritized, which is what this issue is about, but if they return useless results, other nodes should still be queried, as they can still provide useful peers.

Thanks, I see what you mean. Actually, my second attempt at improving the query did sorting rather than filtering, so I was headed in the same direction as what you suggest.

> Edit: to comment on this further: it is an optimization to avoid additional hops in getting to the right peers. It should not be needed to "fix" a current content request failure.

Well, I believe it is a problem related to making our implementation more tolerant of failing nodes in our routing table, so I was trying to fix the scenario where the first 16 nodes all fail to respond or have a radius that does not cover the content. I was thinking of doing something like this:

  var closestNodes =
    p.routingTable.neighbours(targetId, p.routingTable.len(), seenOnly = false)

  # Shuffling the order of the nodes in order to not always hit the same node
  # first for the same request.
  p.baseProtocol.rng[].shuffle(closestNodes)

  # Prioritize nodes whose known data radius covers the content id. Returning
  # 0 for equal priority keeps the comparison symmetric and, because `sort`
  # is a stable merge sort, preserves the shuffled order otherwise.
  proc nodesCmp(x, y: Node): int =
    let
      xRadius = p.radiusCache.get(x.id)
      yRadius = p.radiusCache.get(y.id)
      xInRange = xRadius.isSome() and p.inRange(x.id, xRadius.unsafeGet(), targetId)
      yInRange = yRadius.isSome() and p.inRange(y.id, yRadius.unsafeGet(), targetId)

    if xInRange == yInRange:
      0
    elif xInRange:
      -1
    else:
      1

  closestNodes.sort(nodesCmp)

This basically prioritizes nodes that are within the radius of the content, while otherwise leaving the order of the nodes unchanged. I did notice the query performance improve after this change, but I'm not sure whether it will create other issues. In general, I think we should find a way to filter out nodes that are not responding or are sending empty responses, as well as aim to query nodes with the correct radius range, as you said.
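
As a self-contained toy version of this prioritization idea (a sketch with made-up types, not the actual Fluffy code): nodes whose known radius covers the content sort to the front, and because Nim's `sort` is a stable merge sort, the remaining (shuffled) relative order is preserved.

```nim
import std/[algorithm, options]

type ToyNode = object
  id: uint64
  radius: Option[uint64] # radius is none when we have not learned it yet

proc inRange(n: ToyNode, contentId: uint64): bool =
  # A node covers the content when the XOR distance between its id and the
  # content id is at most its advertised radius.
  n.radius.isSome() and (n.id xor contentId) <= n.radius.get()

proc prioritize(nodes: var seq[ToyNode], contentId: uint64) =
  proc nodesCmp(x, y: ToyNode): int =
    let
      xIn = x.inRange(contentId)
      yIn = y.inRange(contentId)
    if xIn == yIn: 0 # equal priority: the stable sort keeps the current order
    elif xIn: -1     # x covers the content, query it first
    else: 1

  nodes.sort(nodesCmp)

when isMainModule:
  let contentId = 0xABCD'u64
  var nodes = @[
    ToyNode(id: 0xF000'u64, radius: some(0x00FF'u64)), # out of range
    ToyNode(id: 0x1234'u64, radius: none(uint64)),     # radius unknown
    ToyNode(id: 0xAB00'u64, radius: some(0x0FFF'u64))  # covers the content
  ]
  nodes.prioritize(contentId)
  doAssert nodes[0].id == 0xAB00'u64 # in-range node is queried first
```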

kdeme commented 1 week ago

So there are several items being discussed here:

  1. Decreasing the number of hops by prioritizing nodes whose radius covers the data. This is what this issue is mainly about, and it seems reasonable to me that this would be worth it for the performance gain.
  2. The issue of all 16 starting nodes not responding (I don't think I have ever encountered this) is an issue with the health of the network. Nevertheless, it can occur (e.g. through malicious nodes) and we need to be resilient against it. However, increasing the number of initial nodes for this reason is a workaround and not a proper fix. We should have ways (e.g. a banlist or peer scoring) to avoid having those "bad" nodes in the routing table in the first place.
  3. The possible performance improvement from increasing the number of starting nodes (possibly up to the whole routing table) in order to hit nodes with a bigger than usual radius. In traditional Kademlia networks, there isn't really a concept of radius: it is simply the closest nodes that hold the data and that are used to retrieve it, which also keeps the network more balanced in terms of load. Our addition of a radius makes this different, and might tempt one to optimize by always trying to find and use these big-radius nodes. The simple optimization (the first item) does not induce a big cost, I think, for potentially good returns, so that makes sense to me. However, for this third item, it should be noted that 1) for network health it is better not to hammer one big-radius node, and 2) this is mostly effective in the specific case where a bigger-radius node would have avoided an extra request. In the other, more common case, it will actually introduce additional CPU cost and delay (although less noticeable, especially now with small-scale networks). I think this type of performance increase with big-radius nodes should be tackled at the specs/protocol level, and I believe there is already an open issue for this.

bhartnett commented 1 week ago

The changes for this are implemented in this PR: https://github.com/status-im/nimbus-eth1/pull/2841