Closed. kdeme closed this issue 1 week ago.
I've discovered a scenario on the portal mainnet state network where Fluffy is unable to find the following content:
```json
{
  "jsonrpc": "2.0",
  "method": "portal_stateGetContent",
  "id": 200,
  "params": ["0x20240000004e1f64763876ab27c76284439deba305344fa840821995977883f89de42f29431c62"]
}
```
When testing with Trin I can get the content back very quickly.
It looks like the content lookup is failing to return nodes that respond and fall within the radius. I believe we should improve the lookup to only search for nodes that have been seen and that have a radius that covers the content being searched.
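For reference, a minimal sketch of how the same request can be fired against a locally running node, assuming its JSON-RPC HTTP server is enabled; the endpoint URL below is an assumption and needs to match the node's actual host/port configuration:

```nim
import std/[httpclient, json]

# Assumed local JSON-RPC endpoint; adjust host and port to match the
# node's actual configuration.
const rpcUrl = "http://127.0.0.1:8545"

let payload = %*{
  "jsonrpc": "2.0",
  "method": "portal_stateGetContent",
  "id": 200,
  "params": ["0x20240000004e1f64763876ab27c76284439deba305344fa840821995977883f89de42f29431c62"]
}

let client = newHttpClient()
client.headers = newHttpHeaders({"Content-Type": "application/json"})
echo client.postContent(rpcUrl, body = $payload)
```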
> I believe we should improve the lookup to only search for nodes that have been seen and that have a radius that covers the content being searched.
Not sure what exactly you mean here, but regarding the part about using only "seen" nodes: you can practically apply this already on the starting node list today, as it is a flag. This might be a good change indeed, so that we start the lookup from a list of nodes that are known to exist. But it is to be tested whether this is really the issue.

However, you cannot apply it to further nodes returned during the lookup, and you also cannot only contact nodes whose radius covers the content. Not sure if this is exactly what you meant with your statement above, because if you do this, a search can die out fairly quickly if your local node is located far away (in XOR distance) from the content id.
fyi: it is definitely not a consistent failure and might be very dependent on NodeId, routing table filling and geographic location. I have tested it myself 10x, each time on a fresh node with a new node id running for literally 10 seconds, and each time the request worked for me.
> > I believe we should improve the lookup to only search for nodes that have been seen and that have a radius that covers the content being searched.
>
> Not sure what exactly you mean here, but regarding the part about using only "seen" nodes: you can practically apply this already on the starting node list today, as it is a flag. This might be a good change indeed, so that we start the lookup from a list of nodes that are known to exist. But it is to be tested whether this is really the issue.
Yes, I found that flag. I was thinking of filtering the nodes by setting `seen == true` and then only querying nodes in the radius cache that are in range of the content.
> However, you cannot apply it to further nodes returned during the lookup, and you also cannot only contact nodes whose radius covers the content. Not sure if this is exactly what you meant with your statement above, because if you do this, a search can die out fairly quickly if your local node is located far away (in XOR distance) from the content id.
Interesting. Yes, I actually did something like this. I tested it and it did help me find the content successfully, but I'm not planning to go ahead with the change, at least in that form. Probably a better change would be to do some sorting on the initial list of nodes so that we query nodes that have a radius covering the content first.
> fyi: it is definitely not a consistent failure and might be very dependent on NodeId, routing table filling and geographic location. I have tested it myself 10x, each time on a fresh node with a new node id running for literally 10 seconds, and each time the request worked for me.
Good to know. I was able to get the data back using Trin on the same network, so it is odd. I wonder if my Fluffy ENR is just really far from the content.
@kdeme What did you mean by this? 'We could prioritize on nodes for which the data radius actually falls in the target range (if we know the data radius of the node)'
Just curious to understand further what your suggestion was for this task.
> Just curious to understand further what your suggestion was for this task.
The target is the requested content. Thus, prioritize requests to nodes that are likely to have the content, based on their radius.
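For illustration, the "likely to have the content" check boils down to comparing the XOR distance between the node id and the content id against the node's advertised data radius. A minimal, self-contained sketch of that check, using toy 64-bit ids instead of the 256-bit ids the actual implementation uses:

```nim
import std/options

# Toy 64-bit ids for illustration only; the actual implementation works
# with 256-bit node ids, content ids and radii.
type NodeInfo = object
  id: uint64
  radius: Option[uint64]  # unknown until we have learned the node's radius

func distance(a, b: uint64): uint64 =
  a xor b

func coversContent(n: NodeInfo, contentId: uint64): bool =
  # A node is expected to store the content when the XOR distance between
  # its node id and the content id is within its advertised data radius.
  n.radius.isSome() and distance(n.id, contentId) <= n.radius.get()

when isMainModule:
  let n = NodeInfo(id: 0b1010'u64, radius: some(0b0111'u64))
  echo n.coversContent(0b1100'u64)  # distance 0b0110 <= radius 0b0111 -> true
```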
> Yes, I found that flag. I was thinking of filtering the nodes by setting `seen == true` and then only querying nodes in the radius cache that are in range of the content.
I don't think it is a good idea to only query those nodes. I think they should be prioritized, which is what this issue is about, but if they turn out to give useless results, other nodes should still be queried as they can still provide useful peers.

edit: to further comment on this. It is an optimization to avoid additional hops in getting to the right peers. It should not be needed to "fix" a current content request failure.
> > Yes, I found that flag. I was thinking of filtering the nodes by setting `seen == true` and then only querying nodes in the radius cache that are in range of the content.
>
> I don't think it is a good idea to only query those nodes. I think they should be prioritized, which is what this issue is about, but if they turn out to give useless results, other nodes should still be queried as they can still provide useful peers.
Thanks, I see what you mean. Actually, my second attempt to improve the query did sorting rather than filtering, so I was headed in the same direction as what you suggest.
> edit: to further comment on this. It is an optimization to avoid additional hops in getting to the right peers. It should not be needed to "fix" a current content request failure.
Well, I believe it is a problem related to making our implementation more tolerant to failing nodes in our routing table, so I was trying to fix the scenario where the first 16 nodes all fail to respond or have a radius that does not cover the content. I was thinking of doing something like this:
```nim
var closestNodes =
  p.routingTable.neighbours(targetId, p.routingTable.len(), seenOnly = false)

# Shuffling the order of the nodes in order to not always hit the same node
# first for the same request.
p.baseProtocol.rng[].shuffle(closestNodes)

proc nodesCmp(x, y: Node): int =
  let
    xRadius = p.radiusCache.get(x.id)
    yRadius = p.radiusCache.get(y.id)

  if xRadius.isSome() and p.inRange(x.id, xRadius.unsafeGet(), targetId):
    -1
  elif yRadius.isSome() and p.inRange(y.id, yRadius.unsafeGet(), targetId):
    1
  else:
    0

closestNodes.sort(nodesCmp)
```
This basically prioritizes nodes that are within the radius of the content but otherwise leaves the order of the nodes unchanged. I did notice query performance improve after this change, but I'm not sure if it will create other issues. In general, I think we should think of a way to filter out nodes that are not responding or sending empty responses, as well as aiming to query nodes having the correct radius range, as you said.
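As a purely hypothetical sketch of the "filter out nodes that are not responding" part (not what was implemented; all names and the threshold are made up), a small per-node failure counter could deprioritize or skip nodes after a few consecutive failed or empty responses:

```nim
import std/tables

# Hypothetical helper, using toy 64-bit node ids: track consecutive
# failed or empty responses per node and skip nodes that keep failing.
type FailureTracker = object
  failures: Table[uint64, int]  # node id -> consecutive failure count
  maxFailures: int

proc initFailureTracker(maxFailures = 3): FailureTracker =
  FailureTracker(failures: initTable[uint64, int](), maxFailures: maxFailures)

proc recordFailure(t: var FailureTracker, nodeId: uint64) =
  t.failures[nodeId] = t.failures.getOrDefault(nodeId) + 1

proc recordSuccess(t: var FailureTracker, nodeId: uint64) =
  # A useful response clears the node's failure history.
  t.failures.del(nodeId)

proc shouldSkip(t: FailureTracker, nodeId: uint64): bool =
  t.failures.getOrDefault(nodeId) >= t.maxFailures

when isMainModule:
  var tracker = initFailureTracker()
  for _ in 0 ..< 3: tracker.recordFailure(42'u64)
  echo tracker.shouldSkip(42'u64)  # true
  echo tracker.shouldSkip(7'u64)   # false
```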
So there are several items being discussed here:
Changes for this are implemented in this PR: https://github.com/status-im/nimbus-eth1/pull/2841
When starting a recursiveFindContent request, it first looks up the closest nodes in the local routing table and then continuously tries to contact closer nodes.

However, the data radius of the nodes also plays a role here. We could prioritize on nodes for which the data radius actually falls in the target range (if we know the data radius of the node). It remains to be seen, however, whether this added complexity is worth it in terms of the number of nodes we need to contact on average before we find the content.
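A rough sketch of what such a prioritization could look like when ordering lookup candidates, again with toy 64-bit ids and made-up names; the real lookup operates on 256-bit ids and still has to keep contacting nodes with an unknown or non-covering radius:

```nim
import std/[algorithm, options]

# Toy candidate type; the actual implementation works with 256-bit ids.
type Candidate = object
  id: uint64
  radius: Option[uint64]  # none() when the node's radius is not yet known

func distance(a, b: uint64): uint64 = a xor b

func covers(c: Candidate, target: uint64): bool =
  c.radius.isSome() and distance(c.id, target) <= c.radius.get()

proc prioritize(candidates: var seq[Candidate], target: uint64) =
  proc cmpCandidates(x, y: Candidate): int =
    # Nodes with a known radius covering the target come first; within
    # each group, nodes closer to the target (by XOR distance) come earlier.
    let cx = covers(x, target)
    let cy = covers(y, target)
    if cx != cy:
      (if cx: -1 else: 1)
    else:
      cmp(distance(x.id, target), distance(y.id, target))

  candidates.sort(cmpCandidates)

when isMainModule:
  var cs = @[
    Candidate(id: 0x10'u64, radius: none(uint64)),
    Candidate(id: 0xF0'u64, radius: some(high(uint64))),  # covers everything
    Candidate(id: 0x11'u64, radius: some(1'u64))
  ]
  cs.prioritize(0x42'u64)
  for c in cs:
    echo c.id  # 240, 16, 17
```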