Closed: sgerbino closed this issue 1 year ago
One guess as to what may be happening:
Of course, sending big blocks over a slow connection takes a long time. Possibly enough to trip the timeout.
A couple things we could do:
Personally I think (1) isn't the right approach (it raises slowloris issues, and implementing it may require getting deep into the plumbing of libp2p). (2) is more viable but doesn't fit the current architecture (we would need to stream the response instead of creating it in one shot). By process of elimination, that leaves (3).
I suggest a size threshold of 3x max block size, a typical bandwidth of 400 KB/sec, and a safety factor of 3. So this leaves us with the following implementation sketch:
- `GetBlocksResponse` returns fewer blocks than requested. (I don't think we actually need to change protobuf struct definitions, we just need to look at the client-side code and make sure it's prepared to accept that `Blocks[]` may be shorter than the requested `NumBlocks`.)
- As `GetBlocks()` marshals blocks to `response.Blocks`, it keeps track of the running total size of all blocks in the response.
- Once the running total reaches the threshold, `GetBlocks()` truncates `response.Blocks` and returns fewer blocks than requested.
- This caps the `GetBlocks()` response at 1.5 MB. (I assumed a max block size of 0.5 MB, but I'm not sure if this is actually the case. If max block size is larger than 0.5 MB, we should scale all the above numbers proportionally.)
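A minimal Go sketch of the truncation logic above, under the assumed 3x / 0.5 MB numbers. The names `GetBlocksResponse`, `Blocks`, and the `getBlocks` signature follow the discussion loosely; the real protobuf types and block-store lookup are stubbed out:

```go
package main

import "fmt"

// maxResponseSize is the assumed threshold: 3x an assumed 0.5 MB max block size.
const maxResponseSize = 3 * 512 * 1024 // bytes

// Block stands in for a marshaled block.
type Block []byte

// GetBlocksResponse mirrors the protobuf response; Blocks may end up
// shorter than the requested NumBlocks.
type GetBlocksResponse struct {
	Blocks []Block
}

// getBlocks appends blocks to the response while tracking the running total
// size, truncating once the next block would push the response past the
// threshold. The first block is always included so progress is guaranteed
// even if a single block exceeds the threshold.
func getBlocks(available []Block, numBlocks int) GetBlocksResponse {
	resp := GetBlocksResponse{}
	total := 0
	for i := 0; i < numBlocks && i < len(available); i++ {
		b := available[i]
		if total+len(b) > maxResponseSize && len(resp.Blocks) > 0 {
			break // truncate: return fewer blocks than requested
		}
		resp.Blocks = append(resp.Blocks, b)
		total += len(b)
	}
	return resp
}

func main() {
	// Three 700 KB blocks requested; only two fit under the ~1.5 MB threshold.
	big := make(Block, 700*1024)
	resp := getBlocks([]Block{big, big, big}, 3)
	fmt.Println(len(resp.Blocks)) // prints 2
}
```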
If this is implemented, we should also create a separate issue on the block store to implement a size threshold there. (This is an optimization so that e.g. if we get into some large blocks, we can avoid a situation where the block store reads 1000 large blocks and sends them to the p2p, only for the p2p to consume like 5 of them, decide that its size threshold is reached and it doesn't want any more, then throw away the remaining 995.)
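For the block-store side, one possible shape is to pass the size budget down so the store stops reading early. This is a hypothetical API sketch (`BlockMeta` and `readBlocksUpTo` do not exist in the codebase; the point is only that the budget check happens before the expensive reads):

```go
package main

import "fmt"

// BlockMeta pairs a block ID with its stored size (hypothetical shape).
type BlockMeta struct {
	ID   int
	Size int
}

// readBlocksUpTo walks the index in order and stops as soon as the size
// budget would be exceeded, instead of loading every requested block and
// letting p2p discard most of them.
func readBlocksUpTo(index []BlockMeta, numBlocks, sizeBudget int) []BlockMeta {
	var out []BlockMeta
	total := 0
	for _, m := range index {
		if len(out) == numBlocks {
			break
		}
		if total+m.Size > sizeBudget && len(out) > 0 {
			break // budget reached: skip the remaining reads entirely
		}
		out = append(out, m)
		total += m.Size
	}
	return out
}

func main() {
	// 1000 large blocks on disk, but only a handful fit in the budget.
	index := make([]BlockMeta, 1000)
	for i := range index {
		index[i] = BlockMeta{ID: i, Size: 300 * 1024} // 300 KB each
	}
	got := readBlocksUpTo(index, 1000, 3*512*1024) // same ~1.5 MB budget
	fmt.Println(len(got)) // prints 5
}
```

With 300 KB blocks and a 1.5 MB budget, the store reads 5 blocks and never touches the other 995, which is exactly the waste the optimization above is meant to avoid.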
Related to #245.
Possibly closed by #250
We need to spot-check this once the network has updated to a recent version of p2p.
Possibly resolved by #245.
Is there an existing issue for this?
Current behavior
While syncing, a large portion of peers show remote RPC timeouts, causing error scores to rise quickly.
Expected behavior
Remote RPC timeouts should be rare, occurring only when a peer is actually unresponsive.
Steps to reproduce
Environment
Anything else?