Testing PR #137 has taken a long time because unrelated issues caused test nodes to hang and do nothing, crash, or hang and then crash:
The old_heights in get_get_blocks_message only went back a maximum of 4096 blocks. This made it difficult to restart nodes after 5 days, because the fallback is to restart IBD from height=1.
There was a race condition in the call to start_sending(), which sometimes caused nodes to hang.
The disconnect() logic didn't handle errors correctly, which sometimes caused nodes to hang.
It was hard to tell the status of other peers, and whether this node or other nodes were stuck.
The logic in ChainManager.step() caused multiple IBDs to be initiated, flooding the local peer and slowing IBD to a crawl.
Asking for more inventory at the end of inventory could hit a race condition and restart IBD from height=1, again slowing IBD to a crawl.
These bugs have now been fixed in peer networking performance fixes. Ideally these changes would be a separate PR; unfortunately, they have only been tested together with the other changes in this PR.
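
For readers unfamiliar with the old_heights problem, here is a minimal sketch of one common way to avoid a fixed look-back window: list the most recent heights densely, then double the gap until height 1 is reached, so a node that has been offline for a long time can still find a common block without falling back to a full IBD. This is not necessarily how this PR fixes it. The function name `locator_heights` and the `dense` parameter are hypothetical and do not come from the codebase.

```python
def locator_heights(tip_height: int, dense: int = 10) -> list[int]:
    """Return heights from the tip back to height 1, spaced exponentially.

    Hypothetical helper for illustration only; not taken from the PR.
    """
    if tip_height < 1:
        return []
    heights = []
    step = 1
    height = tip_height
    while height > 1:
        heights.append(height)
        # Once `dense` heights have been listed, double the gap each time.
        if len(heights) >= dense:
            step *= 2
        height -= step
    # Always finish with height 1 as the last-resort common block.
    heights.append(1)
    return heights
```

With this spacing the list grows roughly logarithmically with chain height, so a long chain still needs only a few dozen entries.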