This PR updates a few constants used in the block fetcher to suit PoS block timings, i.e. a 2s block time compared to Ethereum's 12s.
Rationale
The block fetcher handles downloading of block headers and bodies near the chain tip, acting on block and block hash announcements. A constant `maxQueueDist` is used to queue block announcements, and its value was initially set to 32. This means it only stores block announcements within 32 blocks of the chain tip; the rest are dropped with the log `Peer discarded announcement`.
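For context, here is a minimal sketch of the distance check described above. It is modeled on the fetcher's announcement handling; the function name and exact condition are paraphrased for illustration, not the verbatim upstream code:

```go
package main

import "fmt"

// maxQueueDist mirrors the fetcher constant discussed above: announcements
// more than this many blocks ahead of the local chain head are dropped.
const maxQueueDist = 32

// shouldDiscard reports whether a block announcement for `number` would be
// dropped given the current chain height. Simplified sketch of the check
// performed when queuing announcements near the tip.
func shouldDiscard(number, chainHeight uint64) bool {
	dist := int64(number) - int64(chainHeight)
	return dist > maxQueueDist
}

func main() {
	fmt.Println(shouldDiscard(132, 100)) // dist = 32, within the window: kept
	fmt.Println(shouldDiscard(133, 100)) // dist = 33 > 32: discarded
}
```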
While debugging sync issues observed recently, we identified that old block announcements were dropped while newer ones were included. This stalled the import, since some blocks would then be missing. The import can only proceed in two cases:
1. The missing blocks are queried again -- since we dropped the announcement much earlier, this is difficult to do in the fetcher (it may require some rework, but easier options exist).
2. The downloader kicks in to find gaps and download blocks -- this runs on a timer, so one has to wait until it triggers. It also works under the assumption that we received the new chain head from a peer with an updated total difficulty; otherwise, the import can still stall.
E.g., let's say the chain tip is at 100. Bor received block announcements up to 132 and is about to process these blocks. At the same time, the block announcement for 133 arrives and is dropped because the difference from the chain tip is > 32. More announcements can also be dropped if the original import takes more time. For simplicity, let's say the import of blocks 101-132 finished and the chain head is now 132. Bor can now accept new block announcements, i.e. 134, 135, and so on. But because 133 was dropped, it doesn't have any context about that block, and it keeps waiting despite having more blocks lined up. There's no clear way to proceed at this point.
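The stall in the example above can be illustrated with a toy simulation (purely illustrative; the real fetcher's queue and import loop are more involved):

```go
package main

import "fmt"

const maxQueueDist = 32

// simulate queues announcements 101..135 against a head of 100, dropping
// anything more than maxQueueDist blocks ahead, then imports block by block
// until the first gap. It returns the height at which import stalls.
func simulate() uint64 {
	head := uint64(100)
	queued := map[uint64]bool{}

	// Announcements for 101..135 arrive while the head is still 100:
	// anything more than maxQueueDist ahead (133 and beyond) is dropped.
	for n := uint64(101); n <= 135; n++ {
		if int64(n)-int64(head) <= maxQueueDist {
			queued[n] = true
		}
	}

	// Import proceeds strictly block by block and stops at the first gap.
	for queued[head+1] {
		head++
	}
	return head
}

func main() {
	fmt.Println(simulate()) // 132: block 133 was never queued, so import stalls there
}
```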
The simplest option for now is to increase the queue limit so that more block announcements are stored rather than dropped. The natural choice is a value 6x the original, since blocks arrive 6x faster (2s vs 12s).
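Under that reasoning, the adjustment looks like the following sketch (the constant names are illustrative; the actual names and final value are whatever the diff in this PR sets):

```go
package main

import "fmt"

// Sketch of the scaling: Ethereum's 12s block time divided by Polygon PoS's
// 2s gives a 6x factor, so the queue window grows from 32 to 192 blocks.
const (
	ethBlockTime = 12 // seconds
	posBlockTime = 2  // seconds
	scaleFactor  = ethBlockTime / posBlockTime

	oldMaxQueueDist = 32
	newMaxQueueDist = oldMaxQueueDist * scaleFactor
)

func main() {
	fmt.Println(newMaxQueueDist) // 192
}
```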
Changes
[x] Bugfix (non-breaking change that solves an issue)
[ ] Hotfix (change that solves an urgent issue, and requires immediate attention)
[ ] New feature (non-breaking change that adds functionality)
[ ] Breaking change (change that is not backwards-compatible and/or changes current functionality)
[ ] Changes only for a subset of nodes
Checklist
[x] I have added at least 2 reviewers or the whole pos-v1 team
[x] I have added sufficient documentation in code
[x] I will be resolving comments - if any - by pushing each fix in a separate commit and linking the commit hash in the comment reply
[x] Created a task in Jira and informed the team for implementation in Erigon client (if applicable)
[ ] Includes RPC methods changes, and the Notion documentation has been updated
Manual tests
Tested sync on an internal mainnet node