Here is another example with version 0.15.0. This log contains two epochs of data. In both epochs, at the transition from block 2980799 to 2980800 (election block) and from block 3002399 to 3002400 (election block), the error messages
Failed to dispatch gossipsub message topic=blocks error=no available capacity
and
Mempool-verify tx invalid at this block height
start to show up. The node is unable to apply the election block and falls behind; after some time it catches up. If this node happens to be an elected validator, it gets deactivated, and because it can no longer enforce the validity window, it will not reactivate.
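For context, here is a minimal sketch of the kind of validity-window check behind the Mempool-verify tx invalid at this block height message. The struct, field names, and window size are placeholders, not the actual core-rs-albatross code; the point is just that a node whose head is stuck behind the network sees current transactions (including its own reactivate transaction) as out of range:

```rust
// Illustrative only: names and the window size are assumptions, not the real
// core-rs-albatross types.
const VALIDITY_WINDOW: u32 = 120; // placeholder size

struct Tx {
    validity_start_height: u32,
}

fn mempool_verify(tx: &Tx, head_height: u32) -> Result<(), &'static str> {
    let start = tx.validity_start_height;
    let end = start.saturating_add(VALIDITY_WINDOW);
    // A transaction is only accepted while the chain head is inside its
    // validity window; a lagging head makes current transactions look invalid.
    if head_height < start || head_height > end {
        return Err("tx invalid at this block height");
    }
    Ok(())
}

fn main() {
    // Node stuck around the first election block, transaction built for the
    // current epoch: the mempool check fails.
    let tx = Tx { validity_start_height: 3_002_300 };
    assert!(mempool_verify(&tx, 2_980_799).is_err());
}
```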
As requested by @jsdanielh, here is an example with version 0.18.0.
2023-12-22T05:23:15.950207573Z DEBUG push | Accepted block block=#11458449:MI:8f7c8e758a num_transactions=127 kind="extend"
2023-12-22T05:23:26.920513733Z DEBUG block_queue | Received block via gossipsub block=#11458450:MA:ddcf6efd90 peer_id=12D3KooWGbKfQvV4r59BCYHoRzGbARTRFsxbx5SpUQb8C3ciQSiM
Then 15 minutes later
2023-12-22T05:38:21.353751307Z DEBUG push | Accepted block block=#11458450:MA:ddcf6efd90 num_transactions=0 kind="extend"
2023-12-22T05:38:21.483950188Z DEBUG push | Accepted block block=#11458451:MI:a7a98a8ab8 num_transactions=0 kind="extend"
2023-12-22T05:38:21.506463786Z DEBUG push | Accepted block block=#11458452:MI:027e23ebd2 num_transactions=0 kind="extend"
It is able to accept blocks up to 11458465 and then halts, since it never requests/receives block 11458466. Gossipsub messages for other blocks keep coming in, but nothing is done with them.
Full log of this example: full-node-election-block-issue.log
Per @viquezclaudio's reproduction, this slowness is caused by history pruning on election blocks.
We have pushed a temporary solution to this issue that makes the history full nodes keep the history instead of pruning it upon epoch finalization.
The real solution would be to have a lighter version of the history store for full nodes.
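For reference, this is roughly what the workaround amounts to; the type and method names below are made up for illustration and are not the actual core-rs-albatross API:

```rust
// Illustrative sketch of the workaround, not the actual implementation.
struct HistoryStore {
    // Temporary workaround flag: when set, finished epochs are kept instead
    // of being pruned, so block application does not stall at election blocks.
    keep_full_history: bool,
}

impl HistoryStore {
    fn on_epoch_finalized(&mut self, finished_epoch: u32) {
        if self.keep_full_history {
            // Workaround: skip the expensive pruning step. Disk usage grows,
            // but the node keeps applying blocks.
            return;
        }
        self.remove_epoch(finished_epoch);
    }

    fn remove_epoch(&mut self, epoch: u32) {
        // Dropping a whole epoch's transactions is the slow operation that was
        // observed to block full nodes at election blocks.
        println!("pruning history of epoch {epoch}");
    }
}

fn main() {
    let mut store = HistoryStore { keep_full_history: true };
    store.on_epoch_finalized(42); // with the workaround, nothing is pruned
}
```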
I'm guessing you are talking about https://github.com/nimiq/core-rs-albatross/commit/e4bb4a6a4e34e0d2f0501647b04fc5e57417d56f. You have a typo in your comment: this is for full nodes, not history nodes.
Also, does this commit mean that full nodes are now history nodes with this "workaround" and need much more disk space?
For now yes. Testnet full nodes will start storing everything from the next testnet restart. We are working on a Light History Store that is efficient at pruning and also stores much less data.
That's a pretty significant change to full node behavior, even if just as a workaround.
However, I guess it will not sync the whole history on startup? It just doesn't delete history anymore? Which means that if after a while I stop the node, delete the database, and restart it, the node will not sync back the deleted history, but will start with a small database that only grows with new blocks?
That is correct. It will use the same syncing mechanism as before to get to the head of the chain, but from there on it won't remove old epochs as soon as we hit an election block.
Correct, this is just a temporary workaround while we implement the proper solution: a special lightweight version of the history store for full nodes that allows us to properly construct (and verify) the history root without having to store the full transactions.
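To make that direction a bit more concrete, here is a rough sketch of the idea. This is an assumed design, not the actual implementation: the real history store is a Merkle Mountain Range, and the hashing here is simplified to SHA-256 via the sha2 crate. The point is that only 32 bytes per transaction are kept, enough to recompute the history root, instead of the full transaction data:

```rust
// Assumed design for illustration only, not the core-rs-albatross history store.
use sha2::{Digest, Sha256};

struct LightHistoryStore {
    // One leaf hash per historic transaction; the raw transactions themselves
    // are not kept.
    leaf_hashes: Vec<[u8; 32]>,
}

impl LightHistoryStore {
    fn push_transaction(&mut self, raw_tx: &[u8]) {
        self.leaf_hashes.push(Sha256::digest(raw_tx).into());
    }

    // Recompute a simple binary Merkle root over the stored hashes. Checking
    // an election block's history root then needs only these 32-byte leaves.
    fn history_root(&self) -> [u8; 32] {
        if self.leaf_hashes.is_empty() {
            return [0u8; 32];
        }
        let mut level = self.leaf_hashes.clone();
        while level.len() > 1 {
            level = level
                .chunks(2)
                .map(|pair| {
                    let mut hasher = Sha256::new();
                    hasher.update(pair[0]);
                    hasher.update(*pair.get(1).unwrap_or(&pair[0]));
                    let parent: [u8; 32] = hasher.finalize().into();
                    parent
                })
                .collect();
        }
        level[0]
    }
}

fn main() {
    let mut store = LightHistoryStore { leaf_hashes: Vec::new() };
    store.push_transaction(b"tx 1");
    store.push_transaction(b"tx 2");
    println!("history root: {:?}", store.history_root());
}
```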
Currently, fully synced full nodes have an issue where, after some event, they lose the ability to store new incoming blocks and therefore can't keep up.
Multiple incidents have been observed, and so far they all show the same pattern just before the node stops storing blocks:
After this, only three log messages are displayed, and interestingly, receiving blocks via gossipsub is one of them. This gives the impression that the connections with other nodes aren't actually dropped, but that the node is unable to process the messages. Deadlock? (A sketch of this suspected pattern follows the attached log below.) Note that the consensus head hangs at #233038 while new gossipsub messages keep coming in. This looks similar to #1692. At some point the node falls so far behind that verification of mempool txs starts to fail.
full-node-cant-keep-up.log
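Not a claim about where the bug actually is, but a minimal tokio sketch of the suspected pattern above: gossipsub blocks keep arriving on a bounded queue while the task that should apply them is stuck in one long operation, so the queue reports no capacity and the head never advances.

```rust
// Illustration of the suspected pattern only, not the actual node code.
use std::time::Duration;
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Small bounded queue between the network layer and block processing.
    let (tx, mut rx) = mpsc::channel::<u32>(4);

    // "Network" side: blocks after #233038 keep arriving via gossipsub.
    let producer = tokio::spawn(async move {
        for height in 233_039u32..233_059 {
            if tx.try_send(height).is_err() {
                // Analogue of the observed "no available capacity" error:
                // some bounded queue is full because nothing drains it.
                eprintln!("no available capacity, dropping block #{height}");
            }
            tokio::time::sleep(Duration::from_millis(50)).await;
        }
    });

    // "Consensus" side: stuck in one long operation (e.g. pruning a whole
    // epoch of history) before it ever polls the queue again.
    tokio::time::sleep(Duration::from_secs(2)).await;

    producer.await.unwrap();
    while let Ok(height) = rx.try_recv() {
        // Only the few blocks that fit in the queue get applied; everything
        // else was dropped while the node looked "hung".
        println!("finally applying block #{height}");
    }
}
```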