nimiq / core-rs-albatross

Rust implementation of the Albatross protocol
https://nimiq.com

Full node unable to apply election block #1704

Closed · Eligioo closed this issue 5 months ago

Eligioo commented 1 year ago

Fully synced full nodes currently have the issue that, after some event, they lose the ability to store new incoming blocks and therefore can't keep up with the chain.

Multiple incidents have been observed, and so far they all show the same pattern just before the node stops being able to store blocks:

2023-07-13T13:04:40.637552192Z INFO  micro                | Skip block pushed block_number=233038
2023-07-13T13:04:40.663131683Z DEBUG validator            | Initializing block producer next_block_number=233039
2023-07-13T13:04:40.672803841Z DEBUG micro                | Not our turn, waiting for micro block #233039 block_number=233039 slot_band=4
2023-07-13T13:04:40.674427687Z DEBUG block_queue          | Received block via gossipsub block=#233039:MI:5633601d56
2023-07-13T13:04:40.674597462Z DEBUG block_queue          | Received block via gossipsub block=#233038:MI:0d572aaf9d
2023-07-13T13:04:40.992743659Z INFO  chain_ordering       | New block in Unknown chain fork_block_number=233037 current_block_number=233038 new_block_number=233038
2023-07-13T13:04:40.992827717Z DEBUG push                 | Storing block - on fork block=#233038:MI:74af9b8a35
2023-07-13T13:04:42.066183858Z INFO  chain_ordering       | New block in Inferior chain fork_block_number=233037 current_block_number=233038 new_block_number=233038
2023-07-13T13:04:42.066278922Z DEBUG push                 | Storing block - on inferior chain block=#233038:MI:0d572aaf9d
2023-07-13T13:04:42.681963371Z DEBUG block_queue          | Received block via gossipsub block=#233040:MA:686b97ed50

After this, only three log messages are displayed, and interestingly, receiving blocks via gossipsub is one of them. This gives the impression that the connections to other nodes aren't actually dropped; the node is simply unable to process what comes in. Deadlock?

2023-07-13T13:04:43.736174015Z ERROR network_impl         | could not send request peer_id=12D3KooWAjftHreEDsNMK3qPBSe63LZcQorv6q8H3FR7CWg57CxV error=Inbound error: No receiver for request
2023-07-13T13:04:43.889421145Z DEBUG consensus            | Initiating head requests (to 11 peers)
2023-07-13T13:04:45.422861040Z DEBUG block_queue          | Received block via gossipsub block=#233040:MA:686b97ed50
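
To illustrate the suspected failure mode, here is a minimal, self-contained sketch (plain tokio, not Nimiq code, all names invented) of a block-processing task that stalls while holding the chain state, while the network side keeps delivering gossip messages that are never consumed:

use std::sync::Arc;
use std::time::Duration;
use tokio::sync::{mpsc, Mutex};

#[tokio::main]
async fn main() {
    let head = Arc::new(Mutex::new(233_038u32)); // stand-in for the chain head
    let (tx, mut rx) = mpsc::channel::<u32>(64);

    // Network side: keeps receiving blocks via gossipsub and queueing them.
    tokio::spawn(async move {
        for block in 233_039u32.. {
            if tx.send(block).await.is_err() {
                break;
            }
            tokio::time::sleep(Duration::from_millis(100)).await;
        }
    });

    // Processing side: takes the chain lock, then awaits something that never
    // completes (e.g. another task that needs the same lock). From this point
    // on, blocks keep arriving but are never applied and the head never advances.
    let chain = Arc::clone(&head);
    let processor = tokio::spawn(async move {
        while let Some(block) = rx.recv().await {
            let mut current = chain.lock().await;
            *current = block;
            std::future::pending::<()>().await; // simulated deadlock: hangs forever
        }
    });

    let _ = processor.await;
}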

Note that the consensus head hangs at #233038 while new gossipsub messages keep coming in. This looks similar to #1692.

2023-07-13T13:06:18.681751953Z DEBUG block_queue          | Received block via gossipsub block=#233127:MI:2373e55674
2023-07-13T13:06:09.389811588Z INFO  nimiq_client         | Consensus status: true - Head: #233038- 5506df7b33945b78f7985aa0b20a876b55ee43e90575349aa0480dc53303c1a2 block_number=233038 num_peers=14 status=true
2023-07-13T13:06:29.675350336Z DEBUG block_queue          | Received block via gossipsub block=#233138:MI:0ac26a9b0e
2023-07-13T13:06:29.389776842Z INFO  nimiq_client         | Consensus status: true - Head: #233038- 5506df7b33945b78f7985aa0b20a876b55ee43e90575349aa0480dc53303c1a2 block_number=233038 num_peers=14 status=true
2023-07-13T13:06:48.813880555Z DEBUG block_queue          | Received block via gossipsub block=#233157:MI:379d8d4a25
2023-07-13T13:06:49.389738268Z INFO  nimiq_client         | Consensus status: true - Head: #233038- 5506df7b33945b78f7985aa0b20a876b55ee43e90575349aa0480dc53303c1a2 block_number=233038 num_peers=14 status=true

At some point the node falls so far behind that verification of mempool transactions starts to fail:

2023-07-13T13:05:52.071047472Z DEBUG verify               | Mempool-verify tx invalid at this block height block_number=233039 validity_start_height=233100
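
For context, the failing check compares the transaction's validity window against the node's (stalled) head. A rough sketch of the rule (the window length and function names below are assumptions, not the actual Nimiq code):

// A transaction is only accepted while the head is inside its validity window.
const TRANSACTION_VALIDITY_WINDOW: u32 = 120; // assumed length in blocks

fn tx_valid_at(validity_start_height: u32, block_number: u32) -> bool {
    block_number >= validity_start_height
        && block_number < validity_start_height + TRANSACTION_VALIDITY_WINDOW
}

fn main() {
    // From the log above: the head is stuck at #233039 while the transaction
    // only becomes valid at #233100, so mempool verification rejects it.
    assert!(!tx_valid_at(233_100, 233_039));
    assert!(tx_valid_at(233_100, 233_150));
}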

full-node-cant-keep-up.log

Eligioo commented 1 year ago

Here is another example with version 0.15.0. The log contains two epochs of data; the epoch transitions are from block 2980799 to 2980800 (election block) and from block 3002399 to 3002400 (election block). Around both transitions the error messages

Failed to dispatch gossipsub message topic=blocks error=no available capacity and Mempool-verify tx invalid at this block height

start to show up. The node is unable to apply the election block and falls behind; after some time it catches up. If this node happens to be an elected validator, it will be deactivated, and because it can no longer enforce the validity window, it will not reactivate.
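
The "no available capacity" wording is what a full bounded tokio channel reports when a message has to be dropped instead of delivered, which fits a consumer that has stopped draining its queue. A small illustration of the mechanism (names invented, not the actual Nimiq dispatch path):

use tokio::sync::mpsc;
use tokio::sync::mpsc::error::TrySendError;

fn main() {
    // Small buffer between the network layer and block processing; the
    // receiver is never polled here, simulating a stalled consumer.
    let (tx, _rx) = mpsc::channel::<&'static str>(2);

    tx.try_send("block #2980799").unwrap();
    tx.try_send("block #2980800 (election block)").unwrap();

    // The queue is full: the next message is dropped rather than delivered.
    match tx.try_send("block #2980801") {
        // TrySendError::Full displays as "no available capacity".
        Err(TrySendError::Full(msg)) => {
            eprintln!("Failed to dispatch gossipsub message ({msg}): no available capacity");
        }
        other => println!("unexpected: {other:?}"),
    }
}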

nimiq-client.log

Eligioo commented 10 months ago

As requested by @jsdanielh, an example in 0.18.0.

2023-12-22T05:23:15.950207573Z DEBUG push                 | Accepted block block=#11458449:MI:8f7c8e758a num_transactions=127 kind="extend"
2023-12-22T05:23:26.920513733Z DEBUG block_queue          | Received block via gossipsub block=#11458450:MA:ddcf6efd90 peer_id=12D3KooWGbKfQvV4r59BCYHoRzGbARTRFsxbx5SpUQb8C3ciQSiM

Then, 15 minutes later:

2023-12-22T05:38:21.353751307Z DEBUG push                 | Accepted block block=#11458450:MA:ddcf6efd90 num_transactions=0 kind="extend"
2023-12-22T05:38:21.483950188Z DEBUG push                 | Accepted block block=#11458451:MI:a7a98a8ab8 num_transactions=0 kind="extend"
2023-12-22T05:38:21.506463786Z DEBUG push                 | Accepted block block=#11458452:MI:027e23ebd2 num_transactions=0 kind="extend"

It is able to accept blocks up to #11458465 and then halts, since it never requests or receives block #11458466. Gossipsub messages for other blocks keep coming in, but nothing is done with them.

Full log of this example: full-node-election-block-issue.log

jsdanielh commented 9 months ago

Per @viquezclaudio's reproduction, this slowness is caused by history pruning on election blocks.

ii-cruz commented 9 months ago

We have pushed a temporary solution to this issue that makes the history full nodes keep the history instead of pruning upon epoch finalization. The real solution would be to have a lighter version of the history store for full nodes.

sisou commented 9 months ago

I'm guessing you are talking about https://github.com/nimiq/core-rs-albatross/commit/e4bb4a6a4e34e0d2f0501647b04fc5e57417d56f. You have a typo in your comment: this is for full nodes, not history nodes.

Also, does this commit mean that full nodes are now history nodes with this "workaround" and need much more disk space?

ii-cruz commented 9 months ago

For now, yes. Testnet full nodes will start storing everything from the next testnet restart. We are working on a Light History Store that is efficient at pruning and also stores much less data.

sisou commented 9 months ago

That's a pretty significant change to full node behavior, even if just as a workaround.

However, I guess it will not sync the whole history on startup? It just no longer deletes history? Which means that if, after a while, I stop the node, delete the database, and restart it, the node will not sync back the deleted history, but will start with a small database that only grows with new blocks?

Eligioo commented 9 months ago

However, I guess it will not sync the whole history on startup? It just no longer deletes history? Which means that if, after a while, I stop the node, delete the database, and restart it, the node will not sync back the deleted history, but will start with a small database that only grows with new blocks?

That is correct. It will use the same syncing mechanism as before to get to the head of the chain, but from there on it won't remove old epochs as soon as an election block is reached.

viquezclaudio commented 9 months ago

Correct, this is just a temporary workaround while we implement the right solution: a special lightweight version of the history store for full nodes that allows us to properly construct (and verify) the history root without having to store the full transactions.
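
A rough sketch of that idea (illustrative only, not the actual design; the real history store uses a more involved Merkle structure): keep only the per-transaction hashes of each epoch so the history root can still be recomputed and verified, while the full transaction bodies are dropped.

use sha2::{Digest, Sha256};

type Hash = [u8; 32];

// Hypothetical light history store: one 32-byte hash per historic item
// instead of the full transaction body.
#[derive(Default)]
struct LightHistoryStore {
    leaf_hashes: Vec<Hash>,
}

impl LightHistoryStore {
    fn push_transaction(&mut self, raw_tx: &[u8]) {
        self.leaf_hashes.push(Sha256::digest(raw_tx).into());
    }

    // Recompute a Merkle-style root over the stored leaf hashes; verifying a
    // received root then does not require the transactions themselves.
    fn history_root(&self) -> Hash {
        if self.leaf_hashes.is_empty() {
            return [0u8; 32];
        }
        let mut level = self.leaf_hashes.clone();
        while level.len() > 1 {
            level = level
                .chunks(2)
                .map(|pair| {
                    let mut hasher = Sha256::new();
                    hasher.update(pair[0]);
                    hasher.update(*pair.get(1).unwrap_or(&pair[0]));
                    hasher.finalize().into()
                })
                .collect();
        }
        level[0]
    }
}

fn main() {
    let mut store = LightHistoryStore::default();
    store.push_transaction(b"tx 1");
    store.push_transaction(b"tx 2");
    println!("history root: {:x?}", store.history_root());
}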