siladu opened 1 week ago
@matkt How confident are you that auto-heal would fix this if the trie node were accessed?
Is there a regression test for the auto-heal generally?
Do you think this should be higher priority than P3?
I think I found the root cause. I have a fix we can try: https://github.com/hyperledger/besu/pull/7624
Regarding auto-heal, a simple test is to use Bela to remove the root node of a heavily used contract, then wait for a transaction that touches it. We don't have a good test for this part. This behavior should be removed in the future because it switches the FULL flat db to a PARTIAL flat db.
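For anyone unfamiliar with the mechanism being tested here, the scenario can be sketched abstractly: a node store keyed by hash, the storage root deleted out from under it (as Bela would do), and an access path that detects the gap and re-fetches the node from a peer. This is only an illustrative simulation with invented names (`TrieNodeStore`, `getWithAutoHeal`), not Besu's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for a Bonsai trie-node store; not Besu's real API.
class TrieNodeStore {
    private final Map<String, byte[]> nodes = new HashMap<>();
    private final Map<String, byte[]> peer; // hypothetical "network" source used for healing

    TrieNodeStore(Map<String, byte[]> peer) { this.peer = peer; }

    void put(String hash, byte[] node) { nodes.put(hash, node); }

    // Simulates Bela deleting a contract's storage root node.
    void corrupt(String hash) { nodes.remove(hash); }

    boolean contains(String hash) { return nodes.containsKey(hash); }

    // Access path: if the node is missing locally, "heal" by re-fetching
    // it from a peer and persisting it before serving the read.
    byte[] getWithAutoHeal(String hash) {
        byte[] node = nodes.get(hash);
        if (node == null) {
            node = peer.get(hash); // a snap/heal request in a real client
            if (node == null) throw new IllegalStateException("unhealable: " + hash);
            nodes.put(hash, node); // persist the healed node
        }
        return node;
    }
}

public class AutoHealSketch {
    public static void main(String[] args) {
        Map<String, byte[]> peer = new HashMap<>();
        peer.put("0xroot", new byte[] {1, 2, 3});

        TrieNodeStore store = new TrieNodeStore(peer);
        store.put("0xroot", new byte[] {1, 2, 3});

        store.corrupt("0xroot"); // remove the storage root, as with Bela
        byte[] healed = store.getWithAutoHeal("0xroot"); // a transaction touches the contract
        System.out.println(healed.length + " bytes healed, present=" + store.contains("0xroot"));
    }
}
```

The point of the test described above is exactly the second half of `main`: after corruption, the first access that touches the missing node should repair it transparently.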
Ran fix #7624 on 20 nodes and found no errors in either the besu logs or the bonsai tree verifier logs. Good news, but I think we need more testing to be fully confident.
There was one RocksDB busy error during trie heal, which recovered.
Using https://github.com/hyperledger/besu/compare/main...matkt:feature/fix-healing-busy-issue?expand=1, I synced another 20 nodes with no bugs ✅ x20
Two more recoverable RocksDB warnings during trie heal though.
I have tested fix #7624 on a total of 80 nodes (using mainnet checkpoint sync) and have not seen the issue 🎉
2/80 did suffer from #7619, but that's unrelated to this fix.
Successfully synced, bug-free nodes can occasionally still be missing trie node data, as discovered by running the Bela BonsaiTreeVerifier.
For example:
Frequency: three occurrences in recent burn-ins.
Each time it has been the storage root hash,
0x
that has been missing. Some child node data is present for this storage trie. More details: https://github.com/Consensys/protocol-misc/issues/972#issuecomment-2339351262
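Conceptually, the condition the verifier is catching looks like this: a walk from the root that resolves every referenced child hash and flags any hash with no backing node, which is how a missing storage root can coexist with present child nodes. A minimal sketch with invented names (`Node`, `findMissing`), not the tool's real code:

```java
import java.util.*;

public class MissingNodeWalk {
    // A trie node is reduced to its child hash references here;
    // real nodes also carry extension paths and leaf values.
    record Node(List<String> childHashes) {}

    // Walks from rootHash and returns every referenced hash
    // that has no backing node in the store.
    static Set<String> findMissing(Map<String, Node> store, String rootHash) {
        Set<String> missing = new LinkedHashSet<>();
        Set<String> seen = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(rootHash);
        while (!stack.isEmpty()) {
            String hash = stack.pop();
            if (!seen.add(hash)) continue;
            Node node = store.get(hash);
            if (node == null) {
                missing.add(hash); // e.g. an absent storage root
                continue;          // cannot descend further from here
            }
            node.childHashes().forEach(stack::push);
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<String, Node> store = new HashMap<>();
        store.put("root", new Node(List.of("acct1", "acct2")));
        store.put("acct1", new Node(List.of("storageRoot1"))); // references a storage root
        store.put("acct2", new Node(List.of()));
        // "storageRoot1" itself is absent, yet an orphaned child of that
        // storage trie happens to be present in the store:
        store.put("storageChild", new Node(List.of()));

        System.out.println(findMissing(store, "root")); // prints [storageRoot1]
    }
}
```

Note that the walk only reports nodes reachable from the root; the orphaned `storageChild` entry is never visited, mirroring the observation that some child data survives while the storage root is gone.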