siladu opened 1 week ago
@matkt How confident are you that auto-heal would fix this if the trie node were accessed?
Is there a regression test for the auto-heal generally?
Do you think this should be higher priority than P3?
I think I found the root cause. I have a fix we can try: https://github.com/hyperledger/besu/pull/7624
Regarding auto-heal, a simple test is to use Bela to remove the root node of a heavily used contract, then wait for a transaction that touches it. We don't have a good test for this part. This behavior should be removed in the future because it switches the FULL flat db to a PARTIAL flat db.
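For anyone unfamiliar with the mechanism being tested here, the scenario can be sketched abstractly: a node store keyed by hash, the storage root deleted out from under it (as Bela would do), and an access path that detects the gap and re-fetches the node from a peer. This is only an illustrative simulation with invented names (`TrieNodeStore`, `getWithAutoHeal`), not Besu's actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for a Bonsai trie-node store; not Besu's real API.
class TrieNodeStore {
    private final Map<String, byte[]> nodes = new HashMap<>();
    private final Map<String, byte[]> peer; // hypothetical "network" source used for healing

    TrieNodeStore(Map<String, byte[]> peer) { this.peer = peer; }

    void put(String hash, byte[] node) { nodes.put(hash, node); }

    // Simulates Bela deleting a contract's storage root node.
    void corrupt(String hash) { nodes.remove(hash); }

    boolean contains(String hash) { return nodes.containsKey(hash); }

    // Access path: if the node is missing locally, "heal" by re-fetching
    // it from a peer and persisting it before serving the read.
    byte[] getWithAutoHeal(String hash) {
        byte[] node = nodes.get(hash);
        if (node == null) {
            node = peer.get(hash); // a snap/heal request in a real client
            if (node == null) throw new IllegalStateException("unhealable: " + hash);
            nodes.put(hash, node); // persist the healed node
        }
        return node;
    }
}

public class AutoHealSketch {
    public static void main(String[] args) {
        Map<String, byte[]> peer = new HashMap<>();
        peer.put("0xroot", new byte[] {1, 2, 3});

        TrieNodeStore store = new TrieNodeStore(peer);
        store.put("0xroot", new byte[] {1, 2, 3});

        store.corrupt("0xroot"); // remove the storage root, as with Bela
        byte[] healed = store.getWithAutoHeal("0xroot"); // a transaction touches the contract
        System.out.println(healed.length + " bytes healed, present=" + store.contains("0xroot"));
    }
}
```

The point of the test described above is exactly the second half of `main`: after corruption, the first access that touches the missing node should repair it transparently.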
Ran fix #7624 on 20 nodes and found no errors in either the besu logs or the bonsai tree verifier logs. Good news, but I think we need more testing to be fully confident.
There was one RocksDB busy error during trie heal, which recovered.
Using https://github.com/hyperledger/besu/compare/main...matkt:feature/fix-healing-busy-issue?expand=1, I synced another 20 nodes with no bugs ✅ x20
Two more recoverable RocksDB warnings during trie heal though.
I have tested fix #7624 on a total of 80 nodes (using mainnet checkpoint sync) and have not seen the issue 🎉
2/80 did suffer from #7619, but that's unrelated to this fix.
Successfully synced, bug-free nodes can occasionally still be missing trie node data, as discovered by running the Bela BonsaiTreeVerifier.
For example:
Frequency: three occurrences in recent burn-ins.
Each time it has been the storage root hash,
0x
that has been missing. Some child node data is present for this storage trie. More details: https://github.com/Consensys/protocol-misc/issues/972#issuecomment-2339351262
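Conceptually, the condition the verifier is catching looks like this: a walk from the root that resolves every referenced child hash and flags any hash with no backing node, which is how a missing storage root can coexist with present child nodes. A minimal sketch with invented names (`Node`, `findMissing`), not the tool's real code:

```java
import java.util.*;

public class MissingNodeWalk {
    // A trie node is reduced to its child hash references here;
    // real nodes also carry extension paths and leaf values.
    record Node(List<String> childHashes) {}

    // Walks from rootHash and returns every referenced hash
    // that has no backing node in the store.
    static Set<String> findMissing(Map<String, Node> store, String rootHash) {
        Set<String> missing = new LinkedHashSet<>();
        Set<String> seen = new HashSet<>();
        Deque<String> stack = new ArrayDeque<>();
        stack.push(rootHash);
        while (!stack.isEmpty()) {
            String hash = stack.pop();
            if (!seen.add(hash)) continue;
            Node node = store.get(hash);
            if (node == null) {
                missing.add(hash); // e.g. an absent storage root
                continue;          // cannot descend further from here
            }
            node.childHashes().forEach(stack::push);
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<String, Node> store = new HashMap<>();
        store.put("root", new Node(List.of("acct1", "acct2")));
        store.put("acct1", new Node(List.of("storageRoot1"))); // references a storage root
        store.put("acct2", new Node(List.of()));
        // "storageRoot1" itself is absent, yet an orphaned child of that
        // storage trie happens to be present in the store:
        store.put("storageChild", new Node(List.of()));

        System.out.println(findMissing(store, "root")); // prints [storageRoot1]
    }
}
```

Note that the walk only reports nodes reachable from the root; the orphaned `storageChild` entry is never visited, mirroring the observation that some child data survives while the storage root is gone.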