mosaicnetworks / babble

Distributed Consensus Middleware
MIT License

[Need Help] How to restore 4 Suspended nodes without cleaning the data? #156

Closed diyism closed 2 years ago

diyism commented 2 years ago

Sorry for bothering you, but I can't find an answer about restoring Suspended nodes in all the old issues.

How to restore 4 Suspended nodes without cleaning the data?

I started 4 nodes with: babble run --store --bootstrap --heartbeat=100ms --moniker=node$i --cache-size=50000 --listen=172.77.5.$i:1337 --proxy-listen=172.77.5.$i:1338 --client-connect=172.77.10.$i:1339 --service-listen=172.77.5.$i:80 --sync-limit=100 --fast-sync=false --log=debug --webrtc=false --signal-addr=172.77.15.1:2443

and I can see all the states are "Babbling" and the "consensus_events" are synced in http://172.77.5.x/stats

but I accidentally shut down 2 nodes, so I restarted them. Now 3 nodes' states are "Suspended" and 1 node's state is "Babbling". All the "consensus_events" are 543, but the "undetermined_events" are 4193, 4492, 4073, and 4102 respectively.
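
For anyone comparing nodes the same way, the check can be scripted instead of read off each /stats page by hand. The following is only a minimal sketch: it assumes the endpoint returns a flat JSON object containing the "state", "consensus_events" and "undetermined_events" fields quoted above, and the helper names (fetch_stats, summarize, suspended_hosts, report) are hypothetical, not part of Babble.

```python
import json
from urllib.request import urlopen


def fetch_stats(host, timeout=5):
    """Fetch /stats from one node's service address (assumed to return JSON)."""
    with urlopen(f"http://{host}/stats", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize(stats_by_host):
    """Reduce per-node stats to the three fields discussed in this issue."""
    rows = []
    for host, stats in sorted(stats_by_host.items()):
        rows.append({
            "host": host,
            "state": stats.get("state", "unknown"),
            # Counters are passed through int() in case they arrive as strings.
            "consensus_events": int(stats.get("consensus_events", 0)),
            "undetermined_events": int(stats.get("undetermined_events", 0)),
        })
    return rows


def suspended_hosts(rows):
    """Return the hosts whose reported state is "Suspended"."""
    return [row["host"] for row in rows if row["state"] == "Suspended"]


def report(hosts):
    """Query every host and print one summary line per node."""
    rows = summarize({host: fetch_stats(host) for host in hosts})
    for row in rows:
        print(row)
    print("suspended:", suspended_hosts(rows))
```

Calling report() with the list of --service-listen addresses would print one row per node followed by the suspended ones. The concrete addresses depend on what $i expands to in the command above, so they are left to the caller.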

I've tried "babble run" with the "--fast-sync" parameter, but it doesn't work.

I know I can clean all the data of the 4 nodes with: rm -rf ~/.babble/badger_db* and then restart all of them, but I'm curious whether there's a way to restore the consensus of the 4 nodes without removing all the data. Any ideas?

arrivets commented 2 years ago

Hey! So I think what happened is that when the two nodes were shut down, the other nodes kept gossiping and creating new events that were not reaching consensus (because you need at least 3 good nodes out of 4). So the two nodes that remained online continued creating "undetermined" events until the number of undetermined events exceeded the suspend limit.
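
To make the arithmetic concrete, here is a rough sketch of the two conditions described above. It assumes the usual "strictly more than two thirds" rule, which matches the 3-out-of-4 figure, and it treats the suspend limit as a plain number passed in by the caller. The function names are hypothetical and this is not Babble's actual code.

```python
def supermajority(n_validators):
    """Smallest node count strictly greater than two thirds of the validators."""
    return 2 * n_validators // 3 + 1


def can_reach_consensus(n_online, n_validators):
    """True when enough nodes are online for events to reach consensus."""
    return n_online >= supermajority(n_validators)


def exceeds_suspend_limit(undetermined_events, limit):
    """True when the undetermined-event count has passed the given limit.

    `limit` is an assumed input; how a node derives its real threshold
    from its configuration is not modelled here.
    """
    return undetermined_events > limit


print(supermajority(4))           # 3
print(can_reach_consensus(2, 4))  # False: 2 online nodes are below the quorum of 3
print(can_reach_consensus(3, 4))  # True
```

With only 2 of 4 nodes online the quorum of 3 is never met, so every new event stays undetermined and the count can only grow until the limit is crossed.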

I am afraid we have no tools to recover from that. What we did in Monet was to start the node in maintenance mode, export the application state (in a new genesis file) at the point where the network broke, and restart a fresh network with this new genesis file.

But it would be good to have a tool to clean up the db when this suspend-limit is hit. Some way to look into the db and delete all undetermined events, maybe. Unfortunately I don't have time to work on this anymore.

Hope this helps

diyism commented 2 years ago

OK, thanks for your response.