synechron-finlabs / quorum-maker

Utility to create and monitor Quorum nodes
Apache License 2.0

How to deal with catastrophic node failure? #111

Open MaxBinnewies opened 5 years ago

MaxBinnewies commented 5 years ago

Hi guys,

I am currently trying to set up Quorum with quorum-maker in our cloud infrastructure. I have set it up on three VMs and it all seems fine: they are connected, our application can send transactions to the smart contracts, and those transactions show up in the UI.

However, I am worried about stability. Previously we tried running it on Kubernetes and had issues with pods just crashing. So for the VM-based solution I am looking for a way to recover a node when it fails catastrophically, meaning what to do when it not only shuts down but crashes in a manner where it can't be restarted. We are looking to run Quorum in a production environment very soon, so these issues need to be addressed.

I see three possible paths for that, but am not really getting anywhere with any of them:

1) My initial idea was to just replace all Quorum-related files in the directory with a backup of the same directory I took right after start-up. The VM, IP, other nodes, etc. would all stay the same. However, the problem is that the other nodes seem to remember what block this node was on. It complains about getting a different block number than what it expects. Note this is still the same chain, just in an earlier state than when the node was shut down. I was hoping it would just sync back up from the old backup state.

2) My second thought was similar to 1): just delete everything and start the node fresh (run setup.sh again) on the same machine, with the same IP, so it would just take the old one's place. However, it complains that the "enode"-id is not matching, and I can't figure out how to change that ID for the new node.

3) My third idea is to remove the node from the network entirely and start a new one. However, I can't figure out how to remove a node from the network, even if it is inactive. I found a Quorum wiki page: http://docs.goquorum.com/en/latest/Consensus/raft/ But it only says "attach to a JS console and issue raft.removePeer(raftId)" without explaining how to actually do that. How do I get this mythical console? I thought maybe it's related to geth, so I attached a geth terminal to my nodes, and it actually says that it has a "module" raft, whatever that means. However, when I type raft it just says "ReferenceError: 'raft' is not defined".

Help with any of these three approaches, or general advice on how to handle broken nodes, would be greatly appreciated.

Thanks, Max

abhayar commented 5 years ago

@MaxBinnewies As of now, there seems to be no way to restore a Quorum chain once it has crashed. You have to set up a new Quorum Raft network.

On your 2nd point: I didn't quite follow. You mentioned deleting everything and starting the node fresh. If you want to start an existing node, you just have to run start.sh. setup.sh is used only once, to create the network. If you are creating a fresh network of Quorum nodes, the enode ID is created automatically, so there is nothing to worry about. Could you please rephrase what you mean by 'complains that the "enode"-id is not matching'?

On your 3rd point: if you are using quorum-maker, the nodes run under Raft consensus only. To remove an inactive node from the network, you have to run raft.removePeer() from the geth console. To attach a geth console to a running node, run the commands below from your node's qdata directory. For example:

cd quorum-maker/node1/node/qdata
geth attach geth.ipc
raft.removePeer(2)

Here 2 is the raftId of the inactive node you want to remove from the network. The response will be "null".
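The removal steps in this reply could also be run non-interactively. The following is only a rough sketch, not a tested procedure. It assumes the Quorum build of geth is on the PATH, that its attach subcommand accepts the --exec flag, and that raft.cluster is available for looking up raftId values. IPC_PATH, RAFT_ID, APPLY and the two helper functions are hypothetical names introduced for illustration. One possible reason for the "ReferenceError: 'raft' is not defined" message above (an assumption, not confirmed in this thread) is attaching over an HTTP RPC endpoint that does not expose the raft module, or using a non-Quorum geth binary; attaching to geth.ipc as shown here avoids the first case.

```shell
#!/bin/sh
# Sketch only: remove an inactive Raft peer through the node's IPC endpoint.
# IPC_PATH and RAFT_ID are hypothetical placeholders; adjust both for your node.
IPC_PATH="${IPC_PATH:-quorum-maker/node1/node/qdata/geth.ipc}"
RAFT_ID="${RAFT_ID:-2}"

# Build the JavaScript expression that the console will evaluate.
remove_peer_expr() {
  printf 'raft.removePeer(%s)' "$1"
}

# Evaluate a single expression against the IPC endpoint, then exit.
# Assumption: the Quorum geth binary is on PATH and `attach` supports --exec.
console_exec() {
  geth attach --exec "$1" "$IPC_PATH"
}

# Only touch the running node when explicitly asked to (APPLY=1).
if [ "${APPLY:-0}" = "1" ]; then
  # List cluster members first so the raftId can be double-checked.
  console_exec "raft.cluster"
  # Remove the chosen peer.
  console_exec "$(remove_peer_expr "$RAFT_ID")"
else
  # Dry run: print the command that would remove the peer.
  echo "dry run: geth attach --exec \"$(remove_peer_expr "$RAFT_ID")\" $IPC_PATH"
fi
```

Run as-is, the script only prints the command it would execute; setting APPLY=1 makes it contact the node. Listing the cluster before removing anything is a deliberate choice, since removing the wrong raftId would take a healthy node out of the network.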