Closed maxpert closed 1 year ago
Thanks for bringing up the issue. What you have experienced is a common mistake by users, so I'd like to offer a detailed explanation.
In a consensus protocol, logs typically have to be persistently stored before an acknowledgement can be sent to confirm that the log has been successfully replicated onto a node. Such persistence means that once acknowledged, the log must remain there for as long as the node (with its intended ID) exists. This is required so that whichever node is replicating the log has a strong assurance that the replica exists on the expected remote node, even across reboots. Note that it is totally fine to destroy that node altogether - it is OK to have both the node and its data gone.
This is actually explained in the doc/devops.md. You may want to have a read on that doc first.
When you manually remove those directories, it violates the above-mentioned assurance. You can also look at this whole issue from a different angle - your node responded to the leader that the replicated logs were all persistently stored, but all the logs disappeared after the reboot because they were deleted. From the leader's point of view, it got tricked by that node - the acknowledgement became totally unreliable, as the replicated content can disappear at any time after such an acknowledgement. In short, in computer science terms, you basically injected a Byzantine fault into a typical non-Byzantine system.
Dragonboat ensures its data integrity, but as a fully decentralized system, sadly there is nothing it can do to automatically tell whether a node is a fresh new node without any data or a restarted node with all its data deleted. Just think about the simplest deployment, where you have only one replica in the shard (cluster in v3.0's terms).
More details on why there is nothing dragonboat can do to detect such a "someone deleted all my data" incident -
For a shard with a single replica, obviously there is nothing you can do once someone can delete all data belonging to the replica. On startup, it is just a node with no data; there is absolutely no way to tell whether there is no data because it is brand new or because someone deleted all its data.
For a shard with, say, 3 replicas, which is probably the most common setting, let's say there are nodes A, B and C. Assuming A is the leader and is replicating logs to B without any issue, this is enough to have all logs committed and applied, as a quorum of two is met (nodes A and B). Let's say C is isolated from A and thus has no knowledge of the committed logs from A. Again, this is totally fine; it is the exact feature we want from Raft/Paxos. Now let's say A is totally destroyed - the machine and its stored logs are just gone. This is still fine, as we still have all data on B; on startup, it should be elected as leader and start replicating its logs to C. However, if we manually remove all data stored by B and restart node B, this is going to be a problem - both B and C now know nothing about the committed logs, and both of them would believe that B is a brand new node. There is simply no way to tell that something really bad (B's data got removed and its persistent storage assurance got violated) has happened.
To largely prevent such issues, your application (not dragonboat) can use separate storage to keep some metadata on whether the node is brand new or not. As described in Paxos Made Live (linked below, see section 5.1), Google did this by placing a marker in GFS (a totally different storage service from where the logs were stored) once a node first boots. On restart, if there is no data but there is a marker in GFS, you can tell that something bad happened.
https://static.googleusercontent.com/media/research.google.com/en//archive/paxos_made_live.pdf
Dragonboat can't do this automatically - there is no such separate storage known to and accessible by dragonboat.
The above Google approach largely solves the problem, but it doesn't help when the marker can also be removed - again, see the examples above. When you assume that someone has the ability to delete all data relevant to a node, there is no way that node can tell whether it is brand new or not.
The bottom line here is dead simple - Raft/Paxos is a non-Byzantine system; when a node confirms that certain logs have been persistently stored on that node, that data must actually be persistently stored on that node. If, for whatever reason, nodes are allowed to cheat on such confirmations, your system will be in big trouble.
@lni I would like to use dragonboat's ondisk example in production. Is it possible to fix it so that a deleted node / starting a new one will automatically recover based on some sort of production-ready mechanism?
It's quite obvious that you shouldn't need a backup of those logs since this is a clustered implementation; it's assumed that a node will "eventually" fail with disk corruption.
so the question is:
@lni can you please help make the ondisk example production ready with the failure scenario mentioned, so we can all use it in production without worry? This is an expected working feature for the core of dragonboat.
appreciate this.
@sprappcom this is NOT a bug, it is not something anyone can fix.
@lni are you saying etcd doesn't solve this issue either?
how can this kind of system be used in production if one node's disk is corrupted?
Hi @sprappcom
It is not really an issue in the first place; it is how Paxos/Raft is supposed to work. Please read the explanation provided above again, particularly the concrete example (pasted below again for your convenience).
For a shard with, say, 3 replicas, which is probably the most common setting, let's say there are nodes A, B and C. Assuming A is the leader and is replicating logs to B without any issue, this is enough to have all logs committed and applied, as a quorum of two is met (nodes A and B). Let's say C is isolated from A and thus has no knowledge of the committed logs from A. Again, this is totally fine; it is the exact feature we want from Raft/Paxos. Now let's say A is totally destroyed - the machine and its stored logs are just gone. This is still fine, as we still have all data on B; on startup, it should be elected as leader and start replicating its logs to C. However, if we manually remove all data stored by B and restart node B, this is going to be a problem - both B and C now know nothing about the committed logs, and both of them would believe that B is a brand new node. There is simply no way to tell that something really bad (B's data got removed and its persistent storage assurance got violated) has happened.
A's node gets destroyed in a fire.
How do I create D to replace A?
The ondisk example doesn't work this way.
When replica A gets destroyed in a fire, you just bring up another replica and name it replica D. You can't name the new replica A, because that would confuse all the other replicas - they expect A to have all data up to a certain previously acknowledged point, but the new replica used to replace A doesn't have that data. From the other replicas' point of view, if the new replica calls itself A, then all of A's previous acknowledgements are no longer dependable or trustworthy, and everything gets stuck.
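Assuming dragonboat v3's NodeHost API (method names as of v3; double-check against your version and doc/devops.md), the replacement flow looks roughly like this - remove dead A from the membership, add D, then start D with `join=true` so it pulls all state from the surviving replicas. The clusterID, node IDs, address and config values below are illustrative:

```go
package main

import (
	"context"
	"time"

	"github.com/lni/dragonboat/v3"
	"github.com/lni/dragonboat/v3/config"
	"github.com/lni/dragonboat/v3/statemachine"
)

// replaceDeadReplica removes dead replica A (nodeID 1) from the membership
// and registers a brand new replica D (nodeID 4). Run this against a
// NodeHost hosting a healthy replica, e.g. B.
func replaceDeadReplica(nh *dragonboat.NodeHost) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := nh.SyncRequestDeleteNode(ctx, 128, 1, 0); err != nil {
		return err
	}
	return nh.SyncRequestAddNode(ctx, 128, 4, "node-d.example.com:63000", 0)
}

// startReplacement runs on the new machine: join=true with an empty
// initialMembers map, so D fetches everything from the surviving replicas.
// It must NOT reuse A's nodeID or A's old directories.
func startReplacement(nhD *dragonboat.NodeHost,
	newDiskSM statemachine.CreateOnDiskStateMachineFunc) error {
	cfg := config.Config{
		NodeID:             4,
		ClusterID:          128,
		ElectionRTT:        10,
		HeartbeatRTT:       1,
		SnapshotEntries:    1000,
		CompactionOverhead: 100,
	}
	return nhD.StartOnDiskCluster(map[uint64]string{}, true, newDiskSM, cfg)
}
```

This sketch requires a running cluster to do anything; the point is only the shape of the procedure: membership change first, then a join of the new, differently named replica.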
Please read the Paxos/Raft papers if you want to use this in production; all the details are in the papers.
@lni ok I "get it" now. So we still have to make backups of the replica's replica. Gosh, this really defeats the purpose of this Raft/Paxos thing.
I thought there must surely be a way to inform B and C that A is destroyed and that A can be "revived" from the ashes.
@sprappcom
Please read the Raft paper and the mentioned doc/devops.md doc to get a better understanding of the protocol itself.
So I am using dragonboat in marmot and I can now consistently reproduce a crash when bringing a dead node back up for the cluster after the node has lost its raft logs and snapshots have not been taken. See the repro steps below. It involves a race condition, so sometimes it doesn't crash, but after a couple of attempts you can reproduce it.
Dragonboat version
github.com/lni/dragonboat/v3 v3.3.5
Expected behavior
The node should get the logs and just replay those logs to bring it back into the cluster.
Actual behavior
Steps to reproduce the behavior
1. `StartOnDiskCluster` on all nodes, with the member list of each node pointing to the other nodes (i.e. fully connected)
2. `rm -rf` the raft directory so that it loses all logs but not the underlying disk-persistent cluster
3. Repeat the `rm -rf` step a couple of times; on the 2nd or 3rd attempt it should crash