lni / dragonboat

A feature complete and high performance multi-group Raft library in Go.
Apache License 2.0

Bringing back a dead node with no raft logs folder causes panic #256

Closed: maxpert closed this 1 year ago

maxpert commented 1 year ago

I am using dragonboat in marmot and can now consistently reproduce a crash when bringing a dead node back up for the cluster after the node has lost its raft logs and no snapshots have been taken. See the repro steps below. There is a race condition involved, so it doesn't crash every time, but a couple of attempts will reproduce it.

Dragonboat version

github.com/lni/dragonboat/v3 v3.3.5

Expected behavior

The node should receive the logs and simply replay them to bring it back into the cluster.

Actual behavior

{"level":"info","time":"2022-09-22T07:28:55-07:00","message":"Waiting for cluster to come up..."}
2022-09-22 07:28:56.095419 C | raft: invalid commitTo index 8, lastIndex() 3
panic: invalid commitTo index 8, lastIndex() 3

goroutine 319 [running]:
github.com/lni/goutils/logutil/capnslog.(*PackageLogger).Panicf(0x26?, {0x4bdb758?, 0xc0001a87b0?}, {0xc0008361e0?, 0xc00034bb00?, 0xc000ba7e30?})
        /Users/zohaib/go/pkg/mod/github.com/lni/goutils@v1.3.0/logutil/capnslog/pkg_logger.go:88 +0xbb
github.com/lni/dragonboat/v3/logger.(*capnsLog).Panicf(0xc00034bb00?, {0x4bdb758?, 0x4010a27?}, {0xc0008361e0?, 0x4a6ebe0?, 0xc000ba7f01?})
        /Users/zohaib/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/logger/capnslogger.go:74 +0x26
github.com/lni/dragonboat/v3/logger.(*dragonboatLogger).Panicf(0xc000044210?, {0x4bdb758, 0x29}, {0xc0008361e0, 0x2, 0x2})
        /Users/zohaib/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/logger/logger.go:132 +0x57
github.com/lni/dragonboat/v3/internal/raft.(*entryLog).commitTo(0xc00025a150, 0x8)
        /Users/zohaib/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/internal/raft/logentry.go:328 +0xfe
github.com/lni/dragonboat/v3/internal/raft.(*raft).handleHeartbeatMessage(_, {0x11, 0x4, 0x1, 0x3, 0x3, 0x0, 0x0, 0x8, 0x0, ...})
        /Users/zohaib/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/internal/raft/raft.go:1317 +0x45
github.com/lni/dragonboat/v3/internal/raft.(*raft).handleFollowerHeartbeat(_, {0x11, 0x4, 0x1, 0x3, 0x3, 0x0, 0x0, 0x8, 0x0, ...})
        /Users/zohaib/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/internal/raft/raft.go:1933 +0x85
github.com/lni/dragonboat/v3/internal/raft.defaultHandle(_, {0x11, 0x4, 0x1, 0x3, 0x3, 0x0, 0x0, 0x8, 0x0, ...})
        /Users/zohaib/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/internal/raft/raft.go:2098 +0x95
github.com/lni/dragonboat/v3/internal/raft.(*raft).Handle(_, {0x11, 0x4, 0x1, 0x3, 0x3, 0x0, 0x0, 0x8, 0x0, ...})
        /Users/zohaib/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/internal/raft/raft.go:1483 +0x275
github.com/lni/dragonboat/v3/internal/raft.(*Peer).Handle(_, {0x11, 0x4, 0x1, 0x3, 0x3, 0x0, 0x0, 0x8, 0x0, ...})
        /Users/zohaib/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/internal/raft/peer.go:195 +0x185
github.com/lni/dragonboat/v3.(*node).handleReceivedMessages(0xc0007de000)
        /Users/zohaib/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/node.go:1275 +0x358
github.com/lni/dragonboat/v3.(*node).handleEvents(0xc0007de000)
        /Users/zohaib/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/node.go:1133 +0x73
github.com/lni/dragonboat/v3.(*node).stepNode(_)
        /Users/zohaib/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/node.go:1111 +0x115
github.com/lni/dragonboat/v3.(*engine).processSteps(0xc00012e640, 0xc000ba9da8?, 0xc00086fe38?, 0xc000194cf0, {0x5354260, 0x1?, 0x0}, 0xc000183bc0?)
        /Users/zohaib/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/engine.go:1279 +0x25d
github.com/lni/dragonboat/v3.(*engine).stepWorkerMain(0xc00012e640, 0x4)
        /Users/zohaib/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/engine.go:1215 +0x2be
github.com/lni/dragonboat/v3.newExecEngine.func1()
        /Users/zohaib/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.5/engine.go:1017 +0x68
github.com/lni/goutils/syncutil.(*Stopper).runWorker.func1()
        /Users/zohaib/go/pkg/mod/github.com/lni/goutils@v1.3.0/syncutil/stopper.go:79 +0xc5
created by github.com/lni/goutils/syncutil.(*Stopper).runWorker
        /Users/zohaib/go/pkg/mod/github.com/lni/goutils@v1.3.0/syncutil/stopper.go:74 +0xea

Process finished with the exit code 2

Steps to reproduce the behavior

lni commented 1 year ago

Thanks for bringing up the issue. What you have experienced is a common user mistake, so I'd like to offer a detailed explanation.

In a consensus protocol, logs typically have to be persistently stored before an acknowledgement can be sent to confirm that the log has been successfully replicated onto a node. Such persistency means that once acknowledged, the log must be there as long as the node (with its intended ID) is there. This is required so that whichever node is trying to replicate the log has a strong assurance that the replica exists on the expected remote node, even after reboots. Note that it is totally fine to destroy that node altogether - it is totally ok to have both the node and its data gone.

This is actually explained in doc/devops.md. You may want to read that doc first.

When you manually remove those directories, you violate the above-mentioned assurance. You can also look at this whole issue from a different angle - your node responded to the leader that the replicated logs were all persistently stored, but all logs disappeared after the reboot because they were deleted. From the leader's point of view, it got tricked by that node - the acknowledgement became totally unreliable, as the replicated content can disappear at any time after such an acknowledgement. In short, in computer science terms, you injected a byzantine fault into a typical non-byzantine system.

Dragonboat ensures its data integrity, but as a fully decentralized system, sadly there is nothing it can do to automatically tell whether a node is a fresh new node without any data or a restarted node with all its data deleted. Just think about the simplest deployment, where you have only one replica in the shard (cluster in v3.0's terms).

lni commented 1 year ago

More details on why there is nothing dragonboat can do to detect such a "someone deleted all my data" incident -

For a shard with a single replica, obviously there is nothing you can do once someone can delete all data belonging to the replica. On startup, it is just a node with no data; there is absolutely no way to tell whether there is no data because the node is brand new or because someone deleted all its data.

For a shard with, say, 3 replicas, which is probably the most common setting, let's say there are nodes A, B and C. Assume A is the leader and is replicating logs to B without any issue; this is enough to have all logs committed and applied, as a quorum of two is met (nodes A and B). Let's say C is isolated from A and thus has no knowledge of the committed logs from A. Again, this is totally fine; it is the exact feature we want from raft/paxos. Now let's say A is totally destroyed - the machine and its stored logs are just gone. This is still fine, as we still have all the data on B; on startup, B should be elected as leader and start replicating its logs to C. However, if we manually remove all data stored by B and restart node B, this is going to be a problem - both B and C now know nothing about the committed logs, and both of them would believe that B is a brand new node. There is simply no way to tell that something really bad happened (B's data got removed and its persistent storage assurance got violated).

To largely prevent such issues, your application (not dragonboat) can use a separate storage to record some metadata on whether the node is brand new or not. As described in Paxos Made Live (linked below, see section 5.1), Google did this by placing a marker in GFS (a totally different storage service from where/how the logs were stored) once a node first boots. On restart, if there is no data but there is a marker in GFS, you can tell that something bad happened.

https://static.googleusercontent.com/media/research.google.com/en//archive/paxos_made_live.pdf
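To make this concrete, here is a minimal sketch of such a startup check at the application level. This is not a dragonboat API; the MarkerStore interface and every name below are hypothetical, and the store backing it must live somewhere completely separate from the raft log storage (Google used GFS).

```go
package markercheck

import (
	"fmt"
	"os"
)

// MarkerStore is a hypothetical client for an external storage service that
// is completely independent from where the raft logs are kept.
type MarkerStore interface {
	Exists(key string) (bool, error)
	Put(key string) error
}

// CheckForDataLoss refuses to start a replica that has booted before but now
// has an empty raft log directory - the sign that its data was deleted.
func CheckForDataLoss(store MarkerStore, markerKey, raftLogDir string) error {
	booted, err := store.Exists(markerKey)
	if err != nil {
		return err
	}
	entries, err := os.ReadDir(raftLogDir)
	if err != nil && !os.IsNotExist(err) {
		return err
	}
	if booted && len(entries) == 0 {
		// The marker says this replica existed before, but its raft data is
		// gone: do not rejoin the shard, a human has to intervene.
		return fmt.Errorf("replica %q booted before but has no raft data: possible data loss", markerKey)
	}
	if !booted {
		// First ever boot: place the marker before starting the replica.
		return store.Put(markerKey)
	}
	return nil
}
```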

Dragonboat can't do this automatically - there is no such separate storage known & accessible by dragonboat whatsoever.

The above Google approach largely solves the problem, but it doesn't help when someone can also remove the marker - again, see the examples above. When you assume that someone has the ability to delete all data relevant to a node, there is no way that node can tell whether it is brand new or not.

The bottom line here is dead simple - Raft/Paxos is a non-byzantine system; when a node confirms that certain logs have been persistently stored on that node, such data must be persistently stored on that node. If, for whatever excuses/reasons, nodes are allowed to cheat on such confirmations, your system will be in big trouble.

sprappcom commented 2 weeks ago

@lni I would like to use dragonboat's ondisk example in production. Is it possible to fix it so that a deleted node / starting a new one will automatically recover, based on some sort of production-ready mechanism?

It's quite obvious that you shouldn't need a backup of those logs, since this is a clustered implementation; it's assumed that one node will "eventually" fail with disk corruption.

so the question is:

@lni can you please help make the ondisk example production ready for the failure scenario mentioned, so we can all use it in production without worry? This is an expected working feature for the core of dragonboat.

appreciate this.

lni commented 2 weeks ago

@sprappcom this is NOT a bug, and it is not something anyone can fix.

sprappcom commented 2 weeks ago

@lni are you saying etcd doesn't solve this issue either?

How can this kind of system be used in production if one node's disk gets corrupted?

lni commented 2 weeks ago

Hi @sprappcom

It is not really an issue in the first place; it is how Paxos/Raft is supposed to work. Please read the explanation provided above again, particularly the concrete example (pasted below again for your convenience).

For a shard with, say, 3 replicas, which is probably the most common setting, let's say there are nodes A, B and C. Assume A is the leader and is replicating logs to B without any issue; this is enough to have all logs committed and applied, as a quorum of two is met (nodes A and B). Let's say C is isolated from A and thus has no knowledge of the committed logs from A. Again, this is totally fine; it is the exact feature we want from raft/paxos. Now let's say A is totally destroyed - the machine and its stored logs are just gone. This is still fine, as we still have all the data on B; on startup, B should be elected as leader and start replicating its logs to C. However, if we manually remove all data stored by B and restart node B, this is going to be a problem - both B and C now know nothing about the committed logs, and both of them would believe that B is a brand new node. There is simply no way to tell that something really bad happened (B's data got removed and its persistent storage assurance got violated).

sprappcom commented 2 weeks ago

For a shard with, say, 3 replicas, which is probably the most common setting, let's say there are nodes A, B and C. Assume A is the leader and is replicating logs to B without any issue; this is enough to have all logs committed and applied, as a quorum of two is met (nodes A and B). Let's say C is isolated from A and thus has no knowledge of the committed logs from A. Again, this is totally fine; it is the exact feature we want from raft/paxos. Now let's say A is totally destroyed - the machine and its stored logs are just gone. This is still fine, as we still have all the data on B; on startup, B should be elected as leader and start replicating its logs to C. However, if we manually remove all data stored by B and restart node B, this is going to be a problem - both B and C now know nothing about the committed logs, and both of them would believe that B is a brand new node. There is simply no way to tell that something really bad happened (B's data got removed and its persistent storage assurance got violated).

Node A gets destroyed in a fire.

How do I create D to replace A?

The ondisk example doesn't work this way.

lni commented 2 weeks ago

When replica A gets destroyed in a fire, you just bring up another replica and name it replica D. You can't name the new replica A, because that would confuse all the other replicas - they expect A to have all data up to a certain previously acknowledged point, but the new replica used to replace A doesn't have that data. From the other replicas' point of view, if the new replica calls itself A, then all of A's previous acknowledgements are no longer dependable or trustworthy, and everything gets stuck.
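For illustration, a minimal sketch of that replacement flow using the dragonboat v3 NodeHost API. The shard ID (128), node IDs (A as 1, D as 4), address, nh instances, rc config and state machine factory newSM are all hypothetical:

```go
package ops

import (
	"context"
	"log"
	"time"

	"github.com/lni/dragonboat/v3"
	"github.com/lni/dragonboat/v3/config"
	sm "github.com/lni/dragonboat/v3/statemachine"
)

// replaceDeadReplica runs on any surviving replica (e.g. B or C): it removes
// dead replica A (node ID 1) from shard 128's membership, then adds the new
// replica D (node ID 4) under D's own raft address.
func replaceDeadReplica(nh *dragonboat.NodeHost) {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := nh.SyncRequestDeleteNode(ctx, 128, 1, 0); err != nil {
		log.Fatalf("failed to remove replica A: %v", err)
	}
	if err := nh.SyncRequestAddNode(ctx, 128, 4, "node-d.example.com:63000", 0); err != nil {
		log.Fatalf("failed to add replica D: %v", err)
	}
}

// startReplicaD runs on the new machine hosting D: it joins with nil initial
// members, so the membership is learned from the existing shard rather than
// configured locally. rc must have NodeID set to 4.
func startReplicaD(nh *dragonboat.NodeHost, rc config.Config, newSM sm.CreateOnDiskStateMachineFunc) {
	if err := nh.StartOnDiskCluster(nil, true, newSM, rc); err != nil {
		log.Fatalf("failed to start replica D: %v", err)
	}
}
```

Passing 0 as the configChangeIndex skips the membership-change ordering check, which keeps the sketch simple; a production tool would pass the index obtained from the current membership.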

Please read the Paxos/Raft papers if you want to use this in production; all the details are in there.

sprappcom commented 2 weeks ago

@lni ok, I "get it" now. So we still have to make backups of the replica's replica. Gosh, this really defeats the purpose of this raft/paxos thing.

1. If I save the state of "A" from the initial initialization, can I use that copy for "recovery" after, let's say, 10 years of using it in production?

I thought there must surely be a way to inform B and C that A is destroyed so that A can be "revived" from the ashes.

2. If A and B get destroyed at the same time, I can only create D and E to replace them using C, right?

lni commented 2 weeks ago

@sprappcom

Please read the raft paper and the mentioned doc/devops.md to get a better understanding of the protocol itself.