goraft / raft

UNMAINTAINED: A Go implementation of the Raft distributed consensus protocol.

Command#Apply is passed inconsistent context information. #239

Open · jrallison opened this issue 10 years ago

jrallison commented 10 years ago

Hey all,

I'm using raft for a project, and using context.CurrentIndex() in my command's Apply function for some bookkeeping.
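
For reference, here's roughly what such a command looks like (a simplified sketch; WriteCommand and its fields are stand-ins, and the Apply(raft.Context) shape matches goraft's CommandApply interface, if I have it right):

```go
package sketch

import (
	"fmt"

	"github.com/goraft/raft"
)

// WriteCommand is a hypothetical stand-in for a real command. Its Apply
// method follows the shape of goraft's CommandApply interface.
type WriteCommand struct {
	Key   string `json:"key"`
	Value string `json:"value"`
}

func (c *WriteCommand) CommandName() string {
	return "write"
}

func (c *WriteCommand) Apply(ctx raft.Context) (interface{}, error) {
	// Bookkeeping: this is where the index of the entry currently being
	// applied matters. The context dumps quoted below come from a line
	// like this one.
	fmt.Printf("Term: %d CurrentIndex: %d CommitIndex: %d\n",
		ctx.CurrentTerm(), ctx.CurrentIndex(), ctx.CommitIndex())
	// ... record Key/Value in the state machine ...
	return nil, nil
}
```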

I noticed that, in different circumstances, context.CurrentIndex() is inconsistent. For instance, committing 3 entries in succession against a new 1-node cluster results in the following contexts being passed to the Apply function:

Term: 0 CurrentIndex: 3 CommitIndex: 3
Term: 0 CurrentIndex: 4 CommitIndex: 4
Term: 0 CurrentIndex: 5 CommitIndex: 5

Killing and restarting the node will force it to recover from the log. While recovering, we receive the following contexts as the node replays the committed entries:

Term: 0 CurrentIndex: 5 CommitIndex: 3
Term: 0 CurrentIndex: 5 CommitIndex: 4
Term: 0 CurrentIndex: 5 CommitIndex: 5

This is due to how the log's internalCurrentIndex method is computed.

I've added a failing test, and adjusted the code so that context.CurrentIndex() always returns the index of the entry currently being applied to the state machine.
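
In rough terms, the cause and the change look like this (a paraphrase with illustrative types, not the exact goraft source or the literal diff):

```go
// Paraphrased sketch of the cause and the fix; the types here are
// illustrative, not the actual goraft definitions.
package sketch

type LogEntry struct {
	Index uint64
	Term  uint64
}

type Log struct {
	startIndex  uint64
	commitIndex uint64
	entries     []*LogEntry
}

// Cause: the current index is taken from the last entry held in memory.
// During recovery the whole log file is loaded before any entry is
// replayed, so this already reports the final index (5 in the output
// above) while earlier entries are still being applied.
func (l *Log) internalCurrentIndex() uint64 {
	if len(l.entries) == 0 {
		return l.startIndex
	}
	return l.entries[len(l.entries)-1].Index
}

// Fix: build the context from the entry's own index, so Apply sees the
// same CurrentIndex whether the entry arrives live or is replayed.
func (l *Log) contextFor(e *LogEntry) (term, currentIndex, commitIndex uint64) {
	return e.Term, e.Index, l.commitIndex
}
```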

Let me know if anything here is unclear, or if I've misunderstood the meaning of context.CurrentIndex().

jrallison commented 10 years ago

Note that the CommitIndex value isn't usable either, as a new node added to the cluster will receive the following contexts as it receives committed entries from the existing node:

Term: 0 CurrentIndex: 3 CommitIndex: 5
Term: 0 CurrentIndex: 4 CommitIndex: 5
Term: 0 CurrentIndex: 5 CommitIndex: 5

That behaviour seems to make sense. The change in this pull request unifies the context information so that it's consistent in all cases.

otoolep commented 10 years ago

I don't know for sure either, but it's not clear to me why you think this is wrong behaviour. Couldn't CurrentIndex be interpreted as the next place a log entry will be written, with the recovery case you outline being an exceptional one where CurrentIndex is quite a bit ahead of where the next log entry will be written? And once recovery is complete, everything is OK?

Does this behaviour prevent you from building what you need to build?

jrallison commented 10 years ago

Hey @otoolep, it does, as I cannot safely know which commit I'm currently applying against my state machine. I'm using raft to manage consensus for some external data rather than just an in-memory state machine, and knowing whether or not I've already processed a given commit in the past is important for correctness.

At any given time, I'm applying a single commit to the state machine on a given node. With the current implementation, a specific entry can be applied (or re-applied in the case of a node restart, etc.) with totally different CurrentIndex and CommitIndex values.

CommitIndex is the one that seems to be the next place a new log entry will be written from the perspective of a given node, if I understand correctly.

At the very least, this makes it difficult to program against, or use in any significant fashion, as there doesn't seem to be any firm definition of what to expect from CurrentIndex.

otoolep commented 10 years ago

Hang on -- why are you worried about whether a particular commit has been applied in the past? Shouldn't a recovery commence with restoring the data source under that node from a snapshot, and then applying any outstanding log entries? Is it not possible to wipe out the data source, and then apply a snapshot (or simply replay the entire log)?

I know this doesn't directly address your original question, but I am not sure I follow why you have an issue in the first place. Perhaps you can't snap-and-restore your external data source? I do exactly this with rqlite and it works fine. I am interested in learning more about your application, to see if I can help.
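
For reference, the snap-and-restore pattern being described looks roughly like this, shaped like goraft's StateMachine interface (Save/Recovery); dataSource is a hypothetical stand-in for the external data store:

```go
package sketch

// dataSource is a hypothetical stand-in for the node's external data.
// The key property is that Save captures the complete state, so a
// restore can start from a clean slate.
type dataSource struct {
	state []byte // stand-in for the node's complete data
}

// Save returns the complete state for the snapshot.
func (ds *dataSource) Save() ([]byte, error) {
	out := make([]byte, len(ds.state))
	copy(out, ds.state)
	return out, nil
}

// Recovery wipes the data source and restores it from the snapshot;
// goraft then replays any outstanding log entries through Apply.
func (ds *dataSource) Recovery(b []byte) error {
	ds.state = make([]byte, len(b))
	copy(ds.state, b)
	return nil
}
```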

jrallison commented 10 years ago

We're managing consensus of insert ordering for a time-based stream of data that grows without bound, and doesn't fit in memory (think terabytes of data). We're using raft to manage the "tail" of the stream, so that each node can be confident that it is writing a consistent ordering of the data.

We do use snap-and-restore to essentially truncate the raft log at certain points. Since our data is immutable once written, at regular intervals we record the current index, take a snapshot (which contains the current index applied, etc.), and archive the block of data written since the last snapshot.

In the case of a new node added to the cluster, or a node being down for multiple snapshots, the nodes have ways to restore missing archived blocks outside of raft (think something like https://github.com/golang/groupcache).

The issue is that, since we're marking the current index for these snapshots and have no safe way of knowing which index we've just applied, we end up with the potential for duplicated data when a node is recovering, causing inconsistency in the underlying data between nodes.

If I'm thinking about this wrong, or using raft incorrectly, I'd love to hear your thoughts.

In our testing, simply being able to keep track of the last index we've applied to our underlying data is enough to prevent this duplication.
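
Concretely, the guard looks something like this (a sketch; stream is a hypothetical stand-in for our unbounded external store, and lastApplied is recorded in each snapshot so it survives restarts). It is only safe if Apply is handed the true index of the entry being applied, which is what this issue is about:

```go
package sketch

import "sync"

// stream is a hypothetical stand-in for the external, unbounded store.
type stream struct {
	mu          sync.Mutex
	lastApplied uint64 // highest log index already written to the store
}

func (s *stream) applyAt(index uint64, data []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if index <= s.lastApplied {
		// Already written (e.g. a replay after a restart): skip to
		// avoid duplicating data in the external store.
		return nil
	}
	// ... append data to the external store ...
	s.lastApplied = index
	return nil
}
```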

otoolep commented 10 years ago

OK, I think I follow you. In general, if a snapshot of your system does not contain the complete state, I think I see how you could encounter problems. I need to think about it a bit more to be sure.

But it's not 100% clear to me from the source if there is a bug. It may just be the way that variable is meant to work.

jrallison commented 10 years ago

Thanks for the feedback. Perhaps another approach would be to add a new context field with the... index of the current command being applied.

Would love to get a better understanding of the intended meaning of the CurrentIndex/CommitIndex information.
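
For instance, something along these lines (purely hypothetical; AppliedIndex does not exist in goraft today, and only the accessors visible in the output above are shown):

```go
package sketch

// Hypothetical extension of the context passed to Apply.
type Context interface {
	CurrentTerm() uint64
	CurrentIndex() uint64
	CommitIndex() uint64

	// AppliedIndex would return the index of the entry currently being
	// passed to Apply, regardless of recovery or catch-up state.
	AppliedIndex() uint64
}
```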

otoolep commented 10 years ago

Yeah, @jrallison, I think enhancing the information passed to Apply() might be something worth trying.