banks opened this issue 1 year ago
This would be very useful, as rqlite (which uses this library) can also encounter long start-up times. Knowing which of the logs not contained in the latest snapshot are actually committed would solve this problem, as I could optimize replay at start-up.
Is there any way to detect which are committed at this time? I may modify my own fork of Raft if needed, until this repo makes your suggested improvement (or something like it). It doesn't have to be perfect; even knowing, say, 75% of which logs were definitely committed would help. So calling SetCommitIndex even every so often would help.
Though I might just add a goroutine to my code which polls raft.AppliedIndex() every few seconds and writes it to disk. It doesn't have to be up-to-date, just enough to make a difference to start-up times, but primarily to make sure I don't go too far in the Raft log during replay and apply uncommitted entries.
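A minimal sketch of that poller (the interval, file path, and plain-file storage are placeholders for whatever durable store is convenient):

```go
package persist

import (
	"log"
	"os"
	"strconv"
	"time"

	"github.com/hashicorp/raft"
)

// persistAppliedIndex periodically writes r.AppliedIndex() to path so that a
// restart can tell how far into the log it is definitely safe to replay.
// A stale value only costs extra catch-up work; it never over-reports,
// because AppliedIndex() covers only entries already applied to the FSM.
func persistAppliedIndex(r *raft.Raft, path string, done <-chan struct{}) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			idx := r.AppliedIndex()
			// Best-effort, non-atomic write; good enough for a sketch.
			if err := os.WriteFile(path, []byte(strconv.FormatUint(idx, 10)), 0o644); err != nil {
				log.Printf("persist applied index: %v", err)
			}
		}
	}
}
```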
@otoolep sorry I didn't see your comments here until now.
Yeah, what you proposed would work too, either inside the library or from the outside. It would be easy enough to trade off how stale things might be on startup against how frequently you write to disk; even writing to disk once per second would have minimal impact on modern SSDs while still bounding the staleness after startup to just one second.
If you do try this, please update here so we can hear about your experience. I'd still like the library to handle this eventually, and it's not that difficult to do, but I have no idea when we'd be able to prioritize it.
Yes, that's what I did for a while -- added a goroutine that wrote the Applied Index to BoltDB every few seconds. Then, at start-up, rqlite put its SQLite database in a "fast-update" mode while applying logs I knew for a fact had been applied (using the index retrieved from BoltDB at startup). Once that was done, rqlite changed its SQLite instance back to a more conservative update strategy (which results in slower writes, however).
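In sketch form, that bounded replay at start-up looked roughly like this (identifiers are illustrative, not rqlite's actual code):

```go
package startup

import "github.com/hashicorp/raft"

// replayKnownApplied applies log entries from the store to the FSM, but only
// up to savedIndex, the last applied index persisted before shutdown, so no
// uncommitted entries are ever applied. first would typically be the index
// just after the last snapshot.
func replayKnownApplied(logs raft.LogStore, fsm raft.FSM, first, savedIndex uint64) error {
	for i := first; i <= savedIndex; i++ {
		var entry raft.Log
		if err := logs.GetLog(i, &entry); err != nil {
			return err
		}
		if entry.Type == raft.LogCommand {
			fsm.Apply(&entry) // index <= savedIndex, so it was committed and applied before
		}
	}
	return nil
}
```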
However, I've since backed out the change, and rqlite now just runs SQLite in the "fast update" mode (`PRAGMA synchronous=OFF`) all the time. I did this because the SQLite docs state that the worst that could happen if rqlite were to crash with SQLite in this mode is a corrupt SQLite database. But since rqlite deletes and rebuilds the entire DB from scratch on restart, this potential for corruption is a non-issue. So I've got fast start-up times now in all circumstances.
TBH I should have run SQLite with `PRAGMA synchronous=OFF` from the very start of my project's development, but I wasn't aware of it at the time.
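For reference, setting that pragma from Go can look like the following; the `mattn/go-sqlite3` driver here is only for illustration, and rqlite's own SQLite layer differs:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // driver choice is illustrative
)

func main() {
	db, err := sql.Open("sqlite3", "state.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// Disable fsync on every transaction. A crash may corrupt the database,
	// which is acceptable here because it is rebuilt from the Raft log anyway.
	if _, err := db.Exec("PRAGMA synchronous=OFF"); err != nil {
		log.Fatal(err)
	}
}
```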
@banks Hi, I'm interested in this and can help with it.
- Introduce a new optional interface called something like:

  ```go
  type CommitTrackingLogStore interface {
      SetCommitIndex(idx uint64) error
      ReadCommitIndex() (uint64, error)
  }
  ```

- Have Raft detect that the `LogStore` also implements this. Just before calling `StoreLogs` during replication or the leader's `dispatchLogs`, if it does, call `SetCommitIndex` first with the most recent commit index we know.
- Implementations may store that index in memory until the next `StoreLogs` call and then write it out to disk along with the log data (see the sketch after this list).
- When the `LogStore` recovers its state on open, it would recover the last persisted commit index and load it into memory.
- When Raft starts up it can restore the snapshot like now, then:
  - Check if the `LogStore` implements `CommitTrackingLogStore`. If it does, load the commit index from the store and replay all logs from `lastSnapshotIndex` up to and including the persisted commit index, before completing startup.
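A rough sketch of how a store might implement that interface by wrapping an existing `LogStore`. The sidecar-file persistence and all names are placeholders; a real implementation (for example in raft-boltdb or raft-wal) would write the index in the same transaction/fsync as the log batch:

```go
package store

import (
	"encoding/binary"
	"os"
	"sync"

	"github.com/hashicorp/raft"
)

// commitTrackingStore buffers the commit index in memory via SetCommitIndex
// and flushes it on the next StoreLogs call, as described in the proposal.
type commitTrackingStore struct {
	mu           sync.Mutex
	inner        raft.LogStore // the real log storage, delegated to below
	indexPath    string        // sidecar file holding the last persisted commit index
	pendingIdx   uint64
	persistedIdx uint64
}

func (s *commitTrackingStore) SetCommitIndex(idx uint64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.pendingIdx = idx // kept in memory until the next StoreLogs
	return nil
}

func (s *commitTrackingStore) ReadCommitIndex() (uint64, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.persistedIdx, nil
}

func (s *commitTrackingStore) StoreLogs(logs []*raft.Log) error {
	s.mu.Lock()
	idx := s.pendingIdx
	s.mu.Unlock()

	if err := s.inner.StoreLogs(logs); err != nil {
		return err
	}
	// Persist the commit index we were told about before this batch. A
	// separate file write is just for the sketch; ideally this shares the
	// log batch's transaction so it costs no extra disk write.
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], idx)
	if err := os.WriteFile(s.indexPath, buf[:], 0o644); err != nil {
		return err
	}
	s.mu.Lock()
	s.persistedIdx = idx
	s.mu.Unlock()
	return nil
}

// The remaining LogStore methods simply delegate to the wrapped store.
func (s *commitTrackingStore) FirstIndex() (uint64, error)        { return s.inner.FirstIndex() }
func (s *commitTrackingStore) LastIndex() (uint64, error)         { return s.inner.LastIndex() }
func (s *commitTrackingStore) GetLog(i uint64, l *raft.Log) error { return s.inner.GetLog(i, l) }
func (s *commitTrackingStore) StoreLog(l *raft.Log) error         { return s.inner.StoreLog(l) }
func (s *commitTrackingStore) DeleteRange(min, max uint64) error  { return s.inner.DeleteRange(min, max) }
```

The important property is that `ReadCommitIndex` never returns an index newer than what has actually been made durable alongside the logs.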
Maybe we can add a feature flag called --fast-recover
and some warning to show the impact; then in step 2 and step 6 the conditions would also check whether the flag is enabled, and if both are OK, then...
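Something like the following, where the flag name, package layout, and the `recoverCommitIndex` helper are all placeholders for wherever this actually gets wired up:

```go
package recovery

import (
	"flag"

	"github.com/hashicorp/raft"
)

// --fast-recover is just the flag name suggested above; whether it ends up as
// a CLI flag or a field on raft.Config is still open.
var fastRecover = flag.Bool("fast-recover", false,
	"replay committed logs from the LogStore on startup (experimental)")

// CommitTrackingLogStore is the optional interface from the proposal above.
type CommitTrackingLogStore interface {
	SetCommitIndex(idx uint64) error
	ReadCommitIndex() (uint64, error)
}

// recoverCommitIndex is a placeholder for the startup check: only use the
// persisted commit index when the flag is on and the store supports it.
func recoverCommitIndex(store raft.LogStore) (uint64, bool) {
	cts, ok := store.(CommitTrackingLogStore)
	if !*fastRecover || !ok {
		return 0, false // fall back to today's snapshot-only behaviour
	}
	idx, err := cts.ReadCommitIndex()
	if err != nil {
		return 0, false
	}
	return idx, true
}
```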
@lalalalatt great! Yep, a feature flag (or optional config in this library that apps can enable with a flag) would be a great step toward making it safer to roll out a larger change like this.
If you'd like to work on it, that's awesome. I'd like to see it done. I do want to be transparent that I'm not sure how quickly we could commit to reviewing or merging. Large changes to raft are always full of risk and need close attention even when they are conceptually simple, and it can be hard to prioritize a relatively large amount of review and testing for a PR that isn't directly impacting users much right now. That's not to say we're not interested, but just hoping to be up-front rather than let you do a lot of work and not be able to get it merged any time soon!
If you would still like to work on this, I suggest breaking it down into bits and proposing (ideally before spending lots of time writing) a few different PRs that we can review individually as a branch -- you'll need to PR changes to the LogStore implementations separately anyway.
I think we could contribute to https://github.com/hashicorp/raft-boltdb after finishing all milestones in this issue.
Today I realised my mental model of how our Raft library works is subtly wrong.
I assumed that when a node restarts, it loads the most recent snapshot into the FSM and replays committed logs from disk into the FSM before rejoining the cluster.
In fact, we only load from the snapshot, since we currently don't persist any information about which logs are known to be committed, which means that replaying them all might apply a few uncommitted ones from the end of the log.
In practice this works OK in most cases because, if the node is elected leader, it will write a new log which, as soon as it's committed, will trigger applying all the other committed logs into the FSM too. If it's a follower, it doesn't catch up its FSM until the first `AppendEntries` RPC from the leader tells it the current commit index in the cluster.

As we consider reducing snapshot frequency, with the WAL LogStore being able to store many more logs performantly, this will become more of an issue, because it will mean followers are much more "stale" when they start serving requests.
To fix this correctly, we'd need to persist the last known commit index so that we know which prefix of the logs it's safe to apply on startup. We could periodically persist that to the `StableStore`, but that increases the number of disk transactions and trades that off against how recent the index stays. Currently `StableStore` performance is not an issue because it's only updated infrequently during elections; this would make it much more performance-critical. (A rough sketch of this periodic-persist option appears after the list below.)

An alternative that would be more efficient, but a little more complicated, would be:
- Introduce a new optional interface, called something like `CommitTrackingLogStore`, with `SetCommitIndex(idx uint64) error` and `ReadCommitIndex() (uint64, error)` methods.
- Have Raft detect that the `LogStore` also implements this. Just before calling `StoreLogs` during replication or the leader's `dispatchLogs`, if it does, call `SetCommitIndex` first with the most recent commit index we know.
- Implementations may store that index in memory until the next `StoreLogs` call and then write it out to disk along with the log data.
- When the `LogStore` recovers its state on open, it would recover the last persisted commit index and load it into memory.
- When Raft starts up it can restore the snapshot like now, then:
  - Check if the `LogStore` implements `CommitTrackingLogStore`. If it does, load the commit index from the store and replay all logs from `lastSnapshotIndex` up to and including the persisted commit index, before completing startup.
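For comparison, a rough sketch of the simpler (but heavier on disk) periodic-persist option mentioned above, writing the commit index to the `StableStore` on a timer; the key name, interval, and the `commitIndex` accessor are placeholders:

```go
package stable

import (
	"log"
	"time"

	"github.com/hashicorp/raft"
)

var commitIndexKey = []byte("LastKnownCommitIndex")

// persistCommitIndex periodically writes the current commit index to the
// StableStore. commitIndex stands in for whatever accessor exposes the value
// (from outside the library, AppliedIndex() is the closest safe equivalent).
func persistCommitIndex(stable raft.StableStore, commitIndex func() uint64, stop <-chan struct{}) {
	t := time.NewTicker(time.Second)
	defer t.Stop()
	for {
		select {
		case <-stop:
			return
		case <-t.C:
			// One extra StableStore transaction per tick: this is the
			// disk-traffic vs. staleness trade-off discussed above.
			if err := stable.SetUint64(commitIndexKey, commitIndex()); err != nil {
				log.Printf("persist commit index: %v", err) // stale is still safe, so just log
			}
		}
	}
}
```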