Open fahadullah opened 8 years ago
Is this still an observed issue on the current version?
Running the same test routine pointed by @fahadullah with some small modifications to adequate to the new API, I've reached the same "log not found" state. At first, I thought there could be a bug on InmemStore DeleteRange(), which is the implementation of LogStore interface used on unit tests. After some time, I've came back to the test definition and I think it's equivocated.
Please correct me if I'm wrong, but the expected log index after a successfully snapshot installation, and entire log truncation, shouldn't be zero? Not 102 as was currently defined.
I encountered same issue on my own implementation.
In the current hashicorp/raft implementation, "log not found" may occur under unreliable network.
So I make some modifications in the appendEntries
RPC handler.
Original:
https://github.com/hashicorp/raft/blob/185ae2ea91f09c04428c4293641db142b130a631/raft.go#L1476-L1502
After:
if args.PrevLogIndex > 0 {
lastIdx, lastTerm := rf.getLastEntry()
lastSnapshotIdx, lastSnapshotTerm := rf.getSnapshot()
var prevLogTerm uint64
if args.PrevLogIndex == lastIdx {
prevLogTerm = lastTerm
if args.PrevLogTerm != prevLogTerm && args.PrevLogIndex == lastSnapshotIdx && args.PrevLogTerm == lastSnapshotTerm {
// We only compact the log range from `firstLogIndex` to `Min(lastLogIndex, lastSnapshotIndex - trailingLogs)`.
// If `trailLogs` is not 0, the log entry on `lastSnapshotIndex` is not included in the log range to be compacted.
// And sometimes the log entry on `lastSnapshotIndex` is the last log entry, but with a different term.
// So we do this additional check
prevLogTerm = lastSnapshotTerm
}
} else if args.PrevLogIndex == lastSnapshotIdx && args.PrevLogTerm == lastSnapshotTerm {
// If there are remaining logs after the snapshot index, the `lastIdx` will be the last log index.
// Without this additional check, even if the log entry on lastSnapshotIdx == args.PrevLogIndex exists semantically;
// However, the log entry on index `lastSnapshotIdx` has been compacted, so we need to use the snapshot term as the term of compacted log on index `lastSnapshotIdx`.
prevLogTerm = lastSnapshotTerm
} else {
var prevLog LogEntry
if err := rf.logs.GetLog(args.PrevLogIndex, &prevLog); err != nil {
rf.logger.Debug("failed to get previous log",
zap.Int("receiver-id", rf.me),
zap.Uint64("previous-index", args.PrevLogIndex),
zap.Uint64("last-index", lastIdx),
zap.Error(err))
reply.NoRetryBackoff = true
return
}
prevLogTerm = prevLog.Term
}
if args.PrevLogTerm != prevLogTerm {
rf.logger.Debug("previous log term mis-match",
zap.Int("sender-id", int(args.LeaderId)),
zap.Int("receiver-id", rf.me),
zap.Uint64("ours", prevLogTerm),
zap.Uint64("remote", args.PrevLogTerm))
reply.NoRetryBackoff = true
// Find the last log that matches the term
commitIdx := rf.getCommitIndex()
conflictTerm := prevLogTerm
var idx uint64
for idx = args.PrevLogIndex - 1; idx >= commitIdx; idx-- {
var log LogEntry
if err := rf.logs.GetLog(idx, &log); err != nil {
rf.logger.Debug("failed to get log",
zap.Int("receiver-id", rf.me),
zap.Uint64("index", idx),
zap.Error(err),
)
return
}
if log.Term != conflictTerm {
break
}
}
reply.LastLog = idx
return
}
}
@slackpad Sorry for disturbing you, I guess you are the maintainer of this repo. Would like to hear your thoughts on this issue, Thx!
https://gist.github.com/fahadullah/e49a4578094f4efd7746
If TrailingLogs is set to 0 and raft node is restarted right after taking a snapshot, it tries to look up the last index but all entries were already compacted. TrailingLogs being 0 should be a valid config though a little inefficient under certain cases.