goraft / raft

UNMAINTAINED: A Go implementation of the Raft distributed consensus protocol.
MIT License

Fix file issue #150

Closed · xiang90 closed this 10 years ago

xiang90 commented 10 years ago

@ongardie For our raft implementation, we fsync every time a follower appends log entries, but we do not fsync every time the leader receives a command from a client. My observation is that the leader does not need to fsync before actually committing, because the leader decides when to commit a log entry and can therefore safely assume it has the entry whether or not the log has been fsynced. Fsyncing every time the leader receives a command would hurt performance significantly. Do you think this observation is correct?
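
For illustration, here is a minimal sketch of the policy being described, assuming a simple file-backed log; the `Log` type and the `appendEntry`/`sync`/`followerAppend`/`leaderAppend` names are hypothetical, not goraft's actual API:

```go
// Hypothetical sketch: followers fsync on every append, the leader
// defers its fsync until it is about to commit.
package raftsketch

import "os"

// Log is a simplified append-only log backed by a single file.
type Log struct {
	file *os.File
}

// appendEntry writes an entry without forcing it to disk.
func (l *Log) appendEntry(data []byte) error {
	_, err := l.file.Write(data)
	return err
}

// sync flushes previously appended entries to stable storage.
func (l *Log) sync() error {
	return l.file.Sync()
}

// followerAppend: a follower appends and fsyncs before acknowledging,
// so its ack can safely count toward the commit majority.
func followerAppend(l *Log, entries [][]byte) error {
	for _, e := range entries {
		if err := l.appendEntry(e); err != nil {
			return err
		}
	}
	return l.sync()
}

// leaderAppend: the leader only appends when it receives a client
// command; the fsync is deferred.
func leaderAppend(l *Log, entry []byte) error {
	return l.appendEntry(entry)
}

// leaderCommit: one fsync covers every entry appended since the last
// commit, just before the leader advances its commit index.
func leaderCommit(l *Log) error {
	return l.sync()
}
```

The point being debated is that only the follower path pays an fsync per request, while the leader defers its fsync to the commit step, where it can cover a batch of entries.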

/cc @baruch @benbjohnson

benbjohnson commented 10 years ago

@xiangli-cmu I don't think I quite understand the PR comment. Do you want to remove the Sync() on the log when the leader is committing a command?

xiang90 commented 10 years ago

@benbjohnson

  1. util.go and the use of rename to replace the log are cherry-picked from @baruch's pull request (a rough sketch of that technique is below).
  2. Actually, we only fsync after a follower appends its entries; we do not fsync when the leader appends its entries. I do not think the leader needs to fsync on every append, only before committing entries, so it can fsync them in a batch.
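
For reference, a rough sketch of that write-temp-file-then-rename approach in modern Go; the helper name writeFileSynced is borrowed from later in this thread, and the exact signature here is an assumption rather than the PR's actual code:

```go
// Hypothetical sketch of an atomic "write then rename" file replace.
package raftsketch

import (
	"os"
	"path/filepath"
)

// writeFileSynced writes data to a temporary file, fsyncs it, and then
// atomically renames it over the target path, so a crash never leaves
// the target half-written.
func writeFileSynced(path string, data []byte, perm os.FileMode) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // harmless no-op once the rename succeeds

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil { // force the contents to disk first
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	if err := os.Chmod(tmp.Name(), perm); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path) // atomic replace on POSIX filesystems
}
```
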
benbjohnson commented 10 years ago

@xiangli-cmu Technically I think the leader is supposed to fsync before it sends to the followers, although you might be right: if the entry hasn't been committed by the leader then it's not externally visible, so it might be ok if it gets lost. It'll be committed by the next leader.

You still need to fsync the followers after each append, so this might just end up making the followers a lot slower.

I'm ok with adding configurable options for fsync. Some people may want to trade off a little safety for a hefty performance boost. It should be strict by default though.
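
A hedged sketch of what such an option might look like, with strict syncing as the zero-value default; the `Config`/`NoSync`/`SyncLog` names are hypothetical, not an existing goraft knob:

```go
// Hypothetical sketch of a configurable fsync option; strict syncing is
// the zero-value default and NoSync has to be opted into explicitly.
package raftsketch

import "os"

// Config holds durability-related options.
type Config struct {
	// NoSync skips the fsync after log appends, trading crash safety
	// for throughput. Defaults to false (strict).
	NoSync bool
}

// SyncLog is an append-only log whose durability is driven by Config.
type SyncLog struct {
	file *os.File
	cfg  Config
}

// Append writes an entry and, unless NoSync is set, forces it to disk.
func (l *SyncLog) Append(entry []byte) error {
	if _, err := l.file.Write(entry); err != nil {
		return err
	}
	if l.cfg.NoSync {
		return nil // caller accepts possible loss on power failure
	}
	return l.file.Sync()
}
```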

xiang90 commented 10 years ago

@benbjohnson It should be safe, since we only need to make sure that, at the commit point, the log entries are safely stored on a majority of the nodes in the cluster.

The thing is that

  1. We send entries to followers in batches, so under high traffic a follower will fsync its AppendEntries to disk in a batch. This should not hurt performance too much.
  2. If the leader fsyncs before sending to followers, we are still going to fsync multiple times.

My hope is to do the heartbeat in a single goroutine and the broadcast in multiple goroutines. That way, we can:

  1. Simplify the current approach.
  2. Find a way to reuse the serialized results, since most followers should be at the same pace.
  3. Do an fsync on the leader once per heartbeat (sketched below).
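
A sketch of that third point; the loop structure, interval, and names here are illustrative assumptions, not goraft's actual heartbeat code:

```go
// Hypothetical sketch: one fsync per heartbeat tick, then broadcast.
package raftsketch

import (
	"os"
	"time"
)

// HeartbeatLoop runs in a single goroutine. On every tick it first makes
// the leader's recent appends durable with one batched fsync, then fans
// the heartbeat / AppendEntries RPCs out to the followers.
func HeartbeatLoop(logFile *os.File, interval time.Duration, broadcast func(), stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ticker.C:
			// One fsync covers every entry appended since the last
			// tick, so its cost is amortized across the whole batch.
			if err := logFile.Sync(); err != nil {
				return // a real implementation would surface this as a storage error
			}
			broadcast() // e.g. send AppendEntries to each follower in its own goroutine
		case <-stop:
			return
		}
	}
}
```
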
philips commented 10 years ago

@benbjohnson +1 on making fsync something we can disable by default.

Perhaps I am being naive, but fsync only helps protect against data loss when the full cluster goes down, right? In the normal case, where the current leader has a power failure but a majority of the cluster stays up, it doesn't matter whether goraft is fsyncing, because the cluster will have moved on and the correctness of the leader's log is irrelevant once it rejoins.

Just trying to wrap my head around why we are fsyncing the log at all.

ongardie commented 10 years ago

I haven't thought much about this, but I think @xiangli-cmu is right that until the leader externalizes entries (by telling followers or clients that they're committed), it's not required to retain them. As far as Raft is concerned, it's ok to lose uncommitted entries.

OTOH, I think @philips's comment takes it too far -- I don't think you can get rid of all fsyncs safely. If an entry /is/ committed, then you have to maintain the invariant that a majority of the servers stores the entry. Otherwise, there's no way to guarantee that they'll be available when it comes to electing a new leader.

xiang90 commented 10 years ago

@philips As @ongardie suggests, if we disable fsync then we rely entirely on the OS page cache. Both a power failure on the leader and a power failure of the whole cluster might cause us to lose committed data.

benbjohnson commented 10 years ago

@xiangli-cmu lgtm. Can you clean up writeFileSynced() and go ahead and merge?

xiang90 commented 10 years ago

@benbjohnson Will do.