sofastack / sofa-jraft

A production-grade Java implementation of the RAFT consensus algorithm.
https://www.sofastack.tech/projects/sofa-jraft/
Apache License 2.0

Leader election with 2 nodes #1000

Closed asad-awadia closed 1 year ago

asad-awadia commented 1 year ago

If we have 2 nodes in a raft cluster and the new node [node 2] goes down - can the new node then become the leader when it comes back up, even though it has fewer transactions than the existing node [node 1]?

Since there are only 2 nodes - is there a guarantee that the node that has the highest transaction id will always be elected the leader in case of a node restarting?

We are seeing

  1. Existing 1 node cluster with node 1 as the leader
  2. New node 2 gets added
  3. Node 1 and 2 are healthy
  4. Node 2 goes down
  5. Node 1 is in candidate state
  6. Node 2 comes back
  7. Node 2 becomes leader

Even though node 2's transaction ID < node 1's transaction ID

Shouldn't node 1 always be elected the leader?


fengjiachun commented 1 year ago

Hi, you may need to read the Raft paper first; once you have a clear understanding of Raft, this problem will solve itself.

asad-awadia commented 1 year ago

@fengjiachun thanks for the non-answer

we are scaling up to 3 nodes - but servers are added 1 at a time - so this is a case that can happen often

asad-awadia commented 1 year ago

Just to be clear, I am saying that this is what we are seeing with SOFAJRaft - where node 2, with fewer transactions, is becoming the leader.

fengjiachun commented 1 year ago

In a Raft cluster, the election of a leader does not depend on the number of transactions a node has processed, but rather on the term and the log completeness.

Here's a simplified explanation of how leader election works in Raft:

Each time a node restarts or a leader node fails, a new term begins. The node that detects the failure will increment its term and start an election.

The node starting the election (the "candidate") requests votes from the other nodes. The candidate includes its term and the index and term of the last entry in its log in the request.

The other nodes grant their vote if they haven't voted yet in this term and if the candidate's log is at least as up-to-date as their own log. "At least as up-to-date" is determined by comparing the index and term of the last entries in the logs. A log with a more recent entry is more up-to-date.

If the candidate receives votes from a majority of the nodes, it becomes the leader.

In your case, when node 2 comes back up, it can become the leader if it starts a new term and its log is at least as up-to-date as the log of node 1. If node 2's log is not as up-to-date, node 1 will not grant its vote and node 1 will remain the leader.

However, in a two-node cluster, it's important to note that if one node goes down, the remaining node cannot make progress because it can never get votes from a majority. This is why it's recommended to have an odd number of nodes in a Raft cluster, typically three or five.
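The majority requirement in the last paragraph can be sketched in plain Java (the class and method names here are made up for illustration, not jraft API): in a 2-node cluster the quorum is 2, so a lone surviving node can never elect itself.

```java
public class QuorumSketch {
    // Majority for a Raft cluster of `clusterSize` voting members.
    static int quorum(int clusterSize) {
        return clusterSize / 2 + 1;
    }

    // A candidate wins only if the votes it holds reach the quorum.
    static boolean canWinElection(int votesReceived, int clusterSize) {
        return votesReceived >= quorum(clusterSize);
    }

    public static void main(String[] args) {
        // 2-node cluster: quorum is 2, so the one live node (holding only
        // its own vote) can never become leader while its peer is down.
        System.out.println(canWinElection(1, 2)); // prints false
        // 3-node cluster: quorum is 2, so one node may fail safely.
        System.out.println(canWinElection(2, 3)); // prints true
    }
}
```

This is why a 2-node cluster is strictly worse for availability than a single node: either node failing halts progress.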

asad-awadia commented 1 year ago

> number of transactions a node has processed, but rather on the term and the log completeness.

yes, that is what i mean by transactions => getNode().getLogManager().getLastLogIndex();

> If node 2's log is not as up-to-date, node 1 will not grant its vote and node 1 will remain the leader.

So this part is not happening

Node 2's getNode().getLogManager().getLastLogIndex(); was 7 and node 1's was 528 - and yet node 2 became the leader

> However, in a two-node cluster, it's important to note that if one node goes down, the remaining node cannot make progress because it can never get votes from a majority.

Yes - we understand - but to get to three we need 2 nodes first since we go 1 at a time

fengjiachun commented 1 year ago

> yes, that is what i mean by transactions => getNode().getLogManager().getLastLogIndex();

Not only LogIndex.

5.4.1: https://raft.github.io/raft.pdf

> Raft determines which of two logs is more up-to-date by comparing the index and term of the last entries in the logs. If the logs have last entries with different terms, then the log with the later term is more up-to-date. If the logs end with the same term, then whichever log is longer is more up-to-date.
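The §5.4.1 rule quoted above can be written out as a small plain-Java sketch (a hypothetical helper, not SOFAJRaft's actual code). Note that it also shows why a much shorter log can still win a vote if its last entry carries a later term:

```java
public class LogUpToDate {
    // Returns true if log A (lastTermA, lastIndexA) is at least as
    // up-to-date as log B (lastTermB, lastIndexB), per Raft §5.4.1.
    static boolean atLeastAsUpToDate(long lastTermA, long lastIndexA,
                                     long lastTermB, long lastIndexB) {
        if (lastTermA != lastTermB) {
            return lastTermA > lastTermB; // later last term wins outright
        }
        return lastIndexA >= lastIndexB; // same term: longer log wins
    }

    public static void main(String[] args) {
        // A candidate whose log is short (index 9) but ends in a *later*
        // term (8) is still "at least as up-to-date" as a long log
        // (index 524) ending in term 7 -- index alone does not decide.
        System.out.println(atLeastAsUpToDate(8, 9, 7, 524)); // prints true
    }
}
```

In other words, `getLastLogIndex()` alone is not enough to predict the election outcome; the term of the last entry is compared first.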

If it's easy to reproduce the situation you described, can you tell me in detail how to reproduce it?

asad-awadia commented 1 year ago

So I should check these two values on the two nodes, `lastLogIndex` and `term`, to see if the new node is incorrectly becoming the leader?

    Node node = raftManager.get().getNode();
    long lastLogIndex = node.getLogManager().getLastLogIndex();
    long term = node.getLogManager().getTerm(lastLogIndex);

> If it's easy to reproduce the situation you described, can you tell me in detail how to reproduce it?

It is easy to reproduce in our prod environment - in a local reproducer i am sure it will work as expected :(

  1. Existing 1 node cluster with node 1 as the leader
  2. New node 2 gets added - add new node via `node.addPeer(peerToBeAdded)`
  3. Node 1 and 2 are healthy
  4. Node 2 goes down / _restarted with a blank disk/ssd_
  5. Node 1 is in candidate state
  6. Node 2 comes back
  7. Node 2 becomes leader

Hence why i am looking for some guidance on where to start debugging this from

Could it be my initial configuration?

How can the new node have a higher term? It just started up - i will check the term values tomorrow

something is wrong here - but not sure where to look for it

fengjiachun commented 1 year ago

> Node 2 goes down/restarted with a blank disk/ssd

What does this mean? Does it mean the data has been deleted? If the data has been deleted, what configuration did you use to start node2? Does it know the peer of node1? If it considers itself the only node, it will become the leader.

For more information, you can send a signal to get jraft's describe info for node2 with `kill -s SIGUSR2 pid`. A file named `node_describe.log` will then be generated in the Java program's working directory.

You can find more info in the jraft user guide under "9. Troubleshooting Tools":

https://www.sofastack.tech/en/projects/sofa-jraft/jraft-user-guide/

shihuili1218 commented 1 year ago

> Node 2 goes down/restarted with a blank disk/ssd
> Node 1 is in candidate state

At this point, the cluster configuration consists of two members, with the majority still being 2. Node 1 cannot process transactions, so the logIndex of Node 1 is the same as before the failure of Node 2.

> Node 2 comes back
> Node 2 becomes leader

At this point, the term and logIndex of Node 1 and Node 2 are the same. The logIndex was explained above; as for the term: SOFAJRaft has pre-vote, so node1 does not increment its term while it cannot win an election.
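The pre-vote point can be illustrated with a toy model (plain Java; this is a simplified sketch of the idea, not SOFAJRaft internals, and the class is hypothetical): without pre-vote, a node that keeps hitting its election timeout bumps its term on every doomed election; with pre-vote, it first probes whether it could win, and leaves its term untouched when it cannot.

```java
public class PreVoteSketch {
    long term;
    final boolean preVoteEnabled;

    PreVoteSketch(long term, boolean preVoteEnabled) {
        this.term = term;
        this.preVoteEnabled = preVoteEnabled;
    }

    // One election-timeout round while the peer is unreachable, so the
    // node can never collect a majority (quorum of 2 in a 2-node cluster).
    void electionTimeout() {
        if (preVoteEnabled) {
            // Pre-vote: probe peers without touching the term; no majority
            // of pre-votes means no real election is started.
            return;
        }
        // Classic Raft: become candidate and increment the term even
        // though the election is doomed to fail.
        term++;
    }

    public static void main(String[] args) {
        PreVoteSketch classic = new PreVoteSketch(7, false);
        PreVoteSketch withPreVote = new PreVoteSketch(7, true);
        for (int i = 0; i < 5; i++) {
            classic.electionTimeout();
            withPreVote.electionTimeout();
        }
        System.out.println(classic.term);     // prints 12
        System.out.println(withPreVote.term); // prints 7
    }
}
```

This is why node1's term did not race ahead while node2 was down, even though node1 sat in the candidate state the whole time.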

asad-awadia commented 1 year ago

> Does it mean the data has been deleted?

Yes

> what configuration did you use to start node2, does it know the peer of node1?

yes - it uses a config of `<node1,node2>`

asad-awadia commented 1 year ago

@fengjiachun `kill -s SIGUSR2 pid` is very useful - here is what I am seeing

node 2's node describe

state: STATE_LEADER
term: 8
conf: ConfigurationEntry [id=LogId [index=4, term=8], conf=node2:8083,node1:8082, oldConf=]

logManager:
  storage: [1, 9]
  diskId: LogId [index=9, term=8]
  appliedId: LogId [index=9, term=8]
  lastSnapshotId: LogId [index=0, term=0]

node 1's describe

state: STATE_FOLLOWER
term: 8
conf: ConfigurationEntry [id=LogId [index=523, term=7], conf=node1:8082,node2:8083, oldConf=]

logManager:
  storage: [1, 524]
  diskId: LogId [index=524, term=7]
  appliedId: LogId [index=524, term=7]
  lastSnapshotId: LogId [index=491, term=4]

So the term of node 2 is higher? Why though? It just started up with 0 data on its disk [this happens while node 1 is in candidate state]

fengjiachun commented 1 year ago

Node2's data was deleted, but node1 still kept some of node2's state (you can see it in the describe file), which caused some inconsistency. The safest approach: if you are just temporarily stopping node2, do not delete its data. If you want to permanently remove node2, use remove_peer.

asad-awadia commented 1 year ago

> but node1 still kept some of node2's state (You can see them from the describe file.)

Can you point to where this is?

fengjiachun commented 1 year ago

> > but node1 still kept some of node2's state (You can see them from the describe file.)
>
> Can you point to where this is?

For example, the following information (you can only view it on node1 while it is still the leader; alternatively, you can view node2's):

replicatorGroup: 
  replicators: ...
  failureReplicators: ...

Finally, to emphasize again: do not delete data; the problems that appear after deleting data are unpredictable. If you want to remove a node, use remove_peer, which is based on the Raft membership change protocol.

asad-awadia commented 1 year ago

> do not delete data

Yes - I understand

Thank you for your help!