Replicator consecutiveErrorTimes enhancement

sofastack / sofa-jraft

A production-grade java implementation of RAFT consensus algorithm.

https://www.sofastack.tech/projects/sofa-jraft/

Apache License 2.0

3.56k stars 1.14k forks source link

Replicator consecutiveErrorTimes enhancement #197

Closed JohnsonYu closed 5 years ago

JohnsonYu commented 5 years ago

Your question

Can we add a consecutiveErrorTimes time check if time is over some config, automatically destory replicator or remove it from ReplicatorGroup ?

Your scenes

Only in election scenario.

Your advice

add new argument in ReplicatorOptions. (default 0: mean uncheck)?
compare value in Replicator::onHeartbeatReturned()
then do remove?

Environment

SOFAJRaft version: 1.2.5
Env configure Apache Maven 3.0.5 (Red Hat 3.0.5-17) Maven home: /usr/share/maven Java version: 1.8.0_212, vendor: Oracle Corporation Java home: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.212.b04-0.el7_6.x86_64/jre Default locale: en_US, platform encoding: UTF-8 OS name: "linux", version: "3.10.0-514.10.2.el7.x86_64", arch: "amd64", family: "unix"

killme2008 commented 5 years ago

Why do you need this feature?

The replicator represents a follower in raft group,if it's disconnected, we should try to connect it forever. If you want to remove a follower ,you can use CliService#removePeer(peer).

JohnsonYu commented 5 years ago

@killme2008 Thanks comments. so my question is: How can I monitoring the cluster change? Does the jraft has interfaces to user to handle node up or down event?

fengjiachun commented 5 years ago

@JohnsonYu I think you can use the following event notifications in combination to monitor the cluster. StateMachine#onLeaderStart StateMachine#onLeaderStop StateMachine#onStartFollowing StateMachine#onStopFollowing

JohnsonYu commented 5 years ago

@fengjiachun I think I got your point. In current, jraft left the strategy to end-user. but this feature maybe regard as an enhancement to the Topology level users. Just open a config to user, when node down over retry times remove it from federation.

fengjiachun commented 5 years ago

@JohnsonYu

Automatic removal of a node may cause the cluster to become unavailable and can't automatically recover, then may cause consistency errors.

I think we shouldn't do the removal work on any raft nodes, we should make decisions outside the raft cluster, either manually or automatically. You can do this with CliService

@killme2008 What is your idea?

killme2008 commented 5 years ago

@fengjiachun I think so, we should not remove a node automatically, it should be triggered by user. Instead, maybe we can provide a ReplicatorStateListener:

interface ReplicatorStateListener {
  void on oStarted();
  void on onError();
  void on onStopped();
}

The users can monitor the replicator state by implementing this interface and register it to the leader.

JohnsonYu commented 5 years ago

I agree with you guys that raft system should let user do removal. But from user prospect, how do they know some node is unreachable at anytime in cluster. How about this way:

I want set a threshold value to the jraft system if node down over any time(such as 10s or retry times whatever), jraft system just raise this event in StateMachine: void onNodeFailureOverRetryTimes(PeerId peer); the jraft system can keep connect it forever as current. But user can handle this event, do remove or other actions through CliService?

Basically all I hoped is jraft system can let me know any node is down and it's over my safety time.

zongtanghu commented 5 years ago

I can try my best to implement this demand. @killme2008 @fengjiachun

fengjiachun commented 5 years ago

@fengjiachun 👌，tks

zongtanghu commented 5 years ago

Hi @JohnsonYu ,you can see this pr(https://github.com/sofastack/sofa-jraft/pull/256).