More raft state related callbacks

lni commented 3 years ago

Provide callbacks on more raft states. Need to check with users to see what they need.

fly3366 commented 3 years ago

Need some hook. eg. OnNodeNumberHasChange/OnLeaderHasChange/OnClusterReady.

SergeyLysanov commented 3 years ago

OnBecameUnreachable(raftio.NodeInfo)
OnBecameAvailable(raftio.NodeInfo)

In other words some API to understand available/unreachable nodes in cluster from RAFT point of view.

intrntbrn commented 2 years ago

There is SendSnapshotStarted and SendSnapshotCompleted, but there is only SnapshotReceived.

Having both events (ReceiveSnapshotStarted and ReceiveSnapshotCompleted) on the receiving end would be appreciated.

intrntbrn commented 2 years ago

Here are some event handlers that i had to implement for my application. The returned strings are keys used to unsubscribe.

subscribeOnReady(clusterID uint64, nodeID uint64, fn NodeReadyFunc)  string
subscribeOnLeader(clusterID uint64, nodeID uint64, fn LeaderUpdatedFunc, oneshot bool) string 
subscribeOnRemoved(clusterID uint64, nodeID uint64, fn UnloadedFunc) string
subscribeOnMembershipChange(clusterID uint64, fn MembershipChangeFunc) string
subscribeOnSnapshotRecovered(clusterID uint64, nodeID uint64, fn SnapshotRecoveredFunc) string
subscribeOnConnectionChange(clusterID uint64, fn ConnectionChangeFunc) string

type LeaderUpdatedFunc func(context.Context)
type NodeReadyFunc func()
type UnloadedFunc func()
type SnapshotRecoveredFunc func()
type ConnectionChangeFunc func(clusterID uint64, nodeID uint64, addr string, connected bool)
type MembershipChangeFunc func(clusterID uint64)

fly3366 commented 2 years ago

Looks go to my project. May add some metric to make tracing more better.

eg. ReceiveQueueLength & SendQueueLength and current capacity for them.

Some hook for warning. like OnMessageDropped& OnMessageRejected for alert.

Cloud33 commented 2 years ago

OnBecameUnreachable(raftio.NodeInfo)
OnBecameAvailable(raftio.NodeInfo)
In other words some API to understand available/unreachable nodes in cluster from RAFT point of view.

This feature is great, knowing the status of all Nodes in the cluster (Unreachable or Available) is very important at times, Raft has a heartbeat mechanism and should be able to know the state of all Nodes relatively simply

fishjam commented 11 months ago

@lni I have checked the code and try to add OnBecameUnreachable/OnBecameAvailable functions. I found the leader will send Heartbeat message to follower , but there is not other follower's active status in Heartbeat message, so follower just know leader's active status, can NOT know other follower's active status. Is it right? and how about this:

add a Actives map[uint64]bool variable in type Message struct , and send it to other raft in sendHeartbeatMessage.
if it's ok, then I will try coding them and create pr.

BTW: I'm using v3.3.8 too. but I don't know how to generate raft.pb.go from raft.proto. I use following command , but there are lot's of difference with your raft.pb.go. could you tell me how to generate it?

   go env -w GO111MODULE=off
   go get github.com/gogo/protobuf
   protoc -I%GOPATH%\src -I%GOPATH%\src\github.com\gogo\protobuf\protobuf -I. --gogofaster_out=plugins=grpc:. raft.proto

example:


func (r *raft) sendHeartbeatMessage(to uint64,
    hint pb.SystemCtx, match uint64, actives map[uint64]bool) {
    commit := min(match, r.log.committed)
    r.send(pb.Message{
        To:       to,
        Type:     pb.Heartbeat,
        Commit:   commit,
        Hint:     hint.Low,
        HintHigh: hint.High,
        Actives:  actives,     // it's id => active from raft.remotes( maybe with nonVotings and  witnesses ?)
    })
}

func (r *raft) broadcastHeartbeatMessageWithHint(ctx pb.SystemCtx) {
    zeroCtx := pb.SystemCtx{}
    actives := make(map[uint64]bool)      // set the actives value
    for id, rm := range r.remotes {
        actives[id] = rm.active
    }
    for id, rm := range r.votingMembers() {
        if id != r.replicaID {
            r.sendHeartbeatMessage(id, ctx, rm.match, actives)
        }
    }
    if ctx == zeroCtx {
        for id, rm := range r.nonVotings {
            r.sendHeartbeatMessage(id, zeroCtx, rm.match, actives)
        }
    }
}

func (r *raft) handleHeartbeatMessage(m pb.Message){
  //handle HeartbeatResp to set raft's active status, and raise event
}

lni commented 11 months ago

Can you first explain why you want those OnBecameUnreachable/OnBecameAvailable in the protocol implementation itself not in your application layer?

@lni I have checked the code and try to add OnBecameUnreachable/OnBecameAvailable functions. I found the leader will send Heartbeat message to follower , but there is not other follower's active status in Heartbeat message, so follower just know leader's active status, can NOT know other follower's active status. Is it right?

that is correct.

fishjam commented 11 months ago

I know you dont' want add it in dragonboat's raft. example:

Availability statuses #235
How can I know the status of group #212
[Help]Listening for cluster changes #215
etc ..

But since dragonboat is a multi-raft framework running on multi servers. It's very important to know the cluster status for application developer(I think that's why lots of people ask similar question for it). And dragonboat already use heartbeat to monitor raft cluster, it should be easy to add this feature. Now application have to implement another set of heartbeat protocols to know each raft node's health status. Seems it's ugly solution. not so multi-raft.

BTW: if application can know raft node OnBecameUnreachable , then maybe can send alert message or email to notify maintenance persons, so they can check it asap.

Could you please think about it again?

lni commented 11 months ago

Applications don't have to use heartbeat, they are free to use whatever cool tech they want. As explained in those issues, there is simply no way to determine whether a server is down, it has to be implemented by the application.

A consensus library is just a consensus library. It doesn't monitor your resources, it doesn't alter your devops guy when certain server is down. When there is certain code in the consensus library that can probably does 20% of what you want for resource monitoring, it doesn't means it is a good idea to cut corners to abuse those existing code.

It is horrible idea to let your consensus library to do resource monitoring. It means one thing and one thing only - the devops team dropped the ball.

lni commented 11 months ago

Let me provide a concrete example on why it is a horrible idea to cut corners to abuse the heartbeat mechanism in dragonboat -

consider you have a deployment of many servers and each of them is used for hosting many Raft replicas. Obviously, they can be overloaded at certain stage, that is the reason why people do split and move replicas around. When overloaded, surely it is okay for dragonboat to randomly drop certain messages, as it can't handle all of them anyway. When multiple heartbeat messages and their responses are dropped in such way, leader/follower replicas can be considered by dragonboat as unavailable - when the server is obviously up and running!

The above feature is essential, the protocol is implemented & carefully tested in a way to ensure such dropped messages won't cause correctness issues - that is reporting certain running servers as unavailable is fine in dragonboat. Will your "maintenance persons" be happy about such false "alarms"? Will excessive amount of such false alarms cause other troubles?

On leader nodes, I believe such availability info has already been exposed to users so they get notified for status change. This is to let users know what the leader replica believes. That is all. It is not your resources monitor or availability alarm.

fishjam commented 11 months ago

Sorry for causing you misunderstanding. We do not want to know the server status , but just the raft node status in dragonboat.

lni commented 11 months ago

Ok, thanks for the clarification.

but just the raft node status in dragonboat.

If that is the case, just wondering why would you care some followers' perceptions on other followers and also why would you need to change the protocol for that?

I am also interested how such raft node status can help you. I mean for any of your Raft shard, you can already tell whether it has a current leader, together with the membership information (i.e. what replicas are involved and what nodes they are running on), why you need the library to pin point the particular node that might not be responding to dragonboat messages? If you do have a monitoring system for all your nodes, maybe just pass on the IPs involved in that concerned Raft shard and check all of them to draw your own conclusion on their availability.

Again, as mentioned multiple times in other issues, in my own systems that are built on top of dragonboat, there is always an independent module monitoring involved nodes, the consensus library is just a consensus library, it does one thing (consensus) and one thing only.

kevburnsjr commented 11 months ago

If I want to automatically consider a nodehost dead and reschedule its replicas when it has been missing for a certain timeout (say 5 minutes) then I would need to implement my own heartbeat system which would probably be less efficient than memberlist/SWIM unless I decided to open another port to run a second instance of memberlist which seems redundant.

I think that dragonboat could probably expose nodehost failure detection without using the protocol's heartbeat as a signal.

memberlist is a Go library that manages cluster membership and member failure detection using a gossip based protocol.

Given that dragonboat uses memberlist for gossip and memberlist has failure detection as a feature, it would be nice if I could tap into that failure detection mechanism rather than implementing my own.

If I could pass a memberlist.EventDelegate into the Gossip config, that might be sufficient.

lni commented 11 months ago

f I want to automatically consider a nodehost dead and reschedule its replicas when it has been missing for a certain timeout (say 5 minutes) then I would need to implement my own heartbeat system which would probably be less efficient than memberlist/SWIM unless I decided to open another port to run a second instance of memberlist which seems redundant.

Thanks for the input!

Quite often that you will have to have a dedicated service using its own port/protocol to talk to your dragonboat based application to collect information on current system load to determine whether certain replicas should be moved to a different machine for load balancing purposes anyway. Such a service is expected to be the same one that looks after the availability of your nodehosts as they all do the same thing here - use raft membership changes to move replicas.

When you build the above mentioned service, you will have to determine & collect those stats, as quite often it is not just raft stats, it is usually your state machine stats. No matter whether you decided to push stats from your application to the service, or the service might just pull stats periodically, or maybe even there could be an intermediate component involved, once you have such a service periodically getting workload stats from your nodehost instances, you don't need any extra availability info any more, if that service doesn't get the stats of a certain nodehost instance for a few minutes, you can consider it as failed.

Exposing the memberlist failure detection result is problematic as it is an optional feature, many applications don't use it at all.

If I could pass a memberlist.EventDelegate into the Gossip config, that might be sufficient.

This is possible in the current implementation. Have a look at the changes in #327, it provides a mechanism to allow users to implement and replace the gossip module entirely, what you need to do is to just build a wrapper of the memberlist gossip module in dragonboat with your event delegate registered. This approach does require the memberlist gossip implementation to be publicly accessible, we might be able to change that if necessary.

As mentioned, my goal is to keep dragonboat a simple consensus library that does one thing (consensus) and one thing only. If a certain nice little & useful feature can be easily implemented on top the consensus library, I'd hope it to be built in that way. It might involves some extra code from users, possibly an extra opened port, but that does help to keep the consensus library itself to be simple and clean.

fishjam commented 11 months ago

@lni I already know your thoughts about it. But I'm a little confused, what is the purpose of creating this issue?

kevburnsjr commented 11 months ago

a dedicated service [...] to collect information on current system load to determine whether certain replicas should be moved to a different machine for load balancing purposes anyway.

That's a good point, thanks.

lni commented 11 months ago

what is the purpose of creating this issue?

To keep tracking what else should be added, things like callbacks to be invoked when the execution engine is blocked, when messages are dropped, when leaders failed to be elected after N rounds.

As mentioned earlier, I think it is useful to have a callback to notify the user when certain follower is not reachable by the leader (only on leaders). I had a quick look at what is current available & what is missing for that -

ISystemEventListener.ConnectionFailed() is invoked when connection to a certain follower is lost. It is a pretty close thing but not as straightforward as I like, especially for the fact that there could be multiple connections per remote node.
leaders use internal/raft/remote.go to keep track the state of remote nodes, the retry state is entered when a connection is lost. This basically tells that the remote node is not ready to accept replication messages due to connectivity issues, whether it is because of poor network quality or node availability is unspecified.

PR to expose the above described raft remote state is welcomed if -

it is clearly named & documented as a reflection of the local leader state, rather than some availability monitoring feature. It should be explicitly called something like ReplicationPaused(reason, shardID, replicaID).
just some pipelining code to get the state exposed, no unnecessary changes to the raft package. how dropped replication and read index requests are reported can be used as the example on how to do it.
any change to the internal/raft must be unit tested with >90% coverage.

Thanks.

btw, @fishjam, I shot you an email to your email address listed on github.

lni / dragonboat

More raft state related callbacks #189