nats-io / nats-streaming-server

NATS Streaming System Server
https://nats.io
Apache License 2.0

[FIXED] Clustering: leadership acquired actions could get stuck #1287

Closed kozlovic closed 1 year ago

kozlovic commented 1 year ago

If a leadership change occurred while the leadership-acquired actions were executing, before the raft.Barrier() call was made, the server would get stuck in that call. This is because the RAFT library notifies the Streaming server code of a leadership change through a Go channel that had a capacity of just 1. Since the Streaming server reads from the channel and then executes the leadership-acquired code, it could not read from the notification channel in the meantime, which caused the RAFT library to block on a Go channel send, which in turn made the Barrier() call block.
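
The blocking pattern can be reproduced with plain Go channels. The following is only a minimal sketch, not the server's code: `producerStuck`, `notifyCh`, and the number of follow-up notifications are hypothetical names and values chosen for illustration, and the goroutine merely stands in for the RAFT library. The consumer takes one notification and then stops reading, as if it were busy with its leadership-acquired actions.

```go
package main

import (
	"fmt"
	"time"
)

// producerStuck reports whether a notifier is still blocked after `wait`.
// The notifier sends 1+followUps alternating leadership notifications on a
// channel with the given capacity. The consumer reads only the first one and
// then never reads again, modeling a server busy with leadership actions.
func producerStuck(capacity, followUps int, wait time.Duration) bool {
	notifyCh := make(chan bool, capacity)
	done := make(chan struct{})

	// Stand-in for the RAFT library (assumption: it uses a plain blocking send).
	go func() {
		defer close(done)
		isLeader := true
		for i := 0; i < 1+followUps; i++ {
			notifyCh <- isLeader // blocks once the buffer is full
			isLeader = !isLeader
		}
	}()

	// Consumer: take "leadership acquired", then stop reading.
	<-notifyCh

	select {
	case <-done:
		return false // every notification fit in the buffer
	case <-time.After(wait):
		return true // the notifier is stuck on a send (its goroutine leaks here)
	}
}

func main() {
	// Capacity 1: the first follow-up fits in the buffer, the second blocks.
	fmt.Println(producerStuck(1, 2, 100*time.Millisecond)) // true
	// A single follow-up still fits, so nothing blocks.
	fmt.Println(producerStuck(1, 1, 100*time.Millisecond)) // false
}
```

In this model, anything that waits on the notifier making progress is stuck as well, which is the role Barrier() plays in the description above.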

I believe the right approach is to use a bigger notification Go channel instead of making Barrier() time out. If it does time out, the server should then transfer leadership, which I am afraid could cause a cascading effect if every server that gets elected needs longer than the chosen timeout to apply all the preceding entries to the FSM.
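
To illustrate why a larger buffer helps, here is a small sketch under stated assumptions: `queueWhileBusy` is a hypothetical helper, and the capacity of 16 is an arbitrary illustrative value, not the size used by this change. It counts how many notifications can be queued while the consumer is not reading, using a non-blocking send to detect the point where a real blocking send would stall.

```go
package main

import "fmt"

// queueWhileBusy models a consumer that is busy and does not read from the
// notification channel. It returns how many notifications fit in the buffer
// before a send would have blocked.
func queueWhileBusy(capacity int, notifications []bool) int {
	notifyCh := make(chan bool, capacity)
	for i, isLeader := range notifications {
		select {
		case notifyCh <- isLeader:
			// Queued without blocking.
		default:
			// Buffer full: a blocking send would stall the notifier here.
			return i
		}
	}
	return len(notifications)
}

func main() {
	// Leadership lost, acquired, lost again while the consumer is busy.
	flips := []bool{false, true, false}

	fmt.Println(queueWhileBusy(1, flips))  // 1: the second send would block
	fmt.Println(queueWhileBusy(16, flips)) // 3: all notifications are queued
}
```

A larger buffer does not remove the limit, it only raises it far enough that the notifier is unlikely to stall before the consumer returns to reading the channel.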

Signed-off-by: Ivan Kozlovic <ivan@synadia.com>

coveralls commented 1 year ago

Coverage Status

Coverage: 91.526% (+0.04%) from 91.49% when pulling ee84146d19962dbc059040f0d83f899122387d30 on fix_leadership_acquired into 2af2beb736afa0abb58d91f5e8a8bd4ecfccd187 on main.