liftbridge-io / liftbridge

Lightweight, fault-tolerant message streams.
https://liftbridge.io
Apache License 2.0

Progressive shutdown - shedding #317

Open Jmgr opened 3 years ago

Jmgr commented 3 years ago

Currently, when a Liftbridge server shuts down, it stops being the leader for its partitions. If many partitions exist, this results in a flurry of Raft events. Would it be possible to trigger a progressive shutdown to prevent this? Have you given this some thought, @tylertreat?

tylertreat commented 3 years ago

Yes, this is something I've thought a bit about, especially as it relates to rolling cluster upgrades. I think a graceful shutdown would make sense. There would be a few components to this:

  1. If the server is the leader for any partitions, transfer leadership to another replica (invoke a ChangeLeaderOp in Raft) and remove self from ISR (ShrinkISROp). This should be done gradually to avoid a flood of Raft ops. Also interrupt any clients currently subscribed.
  2. If the server is a follower for any partitions, remove self from ISR (ShrinkISROp). This should be done gradually to avoid a flood of Raft ops. Also interrupt any clients currently subscribed.
  3. At this point, probably reject any client requests, e.g. publish or subscribe.
  4. If the server shutting down is the metadata leader, transfer leadership to another node. Perform a Raft barrier to ensure all preceding Raft ops have been applied.
  5. Remove self from Raft group. Need to think through how this works when rejoining, e.g. in the case of restarting/upgrading a node.
  6. Shut down the server.
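
The "done gradually" part of steps 1 and 2 could be sketched roughly as below. This is only an illustration under assumptions, not Liftbridge code: `raftOp`, `batches`, `drainGradually`, the batch size, the pause, and the partition names are all hypothetical, and the op passed in stands in for whatever would actually propose a ChangeLeaderOp or ShrinkISROp.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// raftOp is a hypothetical stand-in for proposing one Raft operation for a
// single partition, e.g. a leadership change or an ISR shrink.
type raftOp func(ctx context.Context, partition string) error

// batches splits partitions into consecutive groups of at most size elements.
func batches(partitions []string, size int) [][]string {
	if size < 1 {
		size = 1
	}
	var out [][]string
	for start := 0; start < len(partitions); start += size {
		end := start + size
		if end > len(partitions) {
			end = len(partitions)
		}
		out = append(out, partitions[start:end])
	}
	return out
}

// drainGradually applies op to every partition, one batch at a time, pausing
// between batches so the cluster never sees one large burst of Raft ops.
// It returns the number of partitions processed and stops early on the first
// error or when ctx is cancelled.
func drainGradually(ctx context.Context, partitions []string, size int, pause time.Duration, op raftOp) (int, error) {
	done := 0
	for i, batch := range batches(partitions, size) {
		if i > 0 {
			// Wait between batches, but give up promptly if ctx ends.
			timer := time.NewTimer(pause)
			select {
			case <-ctx.Done():
				timer.Stop()
				return done, ctx.Err()
			case <-timer.C:
			}
		}
		for _, p := range batch {
			if err := op(ctx, p); err != nil {
				return done, fmt.Errorf("partition %s: %w", p, err)
			}
			done++
		}
	}
	return done, nil
}

func main() {
	// Hypothetical partitions this server currently leads.
	led := []string{"orders-0", "orders-1", "events-0", "events-1", "events-2"}

	// Placeholder op: a real server would propose the Raft operation here.
	transfer := func(ctx context.Context, p string) error {
		fmt.Println("transferring leadership for", p)
		return nil
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	n, err := drainGradually(ctx, led, 2, 50*time.Millisecond, transfer)
	fmt.Println("processed", n, "partitions, err:", err)
}
```

The same helper could be run once for led partitions (step 1) and once for followed partitions (step 2). Bounding the whole drain with a context deadline keeps a slow cluster from blocking shutdown indefinitely. What to do when the deadline is hit, such as falling back to an abrupt stop, is a separate decision not covered here.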