Closed Jmgr closed 4 years ago
I have added options for pausing individual partitions and for resuming all partitions at once in 8243b87. Note that I have named the partition selection option `PartitionIndices` instead of the perhaps more intuitive `Partitions`, because that function name is already used to select the number of partitions when creating a stream. The stream creation option could be renamed `PartitionCount`, but that would break API compatibility.
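To illustrate the naming constraint, here is a minimal sketch of the functional-options pattern in Go. The option structs, the `ResumeAll` name, and the `buildPauseOptions` helper are assumptions made for this example and are not taken from the PR; only `Partitions` and `PartitionIndices` are names mentioned above, and their signatures here are guesses.

```go
package main

import "fmt"

// createOptions and pauseOptions are hypothetical option holders for this sketch.
type createOptions struct {
	partitions int32 // number of partitions to create
}

type pauseOptions struct {
	partitionIndices []int32 // partitions to pause; empty is assumed to mean "all"
	resumeAll        bool    // assumed: resume every partition at once
}

type StreamOption func(*createOptions)
type PauseOption func(*pauseOptions)

// Partitions sets the partition count at stream creation, so the name is taken.
func Partitions(n int32) StreamOption {
	return func(o *createOptions) { o.partitions = n }
}

// PartitionIndices selects the partitions to pause. It cannot also be named
// Partitions: Go does not allow two package-level functions with the same name.
func PartitionIndices(indices ...int32) PauseOption {
	return func(o *pauseOptions) { o.partitionIndices = indices }
}

// ResumeAll is a hypothetical name for the "resume all partitions at once" option.
func ResumeAll() PauseOption {
	return func(o *pauseOptions) { o.resumeAll = true }
}

// buildPauseOptions applies the given options to a zero-valued pauseOptions.
func buildPauseOptions(opts ...PauseOption) pauseOptions {
	var o pauseOptions
	for _, opt := range opts {
		opt(&o)
	}
	return o
}

func main() {
	o := buildPauseOptions(PartitionIndices(0, 2), ResumeAll())
	fmt.Println(o.partitionIndices, o.resumeAll) // prints "[0 2] true"
}
```

Renaming the creation option to `PartitionCount` would free up `Partitions`, but every existing caller of the creation option would then fail to compile, which is the compatibility break mentioned above.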
This PR proposes an API to “pause” a stream, that is, deactivate its partitions. Publication to one of these partitions using the Liftbridge API “resumes” the stream, restarting its partitions.
Context
Our use case involves a large number of streams, of which only a small fraction will be active at any given point in time (sparse streams).
In case of a server shutdown, we would like to restart only the active streams, not the paused ones. The activity stream would record stream pause and resume events, so that on server startup only the active streams are restored.
The aims of this PR are twofold:
inactive streams are paused so as to avoid the memory and CPU cost of maintaining stream activity and replication when there are no messages;
by passing pause and resume events to a meta-event stream, we can schedule activity for a consumer pool more effectively.
This PR is therefore linked to the activity stream PR (https://github.com/liftbridge-io/liftbridge/pull/169).
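As a rough sketch of the startup idea described above, the set of streams to restore could be derived by replaying recorded pause and resume events in order. The event type, its fields, and the `activeStreams` function are hypothetical and do not reflect the format used by the activity stream PR.

```go
package main

import "fmt"

// eventKind and activityEvent are hypothetical; the real activity stream
// format is defined in the linked PR, not here.
type eventKind int

const (
	streamPaused eventKind = iota
	streamResumed
)

type activityEvent struct {
	stream string
	kind   eventKind
}

// activeStreams replays events in order and returns the streams whose most
// recent event is not a pause. Streams with no events are assumed active.
func activeStreams(known []string, events []activityEvent) map[string]bool {
	active := make(map[string]bool, len(known))
	for _, name := range known {
		active[name] = true
	}
	for _, ev := range events {
		switch ev.kind {
		case streamPaused:
			delete(active, ev.stream)
		case streamResumed:
			active[ev.stream] = true
		}
	}
	return active
}

func main() {
	known := []string{"a", "b", "c"}
	events := []activityEvent{
		{"a", streamPaused},
		{"b", streamPaused},
		{"a", streamResumed},
	}
	// Only "a" and "c" would be restored on startup in this sketch.
	fmt.Println(len(activeStreams(known, events))) // prints "2"
}
```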
Implementation
The implementation proposed in this PR provides a new `PauseStream` API that allows clients to request that a stream be paused. Pausing a stream is very similar to closing a stream; that is, closing all of its partitions. Closing a partition, in turn, means that the server is no longer a leader or a follower for it and that its commit log is also closed. A “paused” flag is also set on the stream’s partitions. This saves CPU and memory and releases unused file handles. One could also imagine an auto-pausing feature, where a stream and its partitions would be paused automatically after a certain duration without any message publication. This would have to be added in a future PR, however.
Resuming a stream occurs when a message is published to one of its partitions. All partitions are then re-created using the same parameters, including the same data directory. The message is then published as usual.
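The pause and resume-on-publish behaviour described above can be sketched with a small in-memory model. This is not the server implementation: the `stream` and `partition` types, their fields, and the data directory layout are assumptions for illustration, and a slice stands in for the commit log persisted on disk.

```go
package main

import (
	"fmt"
	"sync"
)

// partition is a hypothetical stand-in for a server-side partition.
type partition struct {
	id       int32
	dataDir  string   // reused unchanged when the partition is re-created
	paused   bool     // the "paused" flag set on each partition
	open     bool     // whether the commit log is currently open
	messages []string // stand-in for data persisted in dataDir
}

type stream struct {
	mu         sync.Mutex
	name       string
	partitions []*partition
}

func newStream(name string, count int32) *stream {
	s := &stream{name: name}
	for i := int32(0); i < count; i++ {
		s.partitions = append(s.partitions, &partition{
			id:      i,
			dataDir: fmt.Sprintf("data/%s/%d", name, i), // hypothetical layout
			open:    true,
		})
	}
	return s
}

// Pause closes every partition and marks it as paused.
func (s *stream) Pause() {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, p := range s.partitions {
		p.open = false
		p.paused = true
	}
}

// Publish resumes the whole stream if the target partition is paused,
// then appends the message as usual.
func (s *stream) Publish(id int32, msg string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if id < 0 || int(id) >= len(s.partitions) {
		return fmt.Errorf("stream %q has no partition %d", s.name, id)
	}
	if s.partitions[id].paused {
		for _, p := range s.partitions {
			p.open = true // re-created with the same parameters and dataDir
			p.paused = false
		}
	}
	s.partitions[id].messages = append(s.partitions[id].messages, msg)
	return nil
}

func main() {
	s := newStream("orders", 2)
	s.Pause()
	if err := s.Publish(1, "hello"); err != nil {
		fmt.Println("publish failed:", err)
		return
	}
	fmt.Println(s.partitions[0].paused, len(s.partitions[1].messages)) // prints "false 1"
}
```

In this sketch a publish to any paused partition resumes every partition of the stream, which matches the description that all partitions are re-created on resume.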