rabbitmq / ra

A Raft implementation for Erlang and Elixir that strives to be efficient and make it easier to use multiple Raft clusters in a single system.

Opaque payload support #357

Open kjnilsson opened 1 year ago

kjnilsson commented 1 year ago

Machines such as RabbitMQ quorum queues are used to store potentially large binary data packets that are never evaluated or used to calculate the state machine logic. Currently such data needs to be embedded in the command terms that are written to the Raft log using term_to_iovec/1 in the WAL and segment writer. When a command is applied to the state machine (including during recovery) the command is fully read from the log, but the payload is never used during the apply operation. This is clearly redundant.

It would make sense to support such opaque binary payloads in a more efficient way such that they are only read when needed. Additionally it may be possible to maintain their on-disk representation separately from the Raft log itself, allowing for more efficient snapshots that don't need to include the payload itself. Segment compaction has dependency issues and thus severe limitations.

A design that uses a standard Raft log with snapshotting + truncation, combined with payloads that are deleted / compacted based on liveness information provided by the state machine itself (rather than after a snapshot), could yield a "best of both worlds" approach that would allow Ra to efficiently host, e.g., kv stores with large payload data.
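To make the liveness idea concrete, here is a rough sketch. It is written in Python rather than Erlang purely for illustration, and every name in it (`PayloadStore`, `KvMachine`, `live_indexes`, `compact`) is hypothetical and not part of Ra. It assumes payloads are stored outside the log keyed by Raft index, and that the machine state only keeps the index and size of each payload.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PayloadStore:
    """Hypothetical payload store kept apart from the Raft log, keyed by Raft index."""

    payloads: dict[int, bytes] = field(default_factory=dict)

    def write(self, raft_idx: int, payload: bytes) -> None:
        self.payloads[raft_idx] = payload

    def read(self, raft_idx: int) -> bytes:
        return self.payloads[raft_idx]

    def compact(self, live_idxs: set[int]) -> int:
        """Delete every payload not reported as live; return the bytes reclaimed."""
        dead = [idx for idx in self.payloads if idx not in live_idxs]
        reclaimed = sum(len(self.payloads[idx]) for idx in dead)
        for idx in dead:
            del self.payloads[idx]
        return reclaimed


@dataclass
class KvMachine:
    """Toy kv machine: state maps key -> (raft_idx, size), never the payload bytes."""

    keys: dict[str, tuple[int, int]] = field(default_factory=dict)

    def apply_put(self, raft_idx: int, key: str, size: int) -> None:
        # Only payload metadata is needed to apply the command.
        self.keys[key] = (raft_idx, size)

    def apply_delete(self, key: str) -> None:
        self.keys.pop(key, None)

    def live_indexes(self) -> set[int]:
        """Liveness information the machine would hand to the compactor."""
        return {raft_idx for raft_idx, _size in self.keys.values()}
```

In this sketch an overwritten or deleted key makes its old payload dead immediately, so `compact(machine.live_indexes())` can reclaim it without waiting for a snapshot. Who runs the compaction and how it is made safe are left open here, as they are in the issue.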

RabbitMQ use cases include MQTT retained message storage and delayed / scheduled message delivery; quorum queues themselves may also benefit from this.

kjnilsson commented 1 year ago

Quick thoughts:

Receive / write to cache, written to WAL with entry

Apply (no payload, only payload metadata (size))

Read for replication needs to include payload

Write to payload / segment store (who does this? segment writer?)

Payload reading (log effect)

Payload compaction (who, safety, indexing)

Snapshot replication (negotiate which payloads to replicate)

Snapshot format (include list of all live payload idx (raft idx))
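The last two points could look roughly like the following. This is a speculative sketch in Python, not Erlang, and none of the names (`Snapshot`, `live_payload_idxs`, `payloads_to_send`, `payloads_to_drop`) exist in Ra. It assumes the snapshot carries the set of live payload Raft indexes and that the receiving member can report which payload indexes it already holds.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical snapshot: machine state plus the Raft indexes of live payloads."""

    last_idx: int
    machine_state: dict
    live_payload_idxs: frozenset[int]


def payloads_to_send(snapshot: Snapshot, follower_has: set[int]) -> list[int]:
    """Live payloads the follower is missing and the sender would need to ship."""
    return sorted(snapshot.live_payload_idxs - follower_has)


def payloads_to_drop(snapshot: Snapshot, follower_has: set[int]) -> list[int]:
    """Payloads the follower holds that the snapshot no longer considers live."""
    return sorted(follower_has - snapshot.live_payload_idxs)
```

Under these assumptions the snapshot itself stays small, and only the difference between the live set and what the follower already has needs to cross the network.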