pgte commented 6 years ago

Optimize storage and convergence time in causal CmRDTs

Context

Eventual consistency

Unlike Strongly Consistent systems, Eventually Consistent (EC) systems don't require synchronization between peers in order to modify data. Changes can be done locally or to a small number of replicas and then asynchronously replicated to others, eventually reaching them. These systems are more resistant to network partitions, and thus are suited to being used in decentralized environments where connectivity can be low. The drawback is that, in a given point in time, data is not guaranteed to be synchronized between peers.

Strong Eventual Consistency

Strong Eventual Consistency (SEC) is a stronger type of EC where some properties are guaranteed: when all the replicas have received all the messages independently of their order, they are guaranteed to reach the same state.

Under these constraints, data is still not guaranteed to be equal in all replicas in a given point in time, but data is guaranteed to eventually converge and to be monotonic (a replica never undoes a change). Conflict-free Replicated Data Types are a data type that guarantees Strong Eventual Consistency.

Conflict-Free Replicated Data Types

Conflict-Free Replicated Data Types (CRDTs) are a class of data types that provides strong eventual consistency guarantees, and which has several interesting properties:

Changes are local: All changes are performed locally, without the need to contact other replicas to guarantee a write. The synchronization is done asynchronously, when and if there is network connectivity.
Determinism: No matter what the order of delivery of messages, replicas are guaranteed to converge. This is particularly suitable in poorly connected networks, while also supporting offline editing.

Operation-based CRDTs

CRDTs can be of two different basic types: State-based (convergent or CvRDTs) and operation-based (commutatite or CmRDTs). State-based CRDTs replicate by transmitting their state, while operation-based CRDTs store and transmit operations that must be transmitted exactly once.

Operation DAG

Operation-based CRDTs that enforce causality tend to form a directed acyclic graph (DAG) of operations. A fork in this operation graph represents a divergence, where two or more replicas performed concurrent operations.

The problem

Intuition

When a network partition occurs, each partition can keep editing the document independently, joining once the network enables them to reconnect and sync the changes to each other. This means that the operation graph can keep growing independently. These changes will be based on a common operation ancester, but will keep diverging. Once the network heals, all partitions will propagate the changes to each other, referring back to the common ancester.

Intuitively, we can say that every replica needs to keep the entire operation history around in case there is an offline replica that needs to replicate some time in the future.

This brings some obvious problems:

Unbounded storage size

In a naive implementation, an operation-based CRDT requires an unbounded storage size. This can be a problem for long-running CRDTs, since, although the state size can be bound, the size of the storage required for the operations is not.

Slow join and catch-up

For new replicas that enter a CRDT that has a long history, all these operations have to be sent across the network. Convergence time for newly-joined fresh replicas is then proportional to the size of the history, which is not ideal for long-running CRDTs. This problem is also present in CRDTs that have been offline for a while and need to catch up with the latest operations in other replicas — and replicate the new operations, — where the time it takes to for a replica to catch up with another is proportional to the size of the new operations that need to be transmitted.

Hard requirements

Reduce the time a new replica takes to sync the state by optimizing the amount of messages and their total payload size, without compromising system security.

Keep in mind that replicas can be offline for a long period of time, creating local operations that will be later replicated some time in the future. The ability for other replicas to accept these changes should not be compromised.

(If necessary, define upper bounds for the amount of offline time / divergence this system can sustain).

Soft requirements

Enable defining an upper bound for the total size of local storage (dedicated to storing operations) required for each replica.

Open problems

In any causal CmRDT, is there a protocol that can optimize the time required to synchronize any fresh replica without compromising eventual changes in offline replicas?
In any causal CmRDT, is there a protocol that can optimize the time required for synchronizing any two replicas that have strongly diverged without compromising changes in offline replicas?
In any causal CmRDT, is there a protocol that allows defining an upper bound for the required total size of local storage dedicated to operations without compromising changes in offline replicas?

protocol / research