ipfs / notes

IPFS Collaborative Notebook for Research
MIT License
400 stars 33 forks source link

δ-CRDTs #406

Open pgte opened 6 years ago

pgte commented 6 years ago

Delta State Replicated Data Types

Paper: https://arxiv.org/pdf/1603.01529.pdf

Conflict-Free Replicated Data Types (CRDTs) are distributed datatypes that allow different replicas of a distributed CRDT instance to diverge and ensures that, eventually, all replicas converge to the same state.

State-based CRDTs work by replicating the entire state in each message and by providing a merge function that merges two states (the remote state and the local state) into a new state that includes both. They provide strong eventual consistency (SEC) even though they do not require a reliable delivery mechanism. If the state of a certain peer has not been propagated, it will eventually be propagated in the future. Here, losing messages is not a big problem.

One downside of state-based CRDTs is that the size of the messages is proportional to the size of the state, which, for some non-trivial types, can become a problem.

On the other hand, replicas in operation-based CRDTs communicate the operations to each-other. This makes message size smaller, but brings some new problems:

Delta State CRDTs (δ-CRDTs) address these concerns by rethinking State-based CRDTs. While being able to transmit the entire state, δ-CRDTs can also improve the message size by defining delta-mutators and delta-state. This delta-state is typically much smaller in size than the entire state.

δ-mutators

Rather than shipping the entire state, δ-CRDTs ship a representation of the effect a recent update operation had on the state, while preserving the idempotent nature of join (the function that takes two δ-states or two states or one state and one δ-states and returns a new state). Preserving this ensures convergence over unreliable communication, since merging two states is an idempotent operation.

The state is still a join-semilattice that now results from the join of multiple fine-grained states (deltas), generated by δ-mutators.

δ-mutators are a new version of the type-specific mutators that return the effect of these mutators on the state, which is called a delta-state.

These delta-states can then be retained in a buffer to be shipped individually or joined in groups, instead of shipping the entire state. The changes are then incorporated by other replicas by joining the shipped delta-states with their own states.

In a δ-CRDT, the effect of applying a mutation, represented by a delta-mutation δ = mδ(X), is decoupled from the resulting state X′ = X ⊔ δ, which allows shipping this δ rather than the entire resulting state X′.

Also:

If the causal order of operations is not important and the intended aim is merely eventual convergence of states, then delta-groups can be shipped using an unreliable dissemination layer that may drop, reorder, or duplicate messages. Delta-groups can always be re-transmitted and re-joined, possibly out of order, or can simply be subsumed by a less frequent sending of the full state, e.g., for performance reasons or when doing state transfers to new members.

Central equation

m(X) = X ⊔ mδ(X)

This equation states that applying a delta-mutator and joining into the current state should produce the same state transition as applying the corresponding mutator of the standard CRDT.

The challenge here is finding a non-trivial decomposition such that delta-states are smaller than the resulting state.

size(mδ(X)) ≪ size(m(X))

A very simple example: δ Grow-only Counter

A state-based grow-only counter CRDT (GCounter) is defined by:

A δ GCounter has the same definition as the traditional GCounter, but, instead of an inc mutator, it has a inc-δ mutator:

δ GCounter

The basic anti-entropy algorithm

This paper then presents a basic anti-entropy algorithm that ensures eventual convergence in δ-CRDTs:

Anti-entropy algorithm

For the node corresponding to replica i, the durable state that persists after a crash is simply the δ-CRDT state Xi. The volatile state D stores a delta-group that is used to accumulate deltas before eventually sending them to other replicas.

When an operation is performed, the corresponding delta-mutator is applied to the current state Xi, generating a delta d. This delta is joined both with Xi, to produce a new state, as well as with D.

A node sends its messages periodically, where the message payload is either the Delta-group Di or the full state Xi. This decision is taken by the function choose i, which returns one of them. To simplify, a node simply broadcasts the messages without distinguishing between neighbours.

After each send, the delta-group is reset to the initial state.

Once a message is received, the payload d is joined into the current state. Here, the algorithm operates in two modes:

The decisions of whether to send a delta-group versus the full state (typically less periodically), and whether to use the transitive or direct mode are out of the scope of this paper. In general, decisions can be made considering many criteria like delta-group size, state size, message loss distribution assumptions, and network topology.

Causal consistency anti-entropy algorithm

Traditional state-based CRDTs converge using joins of the full state, which im- plicitly ensures per-object causal consistency.

Therefore, it is desirable to have δ-CRDTs offer the same causal-consistency guarantees. This raises questions about how can delta propagation and merging of δ-CRDTs be constrained and expressed in an anti-entropy algorithm.

For this, this paper introduces a particular kind of delta-group, which is called a delta-interval, which is a continuous range of delta, where the order of deltas represent causality.

The causal anti-entropy algorithm defines that a delta-interval is only joined into states that reflect the state into which the first delta was previously joined.

Causal anti-entropy algorithm

This algorithm distinguishes between neighbour nodes, only sending them delta-groups that obey the delta-merging condition and can thus be joined at the receiving node.

To acheive that, it stores (additionally to the state):

When executing an operation, the deltas in D i are updated by associating c i to the delta produced by the mutator.

When receiving a remote delta, a node joins and updates the CRDT state, updates D i by associating the current c i with that delta, and then increments c i. At the end of this, the receiving node sends back an acknowledge message to the sender node, confirming that it has integrated that epoch.

Periodically, a node picks a random node from the set of neighbour nodes. It then figures out what to send. It can either send:

When a node receives an acknowledge message from another node, it updates the ack map (A i) with the opoch between the received and the stored one (which, when not yet known, defaults to 0).

Some examples of more complex CRDTs

To come in this issue: show and explain some examples of δ-CRDTs that are more complex than a counter. Subscribe to be notified!

pgte commented 6 years ago

Some examples of more complex CRDTs

This first example of a δ-CRDT does not require distinguished node identifiers for the mutations.

But before we do this, we need to introduce some concepts: a Pair and a Lexicographic Pair of join semi-lattices.

Pair

Pair

The join of a pair is the coordinate-wise join of the pair components.

Lexicographic Pair

Lexicographic Pair

In this construction, the first pair takes priority when establishing the result of a join. The join of the second component is only performed when there is a tie.

Anonymous δ-CRDT

δ-GSet (Delta Grow-only Set)

GSet

A GSet is a set that does not allow removals, hence the only operation, named insert. This operation produces a set with only one element, the element just added. This delta, when joined with the state, produces the inflated state. The join of two sets is accomplished trivially by unionizing both sets.

δ-2PSet (Delta Two-Phase Set)

2PSet

This type allows the removal of elements by the state having an additional set: the set of elements removed. The resulting delta-set of the operations (insert and remove) is also a pair, where the first element contains a singleton set with the element added, and the second element contains the element removed.

The join of two states is trivially achieved by performing the pair-wise join.

The elements can be computed by returning all the elements in the first set (the added elements) that are not on the second set (the removed elements).

The shortcoming of this simple design is that once removed, elements cannot be re-added.

δ-CRDT Add-Wins LWW (Last Write Wins) Set

δ-CRDT Add-Wins LWW Set

This set works by tagging each element with a timestamp. Each time an element is added, it's tagged with a client-supplied timestamp and the boolean True. Removed elemens are also tagged this way, but with the boolean False. When joining two sets, those elements in common compete and the one with the highest timestamp. (For that, the authors use a Lexicographic Pair, where the join uses the order of the first element to define the outcome.)

δ-CRDT Remove-Wins LWW (Last Write Wins) Set

A dual construction to the Add-Wins LWW Set is a Remove-Wins LWW Set, where remove operations take priority on the event of a timestamp tie. This construction has been widely deployed in production as part the SoundCloud system.

Named δ-CRDTs

To come in this issue: show and explain some examples of named δ-CRDTs. Subscribe to be notified!

pgte commented 6 years ago

Named δ-CRDTs

When each node has an unique identifier, each replica is able to only change a specific part of the state. Previously, this paper introduced a delta-based GCounter, where the state tracks the count in each node.

In the notation used, the distinction is that mutations use the replia identifier i.

δ-PNCounter (Delta Positive Negative counter)

By composing a pair, where the first contains increments and the second contains decrements, we're able to create a delta CRDT that allows increments as well as decrements:

PNCounter

The value is defined recurring to the value of the GCounter type which, if you remember, was the sum of all the values in the state map.

δ-LexCounter (Delta Lexicographical Counter)

There is an alternative to keeping a separate count of increments and decrements.

A Lexicographical counter is updated by either incrementing or decrementing the second component of the lexicographic pair corresponding to the replica issuing the mutation. A decrement also increments the first component, making it win upon a lexicographical join.

When join two states, for each node in each of the states, the PNCounter selects the maximum of both sets. In contrast, a Lexicographical Counter keeps two counter: one with the actual counter and a monotonically increasing counter, so that, when merging the state for the same node, the most recent version wins. (Basically, it works like a vector clock, but without the neire ed to increase only when an increment happens — an increment already inflates the state).

Here is the entire definition of a LexCounter:

LexCounter

pgte commented 6 years ago

Just released: implementation of delta state-based CRDTs in Javascript: https://github.com/ipfs-shipyard/js-delta-crdts

pgte commented 5 years ago

Delta-crdts demo: https://www.youtube.com/watch?v=Cn9pIX8BWIU (video).

haiitch commented 5 years ago

This is great. Do you envision any of this be available through the go-ipfs api?

On 17 August 2018 at 11:48, Pedro Teixeira notifications@github.com wrote:

Delta-crdts demo: https://www.youtube.com/watch?v=Cn9pIX8BWIU (video).

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ipfs/research-CRDT/issues/31#issuecomment-413889911, or mute the thread https://github.com/notifications/unsubscribe-auth/AABbokwZ3HxdHA4w8Lo-6Nm-Lsc2Rn0pks5uRtfJgaJpZM4TS_51 .

pgte commented 5 years ago

@htrob currently there are no plans for it, but there's peer-star, which underneath uses js-ipfs and delta-crdts.