ipfs-inactive / dynamic-data-and-capabilities

[ARCHIVED] Dynamic Data and Capabilities in IPFS Working Group

map CRDTs on top of IPLD #1

Closed · pgte closed this 6 years ago

pgte commented 6 years ago

So far, this has been the progress on mapping CmRDTs on top of IPLD and IPFS:


CRDTs provide strong eventual consistency over a shared data structure that is replicated among nodes. Replicas can be updated concurrently and without coordination, and any resulting conflicts can always be resolved mathematically. Two types of CRDTs exist: convergent (CvRDTs) and commutative (CmRDTs).

Convergent CRDTs propagate the entire state to other replicas, and then the states are merged by each replica using a merge function that is commutative, associative and idempotent.
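To make the merge properties concrete, here is a minimal sketch (not from the thread) of a state-based G-Counter: state is a map of replica id to count, and merge takes the per-replica maximum, which is commutative, associative and idempotent.

```typescript
// State-based (CvRDT) G-Counter sketch.
type GCounterState = Record<string, number>;

function increment(state: GCounterState, replicaId: string): GCounterState {
  return { ...state, [replicaId]: (state[replicaId] ?? 0) + 1 };
}

// Merge is commutative, associative and idempotent: replicas can exchange
// full states in any order, any number of times, and still converge.
function merge(a: GCounterState, b: GCounterState): GCounterState {
  const result: GCounterState = { ...a };
  for (const [id, count] of Object.entries(b)) {
    result[id] = Math.max(result[id] ?? 0, count);
  }
  return result;
}

function value(state: GCounterState): number {
  return Object.values(state).reduce((sum, n) => sum + n, 0);
}
```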

In contrast, commutative CRDTs (CmRDTs) are operation-based: each replica propagates state by transmitting only the update operations. The operations are commutative but, unlike a CvRDT merge, they are not idempotent, which means the communication infrastructure must ensure that every operation on a replica is delivered to the other replicas without duplication, though in any order.
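A tiny sketch (names assumed) of why delivery guarantees matter for op-based types: the operations commute, but duplicating one changes the result.

```typescript
// Op-based (CmRDT) counter sketch.
type IncrementOp = { type: 'increment'; amount: number };

function applyOp(state: number, op: IncrementOp): number {
  return state + op.amount;
}

const a: IncrementOp = { type: 'increment', amount: 1 };
const b: IncrementOp = { type: 'increment', amount: 3 };

// Commutative: order of delivery does not matter.
applyOp(applyOp(0, a), b) === applyOp(applyOp(0, b), a); // true

// Not idempotent: delivering the same op twice over-counts, so the
// transport must deliver each operation exactly once.
applyOp(applyOp(0, a), a) !== applyOp(0, a); // true
```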

Data types with commutative operations can be trivially implemented as pure op-based CRDTs using standard reliable causal delivery (1). As proposed by Carlos Baquero et al. in (1), we can implement a tagged reliable causal broadcast that provides causality information upon delivery, relieving the data type implementation from carrying causal information inside each operation. With causal operation delivery provided by the transport, data types only need to a) create operation messages based on the current local state and b) reduce received operations into the new version of the state.
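A sketch of that division of labour (the interface names here are hypothetical, not the peer-crdt API): the broadcast layer handles causal delivery, and the data type only prepares operations from local state and reduces delivered operations into new state.

```typescript
// Shape of a pure op-based data type once causal delivery is provided
// by the transport layer.
interface OpBasedType<State, Op> {
  initial(): State;
  // (a) create an operation message from the current local state
  prepare(state: State, ...args: unknown[]): Op;
  // (b) reduce a causally delivered operation into the next state
  effect(state: State, op: Op): State;
}

// Example instance: a grow-only set.
const gSet: OpBasedType<Set<string>, { add: string }> = {
  initial: () => new Set(),
  prepare: (_state, element) => ({ add: String(element) }),
  effect: (state, op) => new Set(state).add(op.add),
};
```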

Watch Pedro give a talk at an IPFS meetup in January 2018 where he describes this:

Here is some progress in libraries implementing this:

pgte commented 6 years ago

A more detailed description of the peer-crdt IPLD model:

We can then model this causal operation log on top of IPLD, so that we can create custom CRDTs on top of it.
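One way to picture the log on IPLD (field names are assumptions, not the peer-crdt schema): each operation becomes an IPLD node that links to its causal parents, so the log forms a Merkle DAG and a replica's state is identified by its HEAD CID(s). The CID type below comes from the multiformats package.

```typescript
import { CID } from 'multiformats/cid';

// Hypothetical shape of one log entry as an IPLD (dag-cbor) node.
interface LogEntry {
  op: unknown;                     // the CRDT operation payload
  parents: CID[];                  // entries this operation causally depends on
  clock?: Record<string, number>;  // optional vector clock for causality checks
}
```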

So far, these are the evolving requirements for this data structure:

pgte commented 6 years ago

Some shortcomings of this approach:

Garbage collection in CRDTs

Garbage collection is an active research topic but, to my knowledge, it boils down to initiating and reaching consensus on a checkpoint. To reach consensus, you have to know the participating replicas and then get a majority of them to agree on that checkpoint. The checkpoint would then consist of the state of the replica at that point (entry CID). So, as far as I know, we would need a consensus protocol attached to the CRDT in order to get garbage collection / compaction.
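A sketch only, under the checkpoint assumption above (the record shape and names are mine, and the consensus protocol itself is not shown): replicas would need to agree on a record like this before the log entries below the checkpoint can be discarded.

```typescript
// Hypothetical checkpoint record for CRDT log compaction.
interface Checkpoint {
  stateCid: string;              // CID of the materialized state at the checkpoint
  headCid: string;               // CID of the log entry the checkpoint covers
  acks: Record<string, string>;  // replicaId -> signature over (stateCid, headCid)
}

function canCompact(checkpoint: Checkpoint, knownReplicas: string[]): boolean {
  // A majority of the known replicas must have acknowledged the checkpoint.
  const ackCount = knownReplicas.filter((id) => id in checkpoint.acks).length;
  return ackCount > knownReplicas.length / 2;
}
```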

Achieve the same or a better level of performance when using a CRDT over IPLD as we get when using direct connections.

In a different iteration of CRDTs on top of IPFS, we applied the Y.js protocol on top of libp2p pubsub and libp2p multiplexed connections. After the initial sync, the data of each operation was pushed as part of the broadcast message. In the current approach, instead of getting changes pushed, each replica pulls the entries that are not contained in its current HEAD, resolving them through IPLD. This is slow, especially because traversing a graph of missing entries may require many round-trips. As a mitigation, instead of just broadcasting the current HEAD, replicas also broadcast the chain of parent nodes (up to X entries). This is still a large bottleneck for real-time use.
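A sketch of that mitigation (field names and the ancestor limit are assumptions): a head announcement carries the chain of ancestor CIDs alongside the new HEAD, so peers that are only slightly behind can fetch missing entries without walking the DAG one round-trip at a time.

```typescript
const MAX_ANCESTORS = 64; // "up to X entries"

// Hypothetical pubsub message announcing a replica's new HEAD.
interface HeadAnnouncement {
  head: string;        // CID of the replica's current HEAD entry
  ancestors: string[]; // CIDs of the HEAD's parents, grandparents, ... up to MAX_ANCESTORS
}

function buildAnnouncement(chain: string[]): HeadAnnouncement {
  // `chain` is the replica's log from newest to oldest, as CID strings.
  return { head: chain[0], ancestors: chain.slice(1, 1 + MAX_ANCESTORS) };
}
```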

How to audit a CRDT (operation history similar to version history from version control)

CRDTs should contain enough information so that, for any change to the shared data structure, the application user can determine who was responsible.
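One way this could look (a sketch, assuming each log entry is made attributable): every entry embeds the author's identity and a signature over the operation and its parents, so any replica can later answer "who made this change?" by walking the log, much like a version-control history.

```typescript
// Hypothetical attributable log entry.
interface SignedLogEntry {
  op: unknown;
  parents: string[];   // parent entry CIDs
  author: string;      // author's peer id / public key
  signature: string;   // signature over (op, parents) by `author`
}

function auditTrail(entries: SignedLogEntry[]): Array<{ author: string; op: unknown }> {
  // Produce a version-history-style listing of who did what.
  return entries.map((e) => ({ author: e.author, op: e.op }));
}
```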

pgte commented 6 years ago

I consider this done; the snapshot, garbage collection and audit work is tracked in other issues.