ipfs-inactive / dynamic-data-and-capabilities

[ARCHIVED] Dynamic Data and Capabilities in IPFS Working Group

map CRDTs on top of IPLD #1

Closed · pgte closed this 6 years ago

pgte commented 6 years ago

So far, this has been the progress on mapping CmRDTs on top of IPLD and IPFS:


CRDTs provide strong eventual consistency over a shared data structure that is replicated among nodes. Replicas can be updated concurrently and without coordination, and any resulting conflicts can always be resolved mathematically. Two types of CRDTs exist: convergent (CvRDTs) and commutative (CmRDTs).

Convergent CRDTs propagate the entire state to other replicas, and then the states are merged by each replica using a merge function that is commutative, associative and idempotent.
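To make the merge properties concrete, here is a minimal sketch (not from the thread) of a state-based G-Counter: state is a map of replica id to count, and merge takes the per-replica maximum, which is commutative, associative and idempotent.

```typescript
// State-based (CvRDT) G-Counter sketch.
type GCounterState = Record<string, number>;

function increment(state: GCounterState, replicaId: string): GCounterState {
  return { ...state, [replicaId]: (state[replicaId] ?? 0) + 1 };
}

// Merge is commutative, associative and idempotent: replicas can exchange
// full states in any order, any number of times, and still converge.
function merge(a: GCounterState, b: GCounterState): GCounterState {
  const result: GCounterState = { ...a };
  for (const [id, count] of Object.entries(b)) {
    result[id] = Math.max(result[id] ?? 0, count);
  }
  return result;
}

function value(state: GCounterState): number {
  return Object.values(state).reduce((sum, n) => sum + n, 0);
}
```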

In contrast, commutative CRDTs (CmRDTs) are operation-based: each replica propagates state by transmitting only the update operations. The operations are commutative but, unlike a CvRDT merge, they are not idempotent, which means the communication infrastructure must ensure that every operation on a replica is delivered to the other replicas without duplication, though in any order.
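A tiny sketch (names assumed) of why delivery guarantees matter for op-based types: the operations commute, but duplicating one changes the result.

```typescript
// Op-based (CmRDT) counter sketch.
type IncrementOp = { type: 'increment'; amount: number };

function applyOp(state: number, op: IncrementOp): number {
  return state + op.amount;
}

const a: IncrementOp = { type: 'increment', amount: 1 };
const b: IncrementOp = { type: 'increment', amount: 3 };

// Commutative: order of delivery does not matter.
applyOp(applyOp(0, a), b) === applyOp(applyOp(0, b), a); // true

// Not idempotent: delivering the same op twice over-counts, so the
// transport must deliver each operation exactly once.
applyOp(applyOp(0, a), a) !== applyOp(0, a); // true
```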

Data types with commutative operations can be trivially implemented as pure op-based CRDTs using standard reliable causal delivery (1). As proposed by Carlos Baquero et al. in (1), we can implement a tagged reliable causal broadcast that provides causality information upon delivery, relieving the data type implementation from carrying causal information inside each operation. With causal operation delivery provided by the transport, data types only need to a) create operation messages based on the current local state and b) reduce received operations into the new version of the state.
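A sketch of that division of labour (the interface names here are hypothetical, not the peer-crdt API): the broadcast layer handles causal delivery, and the data type only prepares operations from local state and reduces delivered operations into new state.

```typescript
// Shape of a pure op-based data type once causal delivery is provided
// by the transport layer.
interface OpBasedType<State, Op> {
  initial(): State;
  // (a) create an operation message from the current local state
  prepare(state: State, ...args: unknown[]): Op;
  // (b) reduce a causally delivered operation into the next state
  effect(state: State, op: Op): State;
}

// Example instance: a grow-only set.
const gSet: OpBasedType<Set<string>, { add: string }> = {
  initial: () => new Set(),
  prepare: (_state, element) => ({ add: String(element) }),
  effect: (state, op) => new Set(state).add(op.add),
};
```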

Watch Pedro give a talk at an IPFS meetup in January 2018 where he describes this:

Here is some progress in libraries implementing this:

pgte commented 6 years ago

A more detailed description of the peer-crdt IPLD model:

We can then model this causal operation log on top of IPLD, so that we can create custom CRDTs on top of it.
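One way to picture the log on IPLD (field names are assumptions, not the peer-crdt schema): each operation becomes an IPLD node that links to its causal parents, so the log forms a Merkle DAG and a replica's state is identified by its HEAD CID(s). The CID type below comes from the multiformats package.

```typescript
import { CID } from 'multiformats/cid';

// Hypothetical shape of one log entry as an IPLD (dag-cbor) node.
interface LogEntry {
  op: unknown;                     // the CRDT operation payload
  parents: CID[];                  // entries this operation causally depends on
  clock?: Record<string, number>;  // optional vector clock for causality checks
}
```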

So far, these are the evolving requirements for this data structure:

pgte commented 6 years ago

Some shortcomings of this approach:

Garbage collection in CRDTs

Garbage collection is an active research topic but, to my knowledge, it boils down to initiating and reaching consensus on a checkpoint. To reach consensus, you have to know the participating replicas and then get a majority of them to agree on that checkpoint. The checkpoint would then consist of the state of the replica at that point (entry CID). So, as far as I know, we would need a consensus protocol attached to the CRDT in order to get garbage collection / compaction.
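A sketch only, under the checkpoint assumption above (the record shape and names are mine, and the consensus protocol itself is not shown): replicas would need to agree on a record like this before the log entries below the checkpoint can be discarded.

```typescript
// Hypothetical checkpoint record for CRDT log compaction.
interface Checkpoint {
  stateCid: string;              // CID of the materialized state at the checkpoint
  headCid: string;               // CID of the log entry the checkpoint covers
  acks: Record<string, string>;  // replicaId -> signature over (stateCid, headCid)
}

function canCompact(checkpoint: Checkpoint, knownReplicas: string[]): boolean {
  // A majority of the known replicas must have acknowledged the checkpoint.
  const ackCount = knownReplicas.filter((id) => id in checkpoint.acks).length;
  return ackCount > knownReplicas.length / 2;
}
```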

Achieve the same or a better level of performance when using a CRDT over IPLD as we get when using direct connections.

In a different iteration of CRDTs on top of IPFS, we applied the Y.js protocol on top of libp2p pubsub and libp2p multiplexed connections. After the initial sync, the data of each operation was pushed as part of the broadcast message. In the current approach, instead of getting changes pushed, each replica pulls the entries that are not contained in its current HEAD, resolving them through IPLD. This is slow, especially because traversing a graph of missing entries may require many round-trips. As a mitigation, instead of just broadcasting the current HEAD, replicas also broadcast the chain of parent nodes (up to X entries). This is still a large bottleneck for real-time use.
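A sketch of that mitigation (field names and the ancestor limit are assumptions): a head announcement carries the chain of ancestor CIDs alongside the new HEAD, so peers that are only slightly behind can fetch missing entries without walking the DAG one round-trip at a time.

```typescript
const MAX_ANCESTORS = 64; // "up to X entries"

// Hypothetical pubsub message announcing a replica's new HEAD.
interface HeadAnnouncement {
  head: string;        // CID of the replica's current HEAD entry
  ancestors: string[]; // CIDs of the HEAD's parents, grandparents, ... up to MAX_ANCESTORS
}

function buildAnnouncement(chain: string[]): HeadAnnouncement {
  // `chain` is the replica's log from newest to oldest, as CID strings.
  return { head: chain[0], ancestors: chain.slice(1, 1 + MAX_ANCESTORS) };
}
```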

How to audit a CRDT (operation history similar to version history from version control)

CRDTs should contain enough information so that, for any change to the shared data structure, the application user can determine who was responsible.
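One way this could look (a sketch, assuming each log entry is made attributable): every entry embeds the author's identity and a signature over the operation and its parents, so any replica can later answer "who made this change?" by walking the log, much like a version-control history.

```typescript
// Hypothetical attributable log entry.
interface SignedLogEntry {
  op: unknown;
  parents: string[];   // parent entry CIDs
  author: string;      // author's peer id / public key
  signature: string;   // signature over (op, parents) by `author`
}

function auditTrail(entries: SignedLogEntry[]): Array<{ author: string; op: unknown }> {
  // Produce a version-history-style listing of who did what.
  return entries.map((e) => ({ author: e.author, op: e.op }));
}
```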

pgte commented 6 years ago

I consider this done; the snapshot, garbage collection and audit work is tracked in other issues.