ipfs-inactive / dynamic-data-and-capabilities

[ARCHIVED] Dynamic Data and Capabilities in IPFS Working Group

CRDT: Garbage Collection #2

Closed pgte closed 5 years ago

pgte commented 6 years ago

Problem:

Operation-based CRDTs work by nodes storing and forwarding operations to other nodes, in an append-only, ever-growing graph of operations.

This has a few drawbacks:

pgte commented 6 years ago

Ideally, all participating nodes should agree on a snapshot at some point in time; that snapshot would be what is transmitted first to newly joining nodes, followed by the operations.

Garbage collection is an active research topic but, to my knowledge, it boils down to initiating and being able to get consensus on a checkpoint. To get consensus, you have to know the participating replicas and then get a majority of those replicas to agree on that checkpoint. The checkpoint would then consist of the state of the replica at that point (entry CID). So, as far as I know, we would need a consensus protocol attached to the CRDT in order to get garbage collection / compaction.

pgte commented 6 years ago

Interesting discussion around ORDTs and garbage collection in this research issue: https://github.com/ipfs/research-CRDT/issues/32

marcoonroad commented 6 years ago

With a hash of the previous operations embedded within further operations, nodes can participate in the network with just the latest entries. It's the sort of thing which both Git and the blockchain use for storage optimizations & validations. Checkpoints could be hard-coded: for example, after 1000 operations, everyone needs to be synchronized on that point to start a new chain from scratch. The checkpoint could be regarded as a new "genesis block" in some sense. After that, all the previous operations/entries could be pruned without any problem.
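A minimal sketch of that idea (all names are illustrative, not an existing API), assuming every replica somehow synchronizes at the checkpoint, which is exactly the hard part discussed below:

const crypto = require('crypto')

const CHECKPOINT_INTERVAL = 1000
const hashOf = (value) =>
  crypto.createHash('sha256').update(JSON.stringify(value)).digest('hex')

// Each entry links to the previous one by hash, like a Git commit; once the
// interval is reached, the materialized state becomes a new "genesis" entry
// and the older entries can be pruned.
function append (log, op, materialize) {
  const prev = log.length ? hashOf(log[log.length - 1]) : null
  log.push({ prev, op })
  if (log.length >= CHECKPOINT_INTERVAL) {
    return [{ prev: null, genesis: true, state: materialize(log) }]
  }
  return log
}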

marcoonroad commented 6 years ago

(I've not read that paper, though. I'll read it as soon as possible.)

pgte commented 6 years ago

@marcoonroad sounds like a good strategy in general, but since there is no consensus layer in CRDTs, how do we make this converge? My problem is that given that the local state is an operation tree that may be different in every node, I wouldn't know when to generate a checkpoint. Any insights?

marcoonroad commented 6 years ago

Oh, you're right. CRDTs are consensus-free. There's no single valid Byzantine-generals story to agree upon, since all Byzantine stories are valid and eventually a derived common story is seen by all nodes.

marcoonroad commented 6 years ago

Oh, and sorry. Synchronization on a distributed/P2P network is almost impossible. I had forgotten that entirely! CRDTs are quite a complex & recent topic.

gpestana commented 6 years ago

I've been thinking about this problem for a bit but TBH, haven't checked much literature on the subject. So, I'm sorry if I'm suggesting something which might be trivial or useless because it has been described somewhere else 😅

Rationale

I agree with @pgte when he says that GC seems to boil down to 'initiating and being able to get consensus on a checkpoint.'

My initial idea on this would be for every replica to keep track of what it knows about the network: more specifically, which replicas exist, what their perspectives on the network are, and the last CRDT operation each of them applied. This way, we can decide locally when all the replicas agree on a state that can be snapshotted. The network state is kept locally in a network_map; although it should be kept alongside the CRDT, the map itself is not a CRDT and it should not be merged between replicas. Rather, each replica maintains its own network map.

If we make sure that the network_map is kept up to date locally every time a) some replica changes its local state and b) a new replica joins the network, each replica can decide locally when to make the snapshot without the danger of missing concurrent operations, while new replicas are safe too.

How?

The operation-based CRDT keeps a map with its current network visibility. The map's key is a replica (ReplicaID) and the value consists of the last operation (operationID) applied by the respective replica, as well as the replicas that node knows exist in the network.

E.g. with replicas R1 and R2, and latest operation O1 applied locally in both, both replicas would have the following network map:

network_map = {
    R1: { 
        last_operation: 'O1',
        peers: ['R1', 'R2'] 
    },
    R2: { 
        last_operation: 'O1',
        peers: ['R1', 'R2'] 
    }   
}

network_map in R1 and R2

When all the values in the local network map are deep-equal, the replica can safely create a snapshot and GC the tombstones.

When a new replica (R3) joins the network, it asks one of the existing replicas for the current state (e.g. asks R1). So, we get the following network maps in each replica:

network_map = {
    R1: { 
        last_operation: 'O1',
        peers: ['R1', 'R2'] 
    },
    R2: { 
        last_operation: 'O1',
        peers: ['R1', 'R2'] 
    }   
}

network_map in R1

network_map = {
    R1: { 
        last_operation: 'O1',
        peers: ['R1', 'R2'] 
    },
    R2: { 
        last_operation: 'O1',
        peers: ['R1', 'R2', 'R3']   
    },
    R3: { 
        last_operation: 'O1',
        peers: ['R1', 'R2', 'R3']   
    }   
}

network_map in R2 and R3

In this case, although the last operation is the same, R2 and R3 cannot snapshot the CRDT, since R1 doesn't know about R3 yet and it could have changed its local state between the moment when R1 and R2 were in consensus and the moment when R3 joined the network.

Every time a replica applies an operation locally, it updates its own entry in the network_map with the new last_operation ID. When the replica broadcasts a new operation, it also broadcasts its current visibility of the network.

When a replica receives a new network_map from the network, it will update its own version of the network map in the following way:

1) If there is a new replica key in the network map, add it.
2) If a replica (which is not itself) was already part of its network_map but has a new last_operation, update the value accordingly.

Every time the network_map is updated locally, the replica checks whether all the replicas in the map have the same view of the network (all see the same replicas and have the same last operation ID). If this happens, then the local replica must snapshot the CRDT.
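A rough sketch of these two steps (hypothetical helpers, not existing code; a real implementation would still need causal ordering to decide which view is newer):

// `local` is this replica's network_map, `remote` one received via gossip.
function mergeNetworkMap (local, remote, selfId) {
  for (const [replicaId, view] of Object.entries(remote)) {
    if (!(replicaId in local)) {
      local[replicaId] = view // rule 1: a replica we didn't know about
    } else if (replicaId !== selfId &&
               view.last_operation !== local[replicaId].last_operation) {
      local[replicaId] = view // rule 2: a known replica has a new HEAD
    }
  }
  return local
}

// Snapshot only when every entry is deep-equal: same last_operation and the
// same set of known peers everywhere.
function canSnapshot (networkMap) {
  const views = Object.values(networkMap).map(v =>
    JSON.stringify({ op: v.last_operation, peers: [...v.peers].sort() }))
  return views.length > 0 && views.every(v => v === views[0])
}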

--

The main goal of the network_map is to allow replicas to agree locally on an eventual snapshot: when, locally, we are sure that all the replicas have the same visibility of the network and have their CRDTs in the same state. It seems to me that this approach is somewhat naive, and it may be hard for a network with a sufficient number of replicas to reach a state in which all replicas have the same idea of the network and state. However, I find it likely that there are improvements which could lead to more snapshots between states. E.g. if we store historical network maps, we can reach consensus on a point in the past, snapshot there and then garbage collect the tombstones and past network maps.

This is only a conceptual idea, and I haven't found any case yet in which this snapshotting consensus model goes wrong. I will try to draw some interaction diagrams and try to find flawed cases.

pgte commented 6 years ago

I think I understand this proposal, which I think can be summed up as: when I think all the nodes have the same view of the CRDT, I make a snapshot.

One question: when nodes diverge (make concurrent edits), each will have a different HEAD operation. Although their state will be eventually equal, the last operation is potentially different: they can receive operations in different orders, etc.

So I think you can generalise this to a map where you keep track of each node's latest seen operation.

Another question: why do you need to gossip the known state? The way I see it, you can update the local view from the CRDT gossip that already happens. When a node says "I have a new HEAD and it's this", you can update your local network representation immediately.

Another question: after updating the local network view, another node in the CRDT can continue to perform local operations concurrently while this replica is computing the snapshot. Do you think this could be a problem? If so, should we create the snapshot inside the operation tree (saying something like "this snapshot is a child of operations A, B and C")?
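Something like the following illustrative shape (not an actual format), where the snapshot references the operations it summarizes as its parents, so operations created concurrently simply remain siblings of the snapshot:

const snapshot = {
  type: 'snapshot',
  parents: ['<CID of op A>', '<CID of op B>', '<CID of op C>'],
  state: { /* materialized CRDT value at this point */ }
}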

@gpestana thoughts?

pgte commented 6 years ago

@gpestana ah, I realize that you still need gossip to know what is the network view of other nodes (the list of nodes).

vitorenesduarte commented 6 years ago

Hi everyone. It seems that you're talking about stability.

It's something like this: a message m is stable on some node if all operations this node will deliver from now on will be in the future of m.

Note that stability is a local property, meaning that, if m is stable on some node, it's not necessarily stable on all.

gpestana commented 6 years ago

Thanks for the comments @pgte. My thoughts:

One question: when nodes diverge (make concurrent edits), each will have a different HEAD operation. Although their state will be eventually equal, the last operation is potentially different: they can receive operations in different orders, etc.

Good point! The last_operation should be the oldest operation applied in the local document in terms of causal order. Or maybe, to ensure that non-dependent operations are not missing, how about a list of all operations applied to each replica? (Or a hash of the list.)

ah, I realize that you still need gossip to know what is the network view of other nodes (the list of nodes).

Exactly. The idea is for the node to keep not only its own view of the network but also everyone else's view. This way we can decide locally whether - and when - it's safe to snapshot.

Another question: after updating the local network view, another node in the CRDT can continue to perform local operations concurrently while this replica is computing the snapshot.

Yep, that's a good point. But I believe it should not influence the snapshotting point, since after an operation is applied locally, the local network map will be 'tainted'. Only when all the replicas in the network map have applied the operation set is the snapshot point reached. And this can be checked by inspecting the local network map.

If so, should we create the snapshot inside the operation tree (saying something like "this snapshot is a child of operations A, B and C")?

That sounds like a good solution too. Would that mean appending a snapshot once in a while, with replicas picking up the snapshot once all the operations it depends on (A, B, C in your case) have been fulfilled? If so, what do you think about the overhead?

@vitorenesduarte hey there! That's a very good point. Do you know of any way to reach some sort of "strong stability" in which, locally, you know that all replicas have reached common ground (applied all operations) at some point in the past? This is what this network map is trying to achieve, by basically keeping everyone's view of the network locally, along with their last HEAD / list of operations applied in each replica.

pgte commented 6 years ago

@gpestana Ah, sorry, I was thinking more of how, using this, you could create a snapshot (to perhaps sync remote nodes faster), but this is directly related to garbage collection.

Which means that a replica is going to truncate all the operations below this "snapshotting point", right?

Let's then consider this scenario:

Does this make sense, is it a problem?

vitorenesduarte commented 6 years ago

@gpestana A consequence of all future operations being in the future of some stable operation m is that all nodes have applied m. It is a local property in the sense that nodes may not know all the operations that have been applied by all. But if some node A says m is stable and another node B does not see it as stable yet, m is stable, and B will know it "soon".

gpestana commented 6 years ago

@pgte that makes sense. Let's say this network has only 3 replicas (rA, rB, rC). The assumption is that if rC is part of the network, rA or rB know about it and keep rC in their network maps (because either of them shared the state with rC when it joined the network - let's say rC was the last node to join the network).

Thus, rA and rB can only truncate the document when rC has the same network map as them, so that all nodes know that everyone exists and that a minimum set of operations was applied in each replica.

This relies on the fact that every time some replica (rN) joins the network, it will ask at least one replica (rA) for the current document. rA then adds rN to the network map. From now on, rA will gossip in its network map that rN exists. So whenever other replicas see rA's new network map, they will add rN to their network maps, which will create a dependency so that no one will truncate the document before time.

I'm trying to find a way to write down/make diagrams of these interactions 😅

pgte commented 6 years ago

@gpestana an interaction diagram of this would be super :)

So, at one time rB can be the only one to know about rN (maybe because rB just now discovered rN). rA doesn't yet know about rN; it thinks that the network is constituted only by rA and rB. rA has an inconsistent view of the network, which leads it to perform garbage collection. If, by doing so, rA removes operations that are in the past of new operations now being created by rN, rA will not have, in the future (without more than the local information it has), the possibility of reasoning about the cause of these new operations created by rN.

@gpestana makes sense?

Alexis211 commented 6 years ago

I have no knowledge of how IPFS works but I was just reading about CRDTs and thinking about garbage collection. It's funny that I stumbled upon this discussion just as it is happening.

Here is my idea: can we get nodes to agree implicitly on which CRDT states are checkpoints by observing a common rule? For example we could calculate a hash of the state and decide to checkpoint only if the first n bits of the hash are zeroes (just like proof of work), then calibrate n so that a checkpoint state is on average generated every few minutes. When such a checkpoint state is found it is broadcast to the network. Every node rebases its changes upon the latest discovered checkpoint state.
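For illustration, a replica could test its current state against such a rule roughly like this (sketch only; the hash function and encoding are placeholders):

const crypto = require('crypto')

// A state is a checkpoint when its hash starts with at least nBits zero bits,
// so every replica recognizes the same checkpoints without coordination.
function isCheckpoint (state, nBits) {
  const digest = crypto.createHash('sha256').update(JSON.stringify(state)).digest()
  let zeros = 0
  for (const byte of digest) {
    if (byte === 0) { zeros += 8; continue }
    zeros += Math.clz32(byte) - 24 // leading zero bits within this byte
    break
  }
  return zeros >= nBits
}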

Sorry if this is stupid or irrelevant.

pgte commented 6 years ago

@Alexis211 yes, I think that's a valid protocol where you replace explicit consensus with implicit consensus. But I think this is better suited to creating snapshots (for faster sync) and can't safely be used to truncate the operation log (because of the nodes that may be offline). If you're interested in this snapshotting discussion, you can check issues/14#issuecomment-387657177.

pgte commented 6 years ago

adding @satazor and @marcooliveira, they're interested in GC for CmRDTs.

pgte commented 6 years ago

So I've recently learned about causal stability (from @vitorenesduarte and a colleague I cannot find on Github). Let me see if I get this right:

A given operation (or, more generally, message) is causally stable if all operations yet to be received are in the future of this operation. This means that, if an operation is stable, we don't expect more operations that are concurrent with this operation, and thus we can compact the operation log of all operations preceding it.

So, in order for a replica to be able to compact the log, it needs to know which operations are causally stable. This can be achieved by a) knowing the entire replica membership and b) storing the latest vector clock for each replica.

This can be achieved by having every operation received from a replica carry causal information in the form of a vector clock. For every operation that can be causally delivered (for which we have delivered all the dependent operations), we store that vector clock.

For instance, if we have replicas A, B and C, in replica A we can build a table containing all the vector clocks replica A knows for each peer:

replica | VC
A | (A:0, B:0, C:0)
B | (A:0, B:0, C:0)
C | (A:0, B:0, C:0)

As time progresses, replica A starts receiving operations and will then update this table to, for instance:

replica | VC
A | (A:2, B:5, C:3)
B | (A:2, B:5, C:2)
C | (A:1, B:5, C:3)

Replica A can then infer which message is causally stable by doing a point-wise minimum on all vector clocks.

min((A:2, B:5, C:3), (A:2, B:5, C:2), (A:1, B:5, C:3)) = (A:1, B:5, C:2)

We can then infer that all the operations with vector clocks less than or equal to (A:1, B:5, C:2) can be compacted.
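A small sketch of that computation (illustrative helpers; vector clocks here are plain objects keyed by replica id):

function pointwiseMin (clocks) {
  const stable = {}
  for (const id of Object.keys(clocks[0])) {
    stable[id] = Math.min(...clocks.map(vc => vc[id]))
  }
  return stable
}

// An operation can be compacted when its clock is <= the stable clock in
// every component.
const dominatedBy = (vc, stable) =>
  Object.keys(vc).every(id => vc[id] <= stable[id])

const table = {
  A: { A: 2, B: 5, C: 3 },
  B: { A: 2, B: 5, C: 2 },
  C: { A: 1, B: 5, C: 3 }
}
pointwiseMin(Object.values(table)) // => { A: 1, B: 5, C: 2 }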

What about nodes joining?

When a new node D joins, it has to introduce itself to another node and then get its state. Let's say that node D connects to node C to get the most recent state. Now, node C knows about node D, which will start from the state that node C has. Even if node C progresses while node D is bootstrapping, node C still knows about node D, will create a corresponding entry in its "causal stability table", and so will never compact history below it.

What about nodes leaving?

The problem with this approach is that a node should have a leave protocol. If a node knows it's leaving, it should notify other nodes so that it gets removed from the table. If a node crashes and is removed from that table, either:

pgte commented 6 years ago

(Found @gyounes on Github. Thank you!)

gpestana commented 6 years ago

Yes, that's a very good approach to finding a snapshot point locally, IMO. I was trying to explain the same thing above, but you did it so much better and more clearly! Also, I recently found out that this is similar to the idea of maintaining matrix clocks locally, instead of vector clocks. With matrix clocks the local replica maintains a view of the whole network and locally decides when to snapshot - just as you mentioned.

I was concerned with what would happen if some node left the network permanently, creating a 'snapshot deadlock'. As you mentioned, if a leaving mechanism and a node timeout are in place, then it should be fine!

I recently read some papers presenting similar solutions:

pgte commented 6 years ago

Yes, it's a similar approach, with the same effect. One important thing I forgot to mention: removing a replica requires coordination amongst the remaining nodes. Since a remaining node can have operations from a node about to be removed, it's required to query all nodes about this before effective removal.

gyounes commented 6 years ago

@pgte, so we didn't get the time to talk about how op-based CRDTs are used in IPFS. Also, are the merkle trees that you use causal?

pgte commented 6 years ago

@gyounes I'm deferring this to a specific issue I created for this purpose. :)

vitorenesduarte commented 6 years ago

I was wondering why the current implementation does not require something like VCs, and, from what I understand, it's because all causal information is encoded in the DAG of operations. This probably means that VCs won't be needed to detect which operations are stable. If a node knows the HEADs from all the other nodes, it could compute the greatest common ancestor (GCA) [1] of all these HEADs, and this GCA and all preceding operations are stable.

[1] https://en.wikipedia.org/wiki/Lowest_common_ancestor (replaced lowest by greatest, since lowest seems to refer to the relative height and not the order)
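For illustration, with an in-memory map from each operation's CID to its parents' CIDs (a hypothetical structure, not the actual implementation), the stable set is the intersection of the ancestor sets of all known HEADs:

function ancestors (dag, head) {
  const seen = new Set()
  const queue = [head]
  while (queue.length) {
    const cid = queue.pop()
    if (seen.has(cid)) continue
    seen.add(cid)
    for (const parent of dag.get(cid) || []) queue.push(parent)
  }
  return seen // includes `head` itself
}

// Every operation that is an ancestor of all HEADs (the GCA and everything
// preceding it) is stable and can be compacted. Assumes at least one HEAD.
function stableOperations (dag, heads) {
  return heads
    .map(head => ancestors(dag, head))
    .reduce((acc, set) => new Set([...acc].filter(cid => set.has(cid))))
}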

pgte commented 6 years ago

@vitorenesduarte yes, that's a good point. The thing with vector clocks is that this computation can be done without I/O, while with the other we need to traverse the DAG nodes. On the other hand, the latest DAG nodes can be kept in a memory cache, but it will still be less efficient than using VCs.

gritzko commented 6 years ago

My personal intuition is rooted in Web systems, where lots of clients come and go. It is impossible to differentiate temporary departures from permanent departures. Generally, if a system can have flash crowds then VC is not a good option.

amark commented 6 years ago

@pgte & others, I'm very confused - I've spent ~9 years on CRDTs and the last 4 building GUN ( https://github.com/amark/gun ) which is the most popular CRDT library out there and one of the most popular P2P/decentralized projects in general. I am confused because:

Most CRDTs resolve state and compact automatically - thus why they are called commutative. Why is GC even being discussed?

The only CRDTs that don't are ones that need to emulate zero-sum or centralized logic, for instance a counter (the most classic CRDT, admittedly, but one that can be implemented in as little as 12 lines of code), which need history.

It is because they are zero-sum/centralized/Newtonian that they need a history, not because CRDTs in general need a history. GUN's does not need history; it automatically converges state as it goes, and can already handle terabytes/daily traffic on decentralized Reddit in a P2P mesh network.

pgte commented 6 years ago

@amark Even though there may have been some conflict-free structures and work before that, the term was officially coined in the seminal paper in 2011...

About garbage-collection:

As described in the seminal paper, while state-based CRDT messages are commutative, operation-based CRDTs use an operation log and require causal message delivery, hence the need for a log compaction strategy that is safe.

In recent days I've had some CRDT researchers kindly explain to me and the team how to do garbage collection in operation-based CRDTs, some of them commenting in this very issue; I recommend you check it out if you want to understand why this problem matters.

gritzko commented 6 years ago

I'm very confused - I've spent ~9 years on CRDTs

The term was formally defined in 2011, I believe.

handle terabytes/daily traffic on decentralized Reddit

It does not look that crowded though.

Most CRDTs resolve state and compact automatically - thus why they are called commutative

Regarding the "commutative" part: https://en.wikipedia.org/wiki/Commutative_property http://swarmdb.net/articles/acid2/

amark commented 6 years ago

@pgte @gritzko I know, I didn't hear the term/phrase CRDT until several years later. Doesn't mean people weren't working on them prior to 2011.

Obviously my bias is for state-based CRDTs; having an op-based CRDT just sounds like a lot of event-sourcing / append-only log work (I abandoned this approach in 2012/2013 when I couldn't get it to scale), whereas the state-based approach does scale.

@gritzko it isn't, but it still pushed a lot of traffic, so I know it can scale (I desperately had to fix some hiccups that they had, and I'm still dealing with some storage issues, but the network is pretty stable now, although traffic has declined since launch).

I hope you aren't being sarcastic dropping a Wikipedia article. The fact that an operation is commutative also means its subsequent operations are commutative with the result of the previous operations - i.e., the result is the compaction of the prior operations: you don't need to store the history.

gritzko commented 6 years ago

I know, I didn't hear the term/phrase CRDT until several years later. Doesn't mean people weren't working on them prior to 2011.

That's for sure. I had Causal Trees working in 2008 when B. Cohen pointed out their similarity to weave. And weave is a grey-bearded classic of revision control (1972). https://bramcohen.livejournal.com/49535.html?thread=765311#t765311

Obviously my bias is for state-based CRDTs; having an op-based CRDT just sounds like a lot of event-sourcing / append-only log work (I abandoned this approach in 2012/2013 when I couldn't get it to scale), whereas the state-based approach does scale.

I can't disagree more.

You don't need to store the history

Add-to-log is a perfectly commutative op if we produce a sorted, timestamped log. Most of the time the last op can simply be appended; in the case of concurrent changes we may need to bubble it up to fix the order. Such an operation is commutative, associative and idempotent.
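A sketch of such an insert (illustrative only), ordering by timestamp with the origin id as a tie-breaker, so delivering the same ops in any order, any number of times, yields the same log:

function addToLog (log, op) {
  const before = (a, b) => a.ts < b.ts || (a.ts === b.ts && a.origin < b.origin)
  if (log.some(e => e.ts === op.ts && e.origin === op.origin)) return log // idempotent
  let i = log.length
  while (i > 0 && before(op, log[i - 1])) i-- // bubble up past later entries
  log.splice(i, 0, op)
  return log
}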

amark commented 6 years ago

@gritzko nice!!! If Bram 👍 it, then that gives high confidence it is good. I recently had the honor of meeting him after doing a lightning talk before him, and got to chat with him in his office for a few hours. If you live in the Bay Area, let's meet (I just moved here, finally!).

Sorting logs is ~ O(N) which === my comment about it not being scalable. Could you elaborate on why you disagree?

gritzko commented 6 years ago

Sorting logs is ~ O(N) which === my comment about it not being scalable. Could you elaborate on why you disagree?

Sorting is O(N log N). That is for the entirety of the process, assuming unsorted inputs. Simply loading/saving data is O(N), obviously. Appending is O(d).

pgte commented 5 years ago

Right now we opted for ∂-state CRDTs (delta-state CRDTs) to avoid having to deal with this issue. May reopen this thread in the future if we decide to use op-based CRDTs.

amark commented 5 years ago

@pgte why the persistent culture of not collaborating with other projects, but then just trying to copycat them?

pgte commented 5 years ago

@pgte why the persistent culture of not collaborating with other projects, but then just trying to copycat them?

@amark What are you referring to?

amark commented 5 years ago

At one point during the summit, Feross got up and demoed video streaming/torrenting in the browser (a demo he has done for several years now) - then the next speaker got up, and announced IPFS was working on a new experimental custom video streaming/torrenting library. That would be fine if IPFS / WebTorrent were not aware of each other, but we've all been talking and demoing our tech to each other since 2014.

There are other examples, like I replied in the other thread with CRDTs and GUN. If you guys change your mind, let us know - we're quite open and the chatroom is very friendly.

pgte commented 5 years ago

@amark this is totally off topic and out of our control; those were community projects. If you want to contribute to the specific problem stated in this issue, please do.