A few questions on CRDTs

@russelldb

Hi Russell!

I'm learning about CRDTs and have run into a fair amount of your content on how to architect them (thanks!). There's a few approaches that I'm a little confused by. Appreciate your help and time.

The main CRDT I've been studying is the delta-state CRDT map. Most implementations I've seen will structure this as a normal map with a version vector at the root and a dot list at each node. My understanding is that when a replica makes an edit to a node, it will update the version vector (by incrementing its counter) and append a new dot (using the same counter value) to the dot list at that node. You mention in one of your articles that if another replica makes an edit to that same node as well, when the local replica synchronizes with it, it will append the remote's value and dot to that node. The end result is one node with two concurrent values and two dots in the dot list.

There are several things I find confusing about this: 1) How do I know which value to "use" at the node? I understand that CRDT maps are composable, so the node value could behave as a LWW or a MVV, etc. If it was an LWW for example, I imagine that when a replica wants to read the value I would look at both dots in the dot list (assume the dots have a physical timestamp attached) and return the value associated with the more recent dot. This brings up a further question - why wouldn't I apply the LWW right away during sync? So instead of appending the remote dot and storing both values, instead I would a) compare the version vectors and notice a concurrent update at the same node and b) overwrite the dot list to only store the most recent dot and update the node to the associated value. I would still merge the root version vectors, so I know that the document has "seen" and merged all the remote changes. But this way, I wouldn't need to store multiple values for the same node, especially when one of those values is never returned. In my mind, it's not clear to me why a dot list is necessary, at least for LWW registers. I can see why you would need them for a MVV. 2) When does the dot list shrink? I imagine that when a local replica makes an edit to a node, it will reset the dot list and append a single dot to that node. I assume that merging will increase the dot list size (in the case of concurrent updates) but only a local edit (having seen all the concurrent updates) is able to shrink the dot list again.

At the end of your article, you also make the comment:

Diffs do require some more metadata, and for tracking removes they require some kind of tombstone. In Ditto's old Remove Wins Map the tombstone was a single dot. With Add Wins Maps this is more complex, and that is the subject of my next post!

This is perplexing to me as well - that a single dot tombstone is not sufficient for add wins behavior. In my mind I don't see why this is the case. If one replica adds a value and tags it with a dot and another replica removes the values, and inserts a tombstone/dot, when I synchronize, wouldn't I simply notice a concurrent edit to the set item, and then decide to use the "add" dot, not the "tombstone" dot?

Thanks so much for reading this, I really appreciate your time/help.

Hi @widavies, Thanks for the questions. I'll do my best to answer.

The main CRDT I've been studying is the delta-state CRDT map. Most implementations I've seen will structure this as a normal map with a version vector at the root and a dot list at each node. My understanding is that when a replica makes an edit to a node, it will update the version vector (by incrementing its counter) and append a new dot (using the same counter value) to the dot list at that node. You mention in one of your articles that if another replica makes an edit to that same node as well, when the local replica synchronizes with it, it will append the remote's value and dot to that node. The end result is one node with two concurrent values and two dots in the dot list.

This isn't quite how it is done. Each entry (node) has a set of dots. When a replica makes an edit it will replace the existing set with the single dot for that edit. This makes sense because all those observed (concurrent) dots are joined at superceded by the new event. The set of dots at a node capture the concurrency at that node. If there is no concurrency, there is a single dot.

You mention in one of your articles that if another replica makes an edit to that same node as well, when the local replica synchronizes with it, it will append the remote's value and dot to that node. The end result is one node with two concurrent values and two dots in the dot list.

Hmmm. No. It depends. Maybe you could link me to the article for reference? In the riak_dt and Ditto map we capture type conflict. If two sites update a node in the map with different types (Site A increments a counter at a.b.c and Site B updates a register at a.b.c) then we maintain both types. But in this case there are actually TWO entries in the map. The crucial thing we do in riak_dt_map is we don't name entries for key. If you insert a property a in the map, it will actually be a node in the map (a, KIND) where KIND is one of counter, map, register or any other supported type.

For concurrent edits to the same KIND then the dotsets of the two replicas are merged, and the CRDT semantics of the embedded type are used to resolve the conflict.

why wouldn't I apply the LWW right away during sync?

We do! But then you have to deal with oddness on concurrent remove and update of a property that contains a register. This is pretty complex, and involves more than 2 replicas. But the basic issue is Site A sets register a to X at time 1. Site B sets register a to Y at time 2. Now there is a Site C that will remove a. Depending on what it receives from Site A and Site B (order, timing etc) we can end up in the state where Site C remove the value Y but has not observed the valye X. The AddWins map semantics say that the update on Site A WINS over the remove on Site C. But the register value? Well we resolve that using temporal order. So what should the value be? if you say X then different interleavings lead to non-determinism. One way to avoid this is have the LWW not resolve on sync. Another is to treat the remove on Site C as setting the register to NULL and then removing it.

When does the dot list shrink? As I said above, when an update occurs the dot set contains a single value.

when I synchronize, wouldn't I simply notice a concurrent edit to the set item, and then decide to use the "add" dot, not the "tombstone" dot?

You can do this. But then the remove has NO EFFECT. Imagine a file system. On Site A I remove folder X which has 100 files in it (thus removing all 100 files). On Site B you add a new file to folder X, now it has 101 files. Using the method you describe all 101 files would survive on sync, where is with a more complex tombstone, the 100 files can be removed on sync, and the only part of the Add that wins is the existence of X and the one unseen file that you inserted.

This is further complicated by the need to send diffs only and distinguising "I didn't send you something because you've seen it" from "I didn't send you something because it no longer exists"

The diff and AddWins tombstone thing is pretty complex, and I need to write that article.

Let me know if any of this is helpful, or likewise, unclear.

Thanks again for reaching out.

Oh, I should say, all of the above are choices/trade-offs around semantics and system behaviour. You can select others and still be a CRDT. The semantics of the merge of CRDTs is always going to be up for debate, since none of them can ever really capture human intention fully. They're all weird in some ways in some set of interleavings.

@russelldb Thanks so much - this is super helpful! I do have a few follow up questions. Thanks again for all your help.

You mention that the values for concurrent edits are both maintained if there is a type conflict. That makes sense.

Next you mention:

For concurrent edits to the same KIND then the dotsets of the two replicas are merged, and the CRDT semantics of the embedded type are used to resolve the conflict.

How exactly are the dot sets merged? Assume two (concurrent) edits are made to a LWW (non-conflicting types). This means each replica has one dot in their dot set for that entry. When I merge the entries, I imagine that I would a) overwrite the value with the more recent edit (assuming the dot version is a HLC, so I have physical time). and b) the new dot set for each replica is the union of the dot sets of the replicas before merging. and finally c) merge the version vectors together.

The question I have is - why not just use a single dot instead of a dot set in this situation? In this scenario, two concurrent edits are made to the LWW entry. When merging, does anything go wrong if I a) overwrite the value with the more recent edit and b) overwrite the single dot (instead of a dot set) with the more recent single dot of the two and finally c) merge the version vectors? It seems to me that the version vectors would indicate that the less recent dot is contained in the history, so it would merge properly? When I think about it it seems like there may be some edge case where only maintaining the most recent dot (amongst the other concurrent dots) would produce an incorrect diff.

The article I was referring to is the one that contains this diagram:

I think what initially confused me is how entry a.b.c is synced. Instead of Peer E adding 12 to 45 and sending 57 to Peer A, it sends both dots and both values, which are both kept and stored. However, it seems to me that entry a.b.c is a CRDT counter, so a unique count needs to be maintained for every replica and this wouldn't be the case for a LWW.

Fortunately for the app I'm working on, my schema is fixed, so I don't allow key removals, which simplifies a lot.

If I'm using a fixed schema - I think a single tombstone dot would be sufficient for add wins right? In other words, I have an entry that is an add wins observed remove set. Every item in this set would have a dot set and each dot would have a deleted flag. Then, I would simply treat each item in the AWORSet like a LWW merge. When concurrent edits are made (remove and an add), the dot set is merged to contain all concurrent dots. If at least one of these concurrent dots has deleted=false, the item is considered not deleted, otherwise, if all the concurrent dots have deleted=true, then the item is considered deleted.

Thanks so much, let me know if you have a Patreon or anything I could donate to for all your help.

russelldb / russelldb.github.io

A few questions on CRDTs #8