rust-crdt / rust-crdt

a collection of well-tested, serializable CRDTs for Rust
Apache License 2.0
1.38k stars 59 forks source link

Support for recursive data types (e.g. JSON) ? #51

Open ngortheone opened 5 years ago

ngortheone commented 5 years ago

Hi @davidrusu

It looks like this library doesn't support for recursive data types (JSON) Are there any plans / obstacles for implementing something like ditto has ?

Is it possible to implement recursive data type based on existing simple CRDTs (List, Map)?

russelldb commented 5 years ago

It's worth pointing out Ditto's list and map CRDTs (that underpin the json document) are not collections of CRDTs. The list CRDT does not merge pairwise elements on merge, the Map is a LWW multivalue register, again, not merging contents recursively on merge.

riak_dt_map is a recursive CRDT holder data type (from back 2012) and there are others out there now (in fact, isn't the map in rust-crdt a recursive structure?)

ngortheone commented 5 years ago

@russelldb

isn't the map in rust-crdt a recursive structure?

Thanks for pointing out. After closely looking at the code I think that it might be. I will try it out in detail and share the results.

davidrusu commented 5 years ago

Yep, Map is indeed recursive, but it has the limitation that every value in the map is of the same type so it does not behave like a JSON CRDT where values may have different structures if that's what your goal is.

@ngortheone you're links on https://github.com/rust-crdt/rust-crdt/issues/50#issuecomment-537108463 are interesting. I certainly would not mind if a PR was opened with a port of the JSON CRDT to rust ;)

ngortheone commented 5 years ago

I have discovered CRDTs for myself only recently I feel like I still miss a lot of basics (never had a formal education in this area), but I'll see what I can do.

ngortheone commented 5 years ago

@davidrusu

Here are my thoughts on A Conflict-Free Replicated JSON Datatype:

Briefly:

Now in detail:

Since JSON data type is recursive and arbitrarily deep and schema-less authors had to solve some interesting problems: Because operations can happen on different levels of the tree there is really no good way of tracking causality and having sound merge semantics. Authors of the paper use one vector clock for the whole document.This implies that there is no way of specifying a different kind of CRDT for one of the document fields (e.g. counter vs register) In order to keep CRDT properties (partial order, lub) of edits authors came up with Cursor API that emits Operations. It seems to me that implementation of ) suffers from 2 kinds of unbounded growths - tombstones and buffer of operations. ( I asked the question here https://github.com/fthomas/crjdt/issues/83)

There is no defined way to transmit the whole state, each replica keeps all operations. In case a new replica arrives - it will get all operations to converge. Hence the aforementioned GC problems

What is your current stance/approach on garbage collection? It looks like you avoided GC (orswot - no tombstones) as far as I can see. And this topic is still not researched good enough - I did not find any materials about GC implementations.

davidrusu commented 5 years ago

I'm reading through the paper now, I'll have more well formed thoughts after but here's my thoughts

One vector-clock for the whole document

I would rephrase this as "We have a top level vclock contains the latest version from each actor", We need each nested value to also have it's own local vclock to make sure you can handle the "reset-remove" semantics.

ie. with starting state {"counts": {"a": 1}} and concurrent edits:

  1. actor 1 deletes the "counts" key ==> {}
  2. actor 2 adds the "b" key to "counts" ==> {"counts": {"a": 1, "b": 1}}

On sync, the result should be {"counts": {"b":2}. In other words, everything from "counts" seen by actor 1 has been removed, but the "b": 1 is new information unseen by 1. To track this we'd need a vclock within the "counts": {..} sub-CRDT.

No way of treating some field differently (e.g if you want a counter instead of a register)

The riak_dt map handles this by making the key's to the map crdt a tuple (<type>, <key_str>), e.g. (mv_reg, "user_name"). From my quick skim of the paper it seemed they do something similar? I'll have to read it more carefully.

Cursor API

Sounds good, the alternative is to use a nested Op structure similar to what we have for the Map crdt: https://github.com/rust-crdt/rust-crdt/blob/master/src/map.rs#L67-L74

We'll just need to make sure that we still maintain a causal context for data extracted out of the CRDT (to be displayed on a screen for example). Have you taken a look at the "ReadCtx"/"AddCtx"/"RmCtx" in this project, the main ideas is that all reads of the CRDT structure come with the necessary vclocks to safely make modifications to the CRDT, the examples are fleshed out more in the README.

Only Op based replication

That's a fair to start, eventually I'd love to see some support for state based replication for this CRDT as well, (the rest of the the CRDT's in this project are all hyprids, supporting both state and op based replication)

Tombstones / GC

With some modifications, we might be able to swap out the tombstone CRDT's to the non-tombstone variants, I'm not a fan of tombstones for the same reasons you state. To start it's probably safest to do a faithful port of the CRDT from the paper and then perform the surgery.

I'll probably have more thoughts once I've absorbed the paper more, but my take would be to do a faithful implementation and then we can iron out the details of adding state based replication and removing tombstones.

JMLX42 commented 4 years ago

Hello,

I would love to know if there has been any progress on this.

Also, considering:

may I suggest to name this a "Document" CRDT (cf the NoSQL lingua) instead of a JSON CRDT?

JMLX42 commented 4 years ago

Regarding how to represent strings, I would assume LSEQ can be used (AFAIK that's what ditto does for it's Text CRDT). But LSEQ does not implement Causal so it cannot be used as a Map value.

I tried using LWWReg<String, u8>, but the same applies: it does not implement Causal.

What would be the best CRDT (combination) to represent a string value in a recursive Map-based JSON-like document?

The riak_dt map handles this by making the key's to the map crdt a tuple (, ), e.g. (mv_reg, "user_name"). From my quick skim of the paper it seemed they do something similar? I'll have to read it more carefully.

I have a feeling this is somehow an answer to my question above. But as I'm really new to CRDTs I'm not sure.

russelldb commented 4 years ago

Something like an RGA is causal and can be used in a recursive map like CRDT.

JMLX42 commented 4 years ago

Something like an RGA is causal and can be used in a recursive map like CRDT.

So strings would be RGA<u8>?

russelldb commented 4 years ago

I guess they could be. It depends, are these strings expected to be long? Like are you really talking about "text"? There is a large, rich, complicated field on collaborative text editing, and it's quite the rabbit hole to go down. It depends on the behaviour the string should support. It could just be a LWW-register, it could be RGA or there are structures for text editing, which is...a lot of work.

JMLX42 commented 4 years ago

It depends, are these strings expected to be long? Like are you really talking about "text"?

If the reference for NoSQL-like documents is MongoDB and BSON, then a string has an i32 max length. I guess that qualifies as "long" (as opposed to SQL's VARCHAR(255) for example).

Still, in this kind of use case, the string does not support "collaborative text editing" features: it's an UTF-8 byte array and its entire value is set at once. So it is a register. And I would go as far as saying that, for that specific JSON/BSON/NoSQL document use case, all values except arrays and maps (so all JavaScript's "original" pass-by-copy types) are considered registers.

russelldb commented 4 years ago

So you don't need an RGA for strings, just a LWW register. In the Riak DT map (and other proprietary JSON like CRDTs I have worked on). Strings have been LWW registers. If you think of CRDTs as containers (Array, Map, Set) and values (Counter, Register) then you can make a JSON like thing with those alone (you don't even need the Set.)

JMLX42 commented 4 years ago

If you think of CRDTs as containers (Array, Map, Set) and values (Counter, Register) then you can make a JSON like thing with those alone (you don't even need the Set.)

Agreed.

I still wonder how the ops should be structured though. I can't see a way to have document level CRDT ops that would not store some kind of a path to the target inner field. Since ops are serializable, that path must be serializable too. AFAIK, ditto did it using JSON pointers (example here) and I still see no better way to do that.