peer-base / js-delta-crdts

Delta State-based CRDTs in Javascript
192 stars 16 forks source link

Issues with rga and joins #38

Closed jimpick closed 5 years ago

jimpick commented 5 years ago

I did an extensive investigation [1] [2] [3] [4] into the random corruption problems in the peer-base collaboration-random test with high concurrency.

Occasionally, when we batch deltas together with .join(), the combined delta is created "backwards" ... and the process of applying the delta will corrupt the RGA.

I've crafted a simplified example here:

https://gist.github.com/jimpick/a6a75b7db561707486c669cfa935e966

I believe that there is a fundamental problem in that the .join() method doesn't have enough context to know what order the deltas are supposed to be combined in when creating batches. We haven't seen this popping up too often because most of the time, the deltas are 'connected', and the join algorithm can handle that. When the deltas are in separate 'fragments' separated by some distance, the algorithm makes a guess based on the CRDT ids and clocks, but that can be wrong in some situations, which results in corruption problems.

For peer-base, a short-term fix would be to disable the batching algorithm.

Some possible fixes:

pgte commented 5 years ago

@jimpick I've analysed your gist and here is what I found out:

Towards the end, in this line, you're joining the P and R deltas by themselves. When attempting to do that in peer-base, it would fail here (because in peer-base we are strict about the join. When attempting to join P and E, RGA throws an exception. This would lead to the P and R to be sent as separate deltas because of this error handling code in peer-base.

Also, R would not be applied before A because R is causally dependent on A (in peer-base, the clock for A would be included in the clock for R).

So I'm failing to see how something like this would happen in peer-base.

@jimpick thoughts?

CBaquero commented 5 years ago

Hi,

When coding the join for deltas its easy to assume that they will be causally consistent and make simplifications, that will break the join later if applied to deltas that are not consecutive. This wrong, and the join should be agnostic to consistency requirements among the parts to be joined. So its important to have that in mind when coding them initially. My guess is that its always possible to code them like that, even if sometimes in a less efficient way.

Regarding the RGA implementation, I don't know the details but this hidden (wrong) assumption is probably the cause of the issue. Ideally the fix should be to avoid the requirement for causal consistency, and not prohibiting joining deltas thats are not consecutive or fully causally consistent.

To sum up the assumptions:

This means that in general there is no consistency in the deltas and delta-groups in transit and that causal consistency is only enforced at the end-point replicas.

jimpick commented 5 years ago

@pgte I added the 'strict' check so I could relax it for the situation where we are creating a batch.

After sleeping on the problem, I realized that (I believe) all we need to do to create a "batch" of deltas (for RGA only) is to construct a Set of the vertexes we want to send from the base state, and then traverse the state, filtering on the set.

I don't know what to call that ... it's not really a .join() ... it's more like the "intersect" operation on a Set.

It should be trivial to code for RGA ... I'll try to make a prototype as a PR against peer-base - but peer-base shouldn't really know about the RGA internals, so if it works, we should figure out what the proper delta-crdts API would be.

pgte commented 5 years ago

@CBaquero thanks for your input!

@jimpick I'd like to see a more generic approach, so I'm going to attempt to make this RGA implementation not dependent on causal delivery. This PR (WIP) is for that.

pgte commented 5 years ago

@jimpick also, I added your more recent test case (the "pear" one) here: https://github.com/ipfs-shipyard/js-delta-crdts/pull/39/files#diff-172f1ba1943afcefe3572818181a93ff

pgte commented 5 years ago

Inspired by @CBaquero comment above, I created this PR to make the RGA implementation buffer unmergeable edges. This allows us to use a transport that does not provide causal delivery and, better yet, to arbitrarily batch deltas that don't contain the entire causal dependencies.

Also, this PR adds tests that exercise this capability by testing many permutations of the generated deltas.

These tests show that this issue can be closed.