nomad / cola

🥤 A text CRDT for real-time collaborative editing
https://crates.io/crates/cola
MIT License
493 stars 11 forks

Cola crashes when replaying multi-user trace #12

Closed by josephg 1 month ago

josephg commented 1 month ago

Hi! I'm running a script to import the concurrent editing traces into cola. The script is crashing.

I captured all the calls my script makes to cola to create a repro. My script uses cola in a bit of a weird way - in particular it calls clone() and fork(). That's computationally inefficient, but it should be correct.

#[test]
fn blah() {
    let mut r0 = cola::Replica::new(10, 0);
    r0.inserted(0, 29); // id: 10, len: 29

    let mut r1 = r0.clone();
    r1.inserted(29, 4); // id: 10, len: 33

    r0 = r0.fork(11);
    let e1 = r0.inserted(1, 1); // id: 11, len: 30

    r1.integrate_insertion(&e1).unwrap(); // id: 10, len: 34
    r1 = r1.fork(11); // id: 11

    r1.inserted(34, 16); // id: 11, len: 50
    r1.deleted(49..50);
}

Anyway, at this point the script crashes. It's hitting the unreachable!() line in fragments::Array<_>::idx_at_offset. The data structure is invalid in some way.

internal error: entered unreachable code
thread 'convert1::blah' panicked at /home/seph/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cola-0.4.5/src/run_indices.rs:585:13:
internal error: entered unreachable code
stack backtrace:
   0: rust_begin_unwind
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:652:5
   1: core::panicking::panic_fmt
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:72:14
   2: core::panicking::panic
             at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:146:5
   3: cola::run_indices::fragments::Array<_>::idx_at_offset
             at /home/seph/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cola-0.4.5/src/run_indices.rs:585:13
   4: cola::run_indices::fragments::Array<_>::split
             at /home/seph/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cola-0.4.5/src/run_indices.rs:618:42
   5: cola::run_indices::fragments::Fragments<_>::split
             at /home/seph/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cola-0.4.5/src/run_indices.rs:455:25
   6: cola::run_indices::ReplicaIndices::split
             at /home/seph/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cola-0.4.5/src/run_indices.rs:218:9
   7: cola::run_tree::RunTree::delete
             at /home/seph/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cola-0.4.5/src/run_tree.rs:234:17
   8: cola::replica::Replica::deleted
             at /home/seph/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cola-0.4.5/src/replica.rs:486:13
   9: paper_benchmarks::convert1::blah
             at ./src/convert1.rs:670:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

I'm using cola 0.4.5 from cargo.

noib3 commented 1 month ago

I'll open a PR to emit a better panic message, but the problem is not in cola, it's this line:

r1 = r1.fork(11); // id: 11

Here you're forking using an ID that's already in the original replica's history, but replica IDs are required to be globally unique across both space and time.

I've slightly refactored your test to avoid the last rX = rX.fork which I find a bit confusing. Hopefully the comments make this a bit clearer:

#[test]
fn blah() {
    let mut r0 = cola::Replica::new(10, 0);
    r0.inserted(0, 29);

    let mut r1 = r0.clone();
    r1.inserted(29, 4);

    r0 = r0.fork(11);

    // Insertion made by peer 11.
    let e1 = r0.inserted(1, 1);

    // Peer 10 integrates the insertion made by peer 11.
    r1.integrate_insertion(&e1).unwrap();

    // 💥 Peer 10 is forked with an ID already contained in its history.
    let mut r2 = r1.fork(11);

    r2.inserted(34, 16);
    r2.deleted(49..50);
}

Try replacing r1 = r1.fork(11) with r1 = r1.fork(12) and the test will pass.

noib3 commented 1 month ago

Btw, cola already contains all the scaffolding necessary to test concurrent traces in diamond-types format. Have a look at tests/concurrent_traces.

josephg commented 1 month ago
    // Insertion made by peer 11.
    let e1 = r0.inserted(1, 1);

    // Peer 10 integrates the insertion made by peer 11.
    r1.integrate_insertion(&e1).unwrap();

    // 💥 Peer 10 is forked with an ID already contained in its history.
    let mut r2 = r1.fork(11);

    r2.inserted(34, 16);

Ah!

The script I have works for yjs and automerge. As I understand it, Automerge keeps a known version vector for each replica. So in this case, it knows that the next edit made by replica 11 will come after all existing (known) edits, and hence this code works correctly. It sounds like cola can't handle this case in the same way.

(Fair enough - it shouldn't come up in normal usage.)
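For readers unfamiliar with the mechanism: the per-replica version vector josephg describes can be sketched as a map from replica ID to the highest sequence number seen from that replica. This is a toy model to illustrate the idea - not Automerge's or cola's actual code, and all names in it are made up:

```rust
use std::collections::HashMap;

/// Toy version vector: highest sequence number seen per replica ID.
#[derive(Default, Clone)]
struct VersionVector {
    seen: HashMap<u64, u64>,
}

impl VersionVector {
    /// Record an edit (replica_id, seq). Returns false if we've already
    /// seen this or a later edit from that replica.
    fn record(&mut self, replica_id: u64, seq: u64) -> bool {
        let entry = self.seen.entry(replica_id).or_insert(0);
        if seq > *entry {
            *entry = seq;
            true
        } else {
            false
        }
    }

    /// The next sequence number for `replica_id` is, by construction,
    /// ordered after every edit from that replica we know about.
    fn next_seq(&self, replica_id: u64) -> u64 {
        self.seen.get(&replica_id).copied().unwrap_or(0) + 1
    }
}

fn main() {
    let mut vv = VersionVector::default();
    vv.record(11, 1); // we've integrated edit 1 from replica 11
    // A fork reusing ID 11 could still be ordered correctly if the
    // implementation consults the vector for the next sequence number:
    assert_eq!(vv.next_seq(11), 2);
    // ...whereas blindly restarting at seq 1 would collide:
    assert!(!vv.record(11, 1));
}
```

With this bookkeeping, an implementation that consults `next_seq` can accept a fork with a previously seen ID, which is (roughly) why the same trace works in Automerge.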

Thanks for pointing out that code - I think I saw it at one point but I didn't quite understand how your tests work. I want to benchmark how cola compares to some changes I've been making lately in diamond types. I made a new btree-style data structure in DT inspired by your pair-of-vecs approach, and I want to see if I'm getting similar performance to cola. Cola is currently much faster than the traditional CRDT implementation in diamond types, but the difference almost completely disappears if I comment out your gtree cursor caching.

It looks like the concurrent traces aren't currently used by the benchmarking code. Is there a reason for that? Would it be hard to wire them up?
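As an aside, the cursor-caching effect mentioned above can be illustrated with a toy run list. This is an invented sketch of the general technique, not cola's gtree code: caching the last (run index, start offset) pair lets lookups near the previous position scan from the cache instead of from the front, which is what makes mostly-sequential editing traces fast.

```rust
/// Toy run list: lengths of consecutive text runs, plus a cached cursor.
struct Runs {
    lens: Vec<usize>,
    /// Cached (run index, text offset at the start of that run).
    cursor: Option<(usize, usize)>,
}

impl Runs {
    fn new(lens: Vec<usize>) -> Self {
        Self { lens, cursor: None }
    }

    /// Find the index of the run containing `offset`, scanning forward
    /// from the cached cursor when possible, else from the front.
    fn run_at(&mut self, offset: usize) -> usize {
        let (mut idx, mut start) = match self.cursor {
            // The cache is usable only if it starts at or before `offset`.
            Some((i, s)) if s <= offset => (i, s),
            _ => (0, 0),
        };
        while start + self.lens[idx] <= offset {
            start += self.lens[idx];
            idx += 1;
        }
        self.cursor = Some((idx, start));
        idx
    }
}

fn main() {
    // Runs covering offsets [0,3), [3,7), [7,12).
    let mut runs = Runs::new(vec![3, 4, 5]);
    assert_eq!(runs.run_at(3), 1);
    // The next lookup starts from the cached run instead of run 0.
    assert_eq!(runs.run_at(7), 2);
}
```

In a real tree the saving is larger: a cached cursor turns a root-to-leaf descent into (usually) a single neighbouring-leaf step for typing-style workloads.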

noib3 commented 1 month ago

It sounds like cola can't handle this case in the same way.

cola also keeps a version vector, but I think I'd rather use it to make the panic message clearer than to make the code above work, as I don't think dropping a Replica and re-creating it from another one with the original ID is something anyone actually does. 9 times out of 10 that's a bug, and I'd like to make that explicit.
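The kind of eager check being described could look something like this sketch (hypothetical code, not cola's actual implementation; `ToyReplica` and its fields are invented for illustration): fail fast at fork time with a descriptive error, instead of corrupting internal state and hitting a deep unreachable!() later.

```rust
use std::collections::HashSet;

/// Toy replica that only tracks which replica IDs appear in its history.
struct ToyReplica {
    id: u64,
    known_ids: HashSet<u64>,
}

impl ToyReplica {
    fn new(id: u64) -> Self {
        let mut known_ids = HashSet::new();
        known_ids.insert(id);
        Self { id, known_ids }
    }

    /// Integrating an edit from a peer adds its ID to our history.
    fn integrate_from(&mut self, peer_id: u64) {
        self.known_ids.insert(peer_id);
    }

    /// Fork with an eager, descriptive check instead of letting a
    /// duplicate ID corrupt the data structure downstream.
    fn fork(&self, new_id: u64) -> Result<ToyReplica, String> {
        if self.known_ids.contains(&new_id) {
            return Err(format!(
                "cannot fork replica {} with ID {}: that ID is already in \
                 this replica's history; replica IDs must be unique",
                self.id, new_id
            ));
        }
        let mut known_ids = self.known_ids.clone();
        known_ids.insert(new_id);
        Ok(ToyReplica { id: new_id, known_ids })
    }
}

fn main() {
    let r0 = ToyReplica::new(10);
    let r1 = r0.fork(11).unwrap(); // fresh ID: fine
    assert!(r1.fork(11).is_err()); // reused ID: caught immediately
}
```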

Is there a reason for that? Would it be hard to wire them up?

I haven't looked at the concurrent trace code in a while, but I remember that it was a lot harder to reason about compared to the equivalent for sequential traces. Also, at least in cola, sequential and concurrent traces have exactly the same performance profile since creating or integrating an edit are basically the only 2 things a user can do, so I just went with the simpler option.

P.S. I've been thinking about redesigning the core data structure of cola to make it easier to implement a few features I have in mind, and I'd be interested to see if DT can reach the same performance using an OpLog. Also, I read somewhere that you were writing a paper on DT. Is that true? Has it been published?

josephg commented 1 month ago

Makes sense. In that case I'll just use the sequential traces for benchmark comparisons while I tweak this data structure implementation.

DT is based on a different algorithm now (egwalker) that isn't a "traditional CRDT". In this approach the shape of the causal graph does matter. We get essentially instant, free merging for sequential traces, and no resident memory usage while editing. But that comes at the cost of more complex code and slower merge speed when merging long-running branches. The data structure is simply the "raw" operations (insert/delete + position) and "causal graph edges" describing when events happened relative to one another (kind of like git commit parents).
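The "raw ops plus causal edges" shape described above can be modeled minimally like this (an illustrative sketch under my own naming, not DT's actual types): each event records an operation and the indices of its causal parents, git-style, and concurrency falls out of ancestor checks on the graph.

```rust
/// Raw text operation: just insert/delete plus a position.
#[derive(Debug, Clone)]
enum Op {
    Ins { pos: usize, content: char },
    Del { pos: usize },
}

/// One event in the causal graph: the op plus the events that
/// happened before it (like a git commit's parents).
struct Event {
    op: Op,
    parents: Vec<usize>, // indices of parent events in the log
}

/// Append-only op log with causal edges.
#[derive(Default)]
struct OpLog {
    events: Vec<Event>,
}

impl OpLog {
    fn push(&mut self, op: Op, parents: Vec<usize>) -> usize {
        self.events.push(Event { op, parents });
        self.events.len() - 1
    }

    /// True if `a` is `b` or an ancestor of `b`. Two events are
    /// concurrent when neither is an ancestor of the other.
    fn is_ancestor(&self, a: usize, b: usize) -> bool {
        // Naive backwards walk over parents - fine for a toy log.
        let mut stack = vec![b];
        while let Some(e) = stack.pop() {
            if e == a {
                return true;
            }
            stack.extend(&self.events[e].parents);
        }
        false
    }
}

fn main() {
    let mut log = OpLog::default();
    let e0 = log.push(Op::Ins { pos: 0, content: 'a' }, vec![]);
    // Two edits made concurrently on top of e0, plus a later delete:
    let e1 = log.push(Op::Ins { pos: 1, content: 'b' }, vec![e0]);
    let e2 = log.push(Op::Del { pos: 0 }, vec![e0]);
    assert!(log.is_ancestor(e0, e1));
    assert!(!log.is_ancestor(e1, e2)); // e1 and e2 are concurrent
}
```

Sequential traces make every event the sole child of the previous one, which is the degenerate graph shape where merging is trivially cheap.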

And yeah, we wrote a paper on it, which is currently going through the review process for a conference. Take a read. I'd love your thoughts on it.

http://home.seph.codes/public/reg-text.pdf

I've benchmarked the "downstream" merge time in the paper, and compared it to my own equivalent CRDT code, and automerge and yjs. But I think cola is significantly faster than all of those implementations - though it's also a bit harder to use because it's lacking some of the ergonomics of those libraries. Like there's no string/rope implementation in cola, and there's no way to get all the changes since some specified version through the public API.

I bet there's still a lot of performance on the table for DT - though optimising it is more complex because there's more moving parts. (The algorithm also does some quite complex graph traversals, which in some cases take up to half the merge time).

Are you up for a video chat at some point? It might be interesting to chat in realtime about this stuff. If so, flick me an email - I'm me@josephg.com

noib3 commented 1 month ago

Sounds good, just sent you an email.

I'm considering this solved, but feel free to re-open it.