ipfs-inactive / dynamic-data-and-capabilities

[ARCHIVED] Dynamic Data and Capabilities in IPFS Working Group

CRDT: Snapshotting #14

Closed pgte closed 5 years ago

pgte commented 6 years ago

Have been discussing this A LOT in the last few days with @diffcunha, @olizilla, @achingbrain, @miyazono, @diasdavid and some others.

Problem

Right now, new peers coming into a CRDT will have to download the entire operations tree in order to be in sync.

Proposed solution

Basic idea

The basic idea is to somehow eventually include a snapshot in the local log, which is then naturally replicated to other peers. This snapshot log entry can then be used by other nodes for a quick catch-up without incurring the cost of replicating the entire log chain.

Besides the current state, this snapshot needs to somehow include the tree before it, but without the operations, making it a compression of the log tree that precedes it. In effect, the snapshot is the preceding log tree without the operation data.

A good analogy for this may be key frames in video, where a new node joining a live stream doesn't have to process the video from the beginning.
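
To make this concrete, here is a minimal sketch of what an operation entry and a snapshot entry could look like as DAG nodes (the field names are illustrative assumptions, not a proposed format):

```js
// Illustrative shapes only; field names are assumptions, not a spec.

// A regular operation log entry: links to its parent entry (or entries)
// plus the operation data itself.
const operationEntry = {
  parents: ['<CID of previous log entry>'],
  op: { type: 'insert', pos: 3, value: 'a' }
}

// A snapshot log entry: same parent links, but instead of an operation it
// carries the materialised state plus a compressed view of the log tree
// before it (the ancestry without the operation payloads).
const snapshotEntry = {
  parents: ['<CID of previous log entry>'],
  snapshot: {
    state: { /* current CRDT value */ },
    compressedLog: ['<CID>', '<CID>']
  }
}
```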

Processing a remote snapshot

Our thesis was that, when receiving this snapshot, a peer would have to check if it conflicts with any existing entry. If it doesn't conflict, the snapshot is integrated, and the peer doesn't need to ask for more log entries to proceed. If it conflicts, the snapshot can be discarded and we can proceed to process the parent log entries.
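
In rough JavaScript terms, the receive-side rule could read something like this (all of the helpers here, such as conflictsWithLocalLog and fetchParents, are hypothetical):

```js
// Sketch of the receive-side rule described above; every helper is hypothetical.
async function onRemoteSnapshotEntry (snapshotEntry, localLog) {
  if (!conflictsWithLocalLog(snapshotEntry, localLog)) {
    // No conflict: integrate the snapshot and stop here; there is no need
    // to ask for the older log entries it summarises.
    await localLog.integrateSnapshot(snapshotEntry)
    return
  }
  // Conflict: discard the snapshot and keep resolving its parent entries instead.
  const parents = await fetchParents(snapshotEntry)
  await localLog.process(parents)
}
```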

Producing a snapshot

A snapshot could then be produced at will by any peer as a kind of checkpoint for other peers. A peer could, for instance, decide to create a snapshot after X changes, or once a new peer enters the CRDT, with a given probability (to prevent a snapshot storm).

Also, for performance reasons, the snapshot would not sit directly inside the log tree; instead, a separate DAG node would be created for it and linked from the snapshot log entry. This way, when receiving it, nodes could choose whether or not to download the snapshot.
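
As a sketch of the producer side (assuming a js-ipfs-style dag.put call that stores an object and returns its CID, and a hypothetical localLog API):

```js
// Sketch: snapshot every N operations, storing the snapshot body as a
// separate DAG node and only linking to it from the log entry.
const SNAPSHOT_EVERY = 1000 // illustrative threshold

async function maybeSnapshot (ipfs, localLog) {
  if (localLog.opsSinceLastSnapshot() < SNAPSHOT_EVERY) return

  // Store the (possibly large) snapshot body on its own...
  const snapshotCid = await ipfs.dag.put({ state: localLog.currentState() })

  // ...and append a lightweight log entry that merely links to it, so that
  // receiving peers can decide later whether to fetch the body at all.
  await localLog.append({ parents: [localLog.head()], snapshotLink: snapshotCid })
}
```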


What do you think of this solution?

dmonad commented 6 years ago

I want to share some of my insights on the snapshot approach:

In Yjs@v12 I offered a storage layer that basically persists all changes to the shared document as separate entries in the local database (indexeddb or leveldb). But as soon as users make more than 10k changes to the document, the initial load time increases drastically.

In Yjs@v13 I use a snapshot approach. Changes to the shared document are maintained in a list that never grows beyond 100 entries. When the 101st entry is produced, Yjs saves a snapshot of the document and deletes the list of 100 document updates. The initial load is now much faster, though large snapshots are noticeable to the end-user (e.g. while typing). A possible fix for this would be to produce snapshots when the client is idle anyway.
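
That pattern is roughly the following (a sketch only, with a hypothetical persistence wrapper rather than the actual Yjs / y-indexeddb API):

```js
// Sketch of the "snapshot after N updates" pattern; `db` and `doc` are
// hypothetical stand-ins, not the real Yjs or y-indexeddb API.
const MAX_UPDATES = 100

async function persistUpdate (db, doc, update) {
  await db.appendUpdate(update)
  if (await db.countUpdates() > MAX_UPDATES) {
    // Replace the list of individual updates with one snapshot of the document.
    await db.saveSnapshot(doc.encodeState())
    await db.clearUpdates()
  }
}
```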

I'm currently thinking about how we could mirror the Yjs persistence approach to IPFS. The advantage of combining y-indexeddb with IPFS would be that the client can provide the data to the IPFS network when it is online.

I wonder if it is possible to collaboratively create a mutable tree. I assume that a mutable tree would work like this: we have a top-level node { data: update1, links: [], hash: 'a1' }, and an IPNS namespace '${name}' links to hash 'a1'. A client can update the mutable tree by updating the namespace '${name}'. But what happens if two users update the same namespace at the same time?

So I think we basically need a namespace for every user that participates, is that correct?

How efficient is it to update a namespace for every change on the document?

pgte commented 6 years ago

@dmonad I don't think IPNS would work for this, unless each peer uses a separate namespace, in which case you'd need a way to discover these.

In my idea, the local operation log is a chain of log entries, where each entry results in a hash and contains the hash of the previous entry (or entries) plus the operation data.

To mutate the CRDT, a peer appends to the local log an entry with [previousOperationHash, operationData]. This results in a hash, which is then broadcasted as the new HEAD.

Other peers, once they receive this update, fetch the missing operations. If any operation conflicts, they create a merge node ([[parentLogEntryHash1, parentLogEntryHash2]]), which also results in a hash that becomes the new HEAD. They process new entries in causal order.

My idea here for the snapshots is to append them to the local log as an operation-less log entry, being replicated as a normal log entry.
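
Put together, the append/merge flow might look roughly like this (the log and pubsub helpers are illustrative stand-ins, not a real API):

```js
// Sketch of the append / broadcast / merge flow described above.
async function applyLocalOperation (log, pubsub, operationData) {
  // Append [previousOperationHash, operationData] and broadcast the new HEAD.
  const headCid = await log.append({ parents: [log.head()], op: operationData })
  await pubsub.publish('my-crdt-topic', headCid.toString())
  return headCid
}

async function mergeConcurrentHeads (log, localHead, remoteHead) {
  // Two concurrent branches: append an operation-less merge entry whose
  // parents are both heads; its hash becomes the new HEAD.
  return log.append({ parents: [localHead, remoteHead] })
}
```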

I would use IPFS as the normal (offline and online) store, as IPFS works well as a persistent immutable content-addressable data store.

This is how I would translate a CRDT persistence into an IPFS structure. @dmonad I don't know much about the Y.js data structures — Do you think this makes sense?

dmonad commented 6 years ago

You don't really have to think about causality when working with Yjs. Yjs reconstructs causality pretty easily using Lamport Timestamps.

> I would use IPFS as the normal (offline and online) store, as IPFS works well as a persistent immutable content-addressable data store.

Makes sense. I didn't know that js-ipfs works offline yet. Cool!

How does this HEAD approach work? I assume it is similar to linking a hash to a namespace. What happens if two users do that at the same time? What happens if a user with a slow network connection overwrites HEAD and deletes previous updates? What happens if a user who produced changes offline joins the network? This requires the ability to somehow merge conflicting changes...

I am concerned about giving the responsibility of maintaining the graph to a single peer.

Of course we could implement some kind of merging approach when a peer detects that updates are missing in the remote store. But this seems pretty error-prone as conflicts will happen very often. This is why I want every user to have their own namespace.

> @dmonad I don't think IPNS would work for this, unless each peer uses a separate namespace, in which case you'd need a way to discover these.

I don't think of this as a big problem. A top-level node (addressed as roomname) could handle the list of available user-updates. When a user joins the network, it makes sure that namespace roomname/#userid is available on the top-level node. The conflicts that arise when setting the room-name should be easier to handle (e.g. by exponential back-off). From then on the user can publish its updates conflict-free. For a room with n users we have to check n namespaces.

pgte commented 6 years ago

> You don't really have to think about causality when working with Yjs. Yjs reconstructs causality pretty easily using Lamport Timestamps.

Makes sense, and it has the same effect as what I described, which means that if local causality is represented in Y.js (with vector clocks or whatever), then it's easily translatable into a DAG so it can be stored in a content-addressable store like IPFS. (A vector clock represents a tree of events, much like an IPFS DAG.)

> How does this HEAD approach work? I assume it is similar to linking a hash to a namespace. What happens if two users do that at the same time? What happens if a user with a slow network connection overwrites HEAD and deletes previous updates? What happens if a user who produced changes offline joins the network? This requires the ability to somehow merge conflicting changes...

Each node reports its own HEAD on the same pubsub channel. This means that, when receiving a HEAD from any peer, a peer has to check if that hash is included in the current log (easy to do when you have a DAG). If it isn't, it fetches all the missing entries.
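
For illustration, the check on receiving a remote HEAD could be sketched like this (again, the log helpers are assumptions):

```js
// Sketch: handling a HEAD announcement received on the pubsub topic.
async function onRemoteHead (log, remoteHeadCid) {
  if (await log.includes(remoteHeadCid)) {
    return // the remote HEAD is already part of our DAG; nothing to fetch
  }
  // Walk back from the remote HEAD, fetch the entries we are missing,
  // and apply them in causal order (creating a merge entry if needed).
  const missing = await log.fetchMissingAncestors(remoteHeadCid)
  await log.applyInCausalOrder(missing)
}
```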

> I am concerned about giving the responsibility of maintaining the graph to a single peer.
>
>   • The client may be unresponsive
>   • It does not work offline

No need to worry about that, no SPoF here. :)

> Of course we could implement some kind of merging approach when a peer detects that updates are missing in the remote store. But this seems pretty error-prone as conflicts will happen very often. This is why I want every user to have their own namespace.

By conflicts you mean concurrent edits? No problem with that, they're two branches of a DAG tree that eventually get merged together.

> I don't think of this as a big problem. A top-level node (addressed as roomname) could handle the list of available user-updates. When a user joins the network, it makes sure that namespace roomname/#userid is available on the top-level node. The conflicts that arise when setting the room-name should be easier to handle (e.g. by exponential back-off). From then on the user can publish its updates conflict-free. For a room with n users we have to check n namespaces.

I don't think there is a need to have one sub-room per user, as I described. :)

@dmonad makes sense?

If you're interested in some illustrations and perhaps a more detailed explanation, I talked about this mapping at a recent IPFS meetup; video here: https://www.youtube.com/watch?v=2VOF-Z-nLnQ&t=878s (CRDTs on IPFS).

dmonad commented 6 years ago

Thanks for the explanation @pgte! Sorry for not answering sooner.

I got the HEAD approach now. So this is specific to the pubsub implementation. Do you think there is a way to persist the HEAD(s) in the network, so we can use it without any collaborator participating?

pgte commented 6 years ago

@dmonad the HEAD is not specific to the pubsub implementation. It uses the generic pubsub mechanism (where peers subscribe to a given topic) for peers to know about each other's HEAD content ids.

I don't see a need to persist the HEADs in the network, since, when receiving a HEAD from a remote peer, a peer fetches and integrates the missing log entries, and, if it conflicts, creates a merge operation, and that operation is the new HEAD. If it doesn't conflict, it will assume the latest operation content id as the new HEAD.

dmonad commented 6 years ago

Something I find limiting in peerpad is that it is not possible for a new peer to open the document without having a peer already in the session. This is why I asked if it is possible to store the HEAD in the network.

Ideally we use pubsub only to notify users about changes. I think it would be great if we could store the document directly in the IPFS network - so any user can access it anytime.

Or do you already have a solution for the above constraint?

pgte commented 6 years ago

Ha, I understand your question now. I see two generic paths to achieve persistence:

  1. Replication node

Have a bunch of nodes that are permanently connected and replicate a CRDT (which is possible to do in peer-crdt without knowing the contents).

  2. Put it on the DHT

IPFS has a DHT, which can be used to make content available while the original provider is offline. In theory, a new participating node would be able to bootstrap from that. For me, this path is not yet totally clear, as we would require multiple writers to be able to write their state without overlapping each other, and for other peers to resume from those multiple states...

dmonad commented 6 years ago

My argument is that if we have a replication node anyway, it would be more efficient to use the Yjs storage model and let Yjs sync with the other peers.

It would be awesome though if we could use the existing IPFS technology and build a distributed app on top of it. With IPNS it is already possible to do that (though it might be inefficient - I don't know that). The advantage of using existing technology is that it will work with clients that don't know about CRDTs.

I will investigate DHT in IPFS more tonight. This sounds like a really good solution.

daviddias commented 6 years ago

New paper "Efficient Synchronization of State-based CRDTs": https://arxiv.org/abs/1803.02750

fritzy commented 6 years ago

To get back to the topic at hand, I think snapshots would be useful locally, but not globally. If we can make the DAG traversal of operations efficient on the local machine, what we actually need to optimize is the network. Since each DAG object must be retrieved serially, sharing DAG rollups would remove the need to do this.

If I join a pubsub channel with a particularly old HEAD, another client could tell me about a rollup, which includes all of the DAG ids (if not the contents). If it includes the contents, the joining client can retrieve that object in a single fetch and then merge it locally. If it does not include the contents, the client can fetch the individual objects in parallel and then resolve locally.
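
If a rollup is simply the flat list of DAG entry ids, the catch-up fetch could be parallelised along these lines (a sketch; the log helpers are hypothetical):

```js
// Sketch: catching up from a rollup (a flat list of entry CIDs) by fetching
// the missing entries in parallel instead of walking parent links serially.
async function catchUpFromRollup (log, rollupCids) {
  const missing = []
  for (const cid of rollupCids) {
    if (!(await log.includes(cid))) missing.push(cid)
  }
  // Fetch every missing entry concurrently, then resolve and merge locally.
  const entries = await Promise.all(missing.map(cid => log.fetchEntry(cid)))
  await log.applyInCausalOrder(entries)
}
```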

Snapshots remove the ability to merge in old branches later, and again, can be done locally for efficiency as needed but not shared. Otherwise, snapshots either need an authority (client/server somewhere that declares what the state is with authority) or a consensus (which is rather involved).

Rollups, however, remove the need for serial fetching, don't preclude local snapshots, and can be confirmed as authentic with signatures at another layer.

pgte commented 6 years ago

@fritzy what do you mean by "rollups", and how are they different from "snapshots"? If a rollup is a grouping of several operations bundled together, I think this should be treated as a sync optimisation and should be handled at the sync layer (graph swap or graph sync, whatever it's being called these days..).

The problem here is that this does not scale well over time, as the volume of operations grows without bound. (I think of this as a movie client having to read the stream from the start to be able to play at any position. Key frames solve this problem.)

I think that, much like the keyframe analogy, to solve this we would need to compress the operations in a way where we can afford to lose some information. But, unlike video, how do we allow concurrent edits in real time?

In this sense, snapshotting would be to expose the internal state in a deterministic way. For a snapshot to be deterministic, I think we don't require a random initiative followed by vote-and-consensus. Instead, we could come up with a rule that determines when a snapshot should happen. This could be something like "if the hash % some number === 0, then we expect a snapshot node right after".
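
Such a rule could be sketched like this (the modulus and the way the hash is turned into a number are placeholders):

```js
// Sketch of a deterministic "a snapshot is expected right after this entry" rule.
const SNAPSHOT_MODULUS = 64 // placeholder value

function expectsSnapshotAfter (entryCid) {
  // Derive an integer from the entry's content id (e.g. from the last bytes
  // of the multihash; cidToNumber is a hypothetical helper) and check divisibility.
  return cidToNumber(entryCid) % SNAPSHOT_MODULUS === 0
}
```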

As an optimisation, a snapshot node doesn't need to contain the snapshot itself, only a link to a snapshot, to be fetched if the node requires it.

Thoughts?

fritzy commented 6 years ago

What about snapshots that occur during a fork or netsplit? What about snapshots created by a malicious client?

I don't think you can have a shared snapshot without either authority or consensus.

fritzy commented 6 years ago

Rollups are either lists of dag nodes, or the dag nodes themselves as one object. Yes, it won't scale forever, but perhaps CRDTs are of limited practical use after a non-trivial number of entries. This might be an application layer problem since an application can choose an authority/author for a particular document to have the final say and archive it with a snapshot. If a snapshot includes a rollup from the beginning, it should be usable as well. But I don't see how we can use a snapshot by itself and not lose data, and you would have to explicitly trust the snapshot author for ALL of the data, and not just their contributions.

fritzy commented 6 years ago

I suppose if snapshots refer to a specific head, are signed, and you trust that user, they'd be okay (as you described, @pgte). It's just that you can't merge anything previous to the snapshot unless you retrieve the full DAG eventually. I suppose it's a workable trade-off in some cases. I'm jumping around on this, I know.

pgte commented 6 years ago

@fritzy no worries, this has been a long debate, and I think we're getting to the point.

1. Trust

As you say, snapshots are signed and, since the CRDT protocol does not solve the problem of malicious intent, snapshots should be trusted: if a node has write access and wants to, for instance, erase the entire document, it can. Of course, there could be non-malicious errors in a snapshot, but this also happens if the operation data is somehow corrupted because of a bug. The final implication is that, if a replica receives a snapshot from an author with write permissions, I see no option other than to accept its validity.

2. Concurrency

So I think that the problem that remains here is how to handle concurrency. Generally, I think that the solution here is to include the snapshots in the causal tree and provide a deterministically convergent way to reason about them. More specifically:

2.1 Quick catch-up

When a replica receives news of a new HEAD from a remote replica, it will start retrieving the missing operations, receiving them in reverse order of their applicability: it will receive the most recent operations first. If, instead of being an operation, the DAG node it receives is a link to a snapshot DAG node, it can decide whether or not to apply it by following this rule:

Is the snapshot a descendant of my latest HEAD? If it is, we can retrieve and apply the snapshot. If not, we ignore the snapshot.

This allows a replica to safely apply a snapshot only when it is more recent than the replica's current state and not concurrent with it. The implication is that, in order to determine this, the snapshot must contain a compressed representation of the causal tree. This could be achieved by either:

a) using vector clocks (every operation and snapshot now has to carry a vector clock), allowing you to quickly infer whether a snapshot is concurrent with, included in, or advances the current local HEAD. Or

b) the snapshot has a compressed representation of the causal tree inside it (going back to a maximum of X DAG nodes), allowing replicas to infer causality. If not enough information is provided to infer that (a node may be too far behind), then a replica must keep resolving the tree until it either a) resolves the whole thing or b) finds an applicable snapshot.

I personally prefer a).
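
With option a), the applicability check from 2.1 reduces to a vector clock comparison, roughly like this (a sketch assuming clocks are plain { replicaId: counter } maps):

```js
// Sketch: option a), deciding whether a snapshot may be applied by comparing
// vector clocks represented as plain { replicaId: counter } maps.
function dominates (a, b) {
  // true if clock `a` has seen at least everything clock `b` has seen
  return Object.keys(b).every(id => (a[id] || 0) >= b[id])
}

function shouldApplySnapshot (snapshotClock, localHeadClock) {
  // Apply only if the snapshot causally descends from (or equals) our HEAD,
  // i.e. it is neither behind us nor concurrent with us.
  return dominates(snapshotClock, localHeadClock)
}
```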

2.2 Recover from partition

What if a replica receives a new operation that depends on another operation that happened before a snapshot it already applied? How does this replica know whether that operation is concurrent (and therefore should be applied to the state) or, on the contrary, is already included in the recently applied snapshot and should therefore be ignored?

I think that the answer to this last question is, again, either using vector clocks or compressing the causal tree into the snapshot.

pgte commented 6 years ago

This recent discussion on the GC thread about causal consistency also applies to fast booting: https://github.com/ipfs/dynamic-data-and-capabilities/issues/2#issuecomment-392461046

pgte commented 6 years ago

Dumb question: isn't a valid snapshot just the internal state + the local vector clock? Wouldn't sending this be enough to bootstrap new replicas?

As I see it, they would be able to:

Thoughts?

pgte commented 6 years ago

@gyounes with your op-based CRDTs knowledge in hand, would you say something like this (see entry above) would work? My intuition tells me that yes, but it has fooled me before..

gyounes commented 6 years ago

I guess partially yes: it works, but it is not complete. 1) A peer would join to contribute, and therefore must have seen everything prior to their join event before they can make any changes. 2) Another thing is that, after a peer joins, membership metadata should reflect this join. For instance, when using vector clocks where each entry represents a peer, an additional entry should exist for this peer after the join.

To solve this correctly and safely, a new peer wanting to join has to:

@pgte sorry for the late reply, did I answer your question?

pgte commented 6 years ago

@gyounes I see, this join protocol guarantees that all peers know about the new peer before it starts contributing, so that its contributions can be integrated later by everyone. But in a network with high churn this may be hard to achieve... What if, when a peer joins, it's sufficient that one of the nodes acknowledges it and integrates the new peer into its “causal stability table”? This way, this “parent node” would not trim history before the new peer's state. One problem might be that, if the “parent peer” dies, this new node may be in trouble...

pgte commented 5 years ago

Right now we opted for δ-state CRDTs to avoid having to deal with this issue. We may reopen this thread in the future if we decide to use op-based CRDTs.