orbitdb / field-manual

The Offical User's Guide to OrbitDB
208 stars 43 forks source link

Understanding the background of orbit-db #95

Open bronger opened 7 years ago

bronger commented 7 years ago

I try to understand how orbit-db actually works. Is there an every-growing chain of DB entries that every node maintains, and every node is publishing their HEAD to an IPFS pubsub room?

And at the same time, every node is also listening on that room, fetching the HEADs of others (or is there only one at a time?), merging it with theirs and pushing the result back?

haadcode commented 6 years ago

Sorry for the delay in answering @bronger. You got it almost right!

Every node maintains a chain of DB entries per database, ie. only the entries of the database their using (listening to). The ones listening to a database, receive those HEAD updates and upon receiving them, they merge the chain from the HEAD with their local chain. They don't need to push the results back to the network as the updates are the same for all --> same data for all.

Does that answer your question?

bronger commented 6 years ago

Thank you! So, you can guarantee that everyone has all data, but you do not guarantee that everyone has the same history (in Git terms)? Does this mean that you never do – also in Git terms – a fast forward, but only a merge?

haadcode commented 6 years ago

orbit-db has strong eventual consistency. If all nodes have all the updates (ie. the same head and they have fetched all the database entries), then yes, they all have the same history, too. What can't be guaranteed is that all nodes have the same data at any particular time.

bronger commented 6 years ago

I wonder how much of Git's model is in OrbitDB … maybe I assume too much. Anyway:

Assumed that node A and node B both committed updates to their local copy of the DB at the same time.

Node A publishes a new HEAD_A first, and Node B merges it with their current head (it must merge, because both histories diverged), which results in HEAD_B. It doesn't publish it, though, thus, both nodes have different heads in this moment.

There are some methods to become fully synchronised … which method does OrbitDB use?

haadcode commented 6 years ago

Node A publishes a new HEAD_A first, and Node B merges it with their current head (it must merge, because both histories diverged), which results in HEAD_B.

From the data structure perspective, it's a new HEAD only after B adds a third entry, upon which the third entry would have pointers to [HEAD_A1, HEAD_B1] giving us HEAD_B2 (assuming HEAD_B1 was the first add operation for B).

Does that clarify it?

For more details, you can take a look at https://github.com/haadcode/ipfs-log which is the operations log (a CRDT) in orbit-db and which handles the "chain" structure of a database.

There are some methods to become fully synchronised

Not sure what you refer to here, but FWIW eventual consistency in orbit-db means that the updates to a database will eventually reach all peers, but there's no way of forcing the peers to reach and have the same state at/for any particular time.

I hope this helps. Please do keep asking if it's not clear :)

bronger commented 6 years ago

The OrbitDB Log uses CvRDTs, doesn't it? In other words, the HEAD (representing the state of the DB) converges to a HEAD common to all nodes.

According to https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type#State-based_CRDTs, the merge function must be commutative, associative, and idempotent. Git's merges are not, but Orbit implements such a merge? Is this CvRDT-compatible merge the reason why HEAD_A and HEAD_B are the same?

I think about keeping Git-like (not really Git) trees synchronised on top of OrbitDB, or directly on IPFS pubsub. Since Git merges are not CvRDT-compatible, I must find a merge algorithm on top of it which is. My plan is to mark merges with other nodes' HEADs on the local node as “weak”. If a HEAD is published with pubsub, all of its merges become strong.

When another node publishes a new HEAD, the following list is worked off:

  1. If the other HEAD is part of my HEAD's history, do nothing.
  2. If my own HEAD is part of the other HEAD's history, fast-forward to the other HEAD.
  3. If fast-forward to the other HEAD would result in losing only weak merges, do it.
  4. Else, merge the other HEAD with mine, marking the merge as weak.

(2 is not really necessary, but may make things clearer.)

Does OrbitDB do something along these lines?

haadcode commented 6 years ago

The log in orbit-db is a CmRDT. Each update to the database (state modification) is its own operation in the log. The merge is indeed commutative, associative and idempotent and merge algorithm is: pick the higher lamport timestamp, if timestamp are the same (concurrent operation), pick the higher ID. This gives us a deterministic merge to achieve those properties.

To clarify on the HEAD discussion, I believe I'm talking about it as "latest entry == HEAD" and you're talking about "HEAD == state of the data structure" which are different but in orbit-db are effectively the same thing. In orbit-db's case, the log (ie. the head of the log) doesn't get sent to pubsub, but rather the added entry (ie. the latest entry which points to previous entries). There's no concept of HEAD internally in orbit-db nor in the log.

So, effectively what you describe in steps 1-4 is what happens in orbit-db's log when a database entry is added. From that perspective I believe what you're after is what orbit-db does currently.

I'm not sure what the function of the "weak" flag is in your data structure, is there a specific need for it? orbit-db doesn't have such notion. Furthermore, if you have a need to automatically merge the trees together, that's something orbit-db doesn't do atm and you'd have to write a new CRDT for it (I believe there is a known tree CRDT) or write a custom orbit-db store that can handle trees as database values. In this regard orbit-db is data agnostic. What orbit-db can do for you is to tell whether tree A or B is the latest, but it can't merge the trees.

bronger commented 6 years ago

Is the processing of new entries described in https://github.com/orbitdb/orbit-db-store/blob/master/src/Store.js?

I've had additional time to think about this.

The problem occurs, of course, if DBs diverge. You say that one orbit-db entry points to the previous one. Now, if a node publishes an entry with an outdated timestamp (i.e. older than the latest entry on other nodes), do the other nodes include it at the proper position in the entry chain and re-point the following ones (i.e. change the link to the predecessor because its hash changed)? I cannot find it in the source code, I'm afraid.

In the following, I use “HEAD”, “merge”, “rebase”, and “fast forward” as Git terms.

I need the concept of weak merges because out-of-the-box, Git merges are neither commutative nor associative. It's easy to make them commutative by saying the first parent must have the lowest ID etc (like you describe above with deterministic merges).

But I see no way to make them associative without rebasing, i.e. throwing away commits. Consider this ASCII art:

         5       7               
        / \     / \              
       4  |     |  6          8  
     / |  |     |  | \      / | \ 
    1  2  3     1  2  3    1  2  3

Thus, even with deterministic merges, the commits 1, 2, and 3 may be merged into three different merges 5, 7, and 8. This leads to continuous publishing/merging of the nodes because they can never agree on one HEAD => no convergence.

But if I mark 4, 6, and 8 as “weak”, I get convergence by my four rules in the previous posting, because nodes may take the other's HEAD => convergence.

I can restrict this (surely unfortunate) behaviour to those weak merges, and probably I can live with it.

aphelionz commented 4 years ago

Moving to the Field Manual repo for more discussion