Implement databases over IPFS using Persistent Balanced Trees

eladve commented 8 years ago

Here are some notes for implementing a more-scalable version of Orbit-DB using persistent balanced trees. This came out of a discussion with @haadcode . Before implementing it, I'd love to get the community's feedback.

Currently, Orbit-DB operates as a key-value storage (i.e. a Python Dictionary / Go Mapping / Javascript Object) over IPFS. It supports insert, change, delete and lookup. The running time for all change operations is O(1) (by simply adding a node to the changelog and publishing the new head), but the lookup time is O(m), where m is the number of previous operations that already happened in the DB. In principle, the total cost of all lookup times at a particular node can be made to be O(m) by keeping an up-to-date version of the DB locally, and updating it according to the changelog, but this is unpleasant, and requires local storage: it requires us to keep track of everything that's happening in the DB, just to be able to do lookups: this is unscalable. For a chat application it's actually fine, since the user is presumably interested in anything that gets posted to the channel, but for other applications it won't do.

My alternative suggestion is to implement the DB as a semi-balanced search tree (e.g. AVL, 2-3, Red-Black, etc), indexed by the key, using Path Copying (I explain the details below.) With this approach, lookups will always be sent to the remote IPFS storage, and a client does not have to store anything locally. All operation times will take O(logn) time, where n is the number of keys currently in the DB. This approach will also allow us to perform look-ups in previous versions of the database (just like the current version allows). Also, when we implement this approach, the resulting codebase will provide a good template or proof-of-concept for DBs that support other operations. For example, we can create a database over IPFS that supports prefix searches (e.g. for searching for text in a chat log ) by implementing a suffix tree style data structure, with Path Copying, over IPFS. Also, since the operation times will now be scalable, we won't have the separate orbit-db into "channels": instead, we can keep the entire DB at one place, and use the channel name as part of the search key.

The implementation I suggest will work as follows: The DB will always look like a tree (e.g. AVL Tree): There will always be a "root node", which is the most recently published root on the pubsub system for this channel. Each node will have a key, key_0, a value, val_0, and two IPFS pointers to children (any of which can be null). All keys on the right side will be larger than key_0, and all keys on the left side will be smaller than key_0. For a lookup operation on the most up-to-date version we just start from the pointer to the root node, and we make our way down in binary search fashion until we find the requested key or discover that it does not exist in the tree. Now for updates: if we wish to update the value related to a certain key, we find the node holding this key, call it v. Obviously we can't change v, because we are working in an immutable system. Instead, we create a new copy of v, call it v',which is identical to v, except that the value is changed to the new value. Now we need to update the parent of v, call it u, to point to v' instead of v. But, of course, we can't do that, because IPFS is immutable. So we take u, and make a copy of it, u', where u' is identical to u, except that the pointer to v is replaced by a pointer to v'. We go up like this in the tree, until getting a new copy of the root, r'. Now we take r', we make it the new root, and we publish the IPFS pointer to r' in the pubsub system as the new root. What do we do with the old root r? Well, there is still a pointer to it in the pubsub system, as the old root, so anyone who wishes to access the old version of the tree can still do that by going through r. We should make sure that timestamps and other metadata is published along with r' (and maybe with each vertex), in order to aid people at finding the version they want, in case they want it. For the chat application, and for most other applications, we'll actually only be interested in the newest available root. Timestamps and metadata will also help clients know the freshness of the data they're getting. Summing up, the change operation took O(height) time, and created O(height) new IPFS blocks. In an AVL tree, height=O(logn), so we're good.

How do we implement insert and delete operations? We do them like in the underlying tree (say AVL). We find the node to delete, or the place where the insertion should occur, and we do a classical AVL operation, except that each time we need to modify a node, we instead create a new copy of it, modified in the appropriate places, and we make sure that the parent points to the new copy rather than the old one. One can check that because an AVL operation operated along a root-to-node path anyway, only O(logn) nodes will be copied, and only O(logn) operations will be needed. (This also follows from the classical performance analysis of the Path Copying method.)

For performance sake, we can store a whole chunk of the tree in one IPFS object, instead of just a single node: this might lower overheads.

To know more about this approach to implementing data structures, here are some resources. What I described here basically amounts to an implementation of partially-persistent data structures. While the original motivation behind persistent data structures was to allow access to old versions, that academic field of knowledge can also be re-appropriated to implementing data structures over immutable storage, where we simply are not allowed to delete old data. Research on persistent data structures was originally motivated by functional programming, because all objects in purely functional languages are immutable. This once again shows us how blockchain and decentralized systems are a perfect fit to functional programming: functional programming is great for security and for avoiding bugs, it is great for program verification, it's great for avoiding weird synchronicity bugs in distributed systems, and it naturally makes objects immutable, which means that any data structure written in a functional program is automatically ready for IPFS. For example, any implementation of the data structure I describe above in a functional language will automatically be trivially portable to IPFS; this is not necessarily true for a Go/Python/JS implementation. One more thing: another setting where immutability was needed is flash memory and other kinds of write-only memory (or limited-rewrite memory, like SSD drives and USB drives). This means that the literature on data structures on flash-memory/SSD might contain interesting ideas for implementing various data structures on IPFS.

Here are some myriad links about the topic of Persistent Data Structures: https://en.wikipedia.org/wiki/Persistent_data_structure Lectures 1,2,3 at https://courses.csail.mit.edu/6.851/spring14/lectures/ http://www.cs.tau.ac.il/~haimk/papers/persistent-survey.ps https://www.reddit.com/r/rust/comments/3e000u/survey_persistent_data_structures/ http://www.cs.cmu.edu/~rwh/theses/okasaki.pdf http://www.cs.tau.ac.il/~haimk/papers/journal2.pdf https://www.cs.cmu.edu/~sleator/papers/making-data-structures-persistent.pdf More references can be found here (search for persistence, and expand the link): http://jeffe.cs.illinois.edu/teaching/datastructures/schedule.html

A final note I don't know how the native CRDT system implemented in IPFS interacts with the data structure I describe. I doubt it will harm correctness, but it might harm efficiency. More thought should be given to this.

haadcode commented 8 years ago

@eladve thanks for posting this publicly for discussion! CC'ing people who might have input. @jbenet @whyrusleeping @nicola @mikolalysenko have discussed similar ideas in the past.

make-github-pseudonymous-again commented 7 years ago

Could also look at VEB trees or B-trees to minimize the number of blocks that need to be loaded from the network which I guess will be the bottleneck.

ianopolous commented 7 years ago

I've implemented a merkle-btree (balanced) in IPFS and have been using it in Peergos for over a year. There is a synchronous version here: https://github.com/ianopolous/merkle-btree

And an asynchronous version here (we use this one after compiling it to JS): https://github.com/Peergos/Peergos/tree/master/src/peergos/shared/merklebtree

eladve commented 7 years ago

@ianopolous Is there a text description of the merkle-btree somewhere? I want to understand what it does, and hope not to have to go over java source code.

Say, does your data structure implement a key-value storage, or something else? The description only says that it is "content addressed", not that it is "key-addressed" or something to the effect.

ianopolous commented 7 years ago

@eladve It is a standard btree algorithm. It maps byte arrays to byte arrays, so key-value storage. We use it to map 32 byte keys to IPFS hashes (Multihashes). It's done in such a way that pinning the root in IPFS will pin the entire tree, and the things the hashes point to. We use it with a branching factor of 16, which seems to work well. I should write some proper documentation.

make-github-pseudonymous-again commented 7 years ago

pinning the root

@ianopolous does that mean that storing the root on your computer implies storing the entire tree?

ianopolous commented 7 years ago

Nope, only if you pin it recursively.

mikolalysenko commented 7 years ago

Realistically I think the best option is going to be a B+-tree with a fixed block size computed from the size of an IPLD object/query key. Things like VEB trees/fusion trees/whatever don't make any sense in the IPLD model since we know the size of a block a-priori.

Could just try knocking out something simple and see how it performs.

ianopolous commented 7 years ago

@mikolalysenko I'm curious why you suggest a b+-tree over a btree? Are you mainly considering write once, read many loads? If a many-write many-read load is a target, I would expect a btree to have better write throughput.

mikolalysenko commented 7 years ago

I haven't thought it all the way through, either would probably work fine but experiments would have to be done to tune the implementation. I was mainly trying to make the point that VEB-trees solve a problem that doesn't make sense in the IPLD model.

daviddias commented 7 years ago

//cc @pgte who has implemented https://github.com/ipfs-shipyard/ipfs-level, https://github.com/ipfs-shipyard/ipfs-live-db and https://github.com/ipfs-shipyard/tevere

pgte commented 7 years ago

@eladve super interesting! The previous cases @diasdavid mentioned have a local cache containing the entire database, but I think that the kind of remote indexing you described is super interesting for when all data can't be local.

@eladve I'm also interested on how to manage concurrency: when 2 or more nodes concurrently update the data. By what you described, each will create a different root node, and depending on which node you read from, you will miss some updates. — or is there some type of merging strategy for this index? Do you have thoughts on this?

ipfs / notes

Implement databases over IPFS using Persistent Balanced Trees #161