google / btree

BTree provides a simple, ordered, in-memory data structure for Go programs.
Apache License 2.0
3.87k stars 414 forks source link

Persistent version of btree #10

Open keep94 opened 7 years ago

keep94 commented 7 years ago

This btree data structure is an ephemeral data structure. Changes to a particular btree instance are destructive. Would be nice if there was a persistent, immutable counter part to this data structure. Adding to or deleting from, the persistent version of btree creates a new btree with the original intact. We could have the ability to batch together mutations to cut down on the creation of intermediate instances.

A persistent version of btree would make transactional processing very easy as one could quickly revert to an older version. A persistent, immutable btree helps with concurrency too. On goroutine could read the btree while another is modifying it. A persistent btree helps with undo / redo operations too.

I figure the persistent version would be almost like the ephemeral version except instead of mutating nodes in place, we do copy-on-write. Doing a single mutation to a persistent btree, one add or one delete, would be cheap. The old and new versions would share many of the same nodes. Only log(n) nodes would be different.

With the persistent version of btree, there would be no free list or recycling of nodes. Each node in the btree would be immutable as it could be shared by many different versions of the same btree.

If this sounds interesting to you, I'd be willing to take on the work.

gconnell commented 7 years ago

While I think a persistent version might be nice, I'm not sure if it'd fit in with this project. I'd suggest initially forking the project and doing the changes, and if we find the codebase remains remarkably similar we can look at merging back together. Thoughts?

On Wed, Sep 21, 2016 at 10:24 PM, keep94 notifications@github.com wrote:

This btree data structure is an ephemeral data structure. Changes to a particular btree instance are destructive. Would be nice if there was a persistent, immutable counter part to this data structure. Adding to or deleting from, the persistent version of btree creates a new btree with the original intact. We could have the ability to batch together mutations to cut down on the creation of intermediate instances.

A persistent version of btree would make transactional processing very easy as one could quickly revert to an older version. A persistent, immutable btree helps with concurrency too. On goroutine could read the btree while another is modifying it. A persistent btree helps with undo / redo operations too.

I figure the persistent version would be almost like the ephemeral version except instead of mutating nodes in place, we do copy-on-write. Doing a single mutation to a persistent btree, one add or one delete, would be cheap. The old and new versions would share many of the same nodes. Only log(n) nodes would be different.

With the persistent version of btree, there would be no free list or recycling of nodes. Each node in the btree would be immutable as it could be shared by many different versions of the same btree.

If this sounds interesting to you, I'd be willing to take on the work.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/google/btree/issues/10, or mute the thread https://github.com/notifications/unsubscribe-auth/ABJ9tityyQFaj__e-6sbrU3xVUssItCcks5qsgL_gaJpZM4KDhCD .

keep94 commented 7 years ago

Sounds good to me.

By the way, this is a great library.

keep94 commented 7 years ago

In this post, I will discuss design goals and my high level plan for implementing:

Design goals:

Non goals:

High level design:

I plan to reuse the node struct for nodes in a persistent btree along with the code for the node struct. The difference with the persistent btree is that I will employ copy-on-write instead of modifying existing nodes in place. That is, whenever I need to change a node, I will make a copy of it first and then mutate the copy in place.

To prevent the creation of gratuitous intermediate objects, I will pass around a set of node pointers that have already been copied for write. I will employ copy-on-write only when a node is not already in the set of writable node pointers. The lifespan of this set is only for one group of batch changes. At the beginning of a batch change I allocate an empty version of this set as a local variable. I pass the set around as a parameter to all the functions as the batch changes are happening. When the changes are done, the set goes away. The need to pass the copy-on-write set around for changes on persistent btrees complicates the code reuse, but I get around this by making the operations on btree nodes higher order functions.

For each existing node function in btree, I make a helper function that does the same thing. The helper function takes the exact same parameters as the existing function, plus the writable node set, plus shims for making the recursive calls. The persistent versions will have to make recursive calls to themselves and update child pointers while the existing conventional functions will have to make calls to themselves modifying child pointers in place. Each existing node function will simply delegate to its helper function passing nil for the copy-on-write set along with mutable style shims for the recursive calls.

In addition, each existing node function will get a persistent version that also takes the same parameters plus the copy-on-write set. The persistent version will return the same values, plus a node pointer. The persistent version of each function will first ask the copy-on-write set for a writable copy of its receiver. Then it will call the helper function on that writable copy passing the copy-on-write set to it along with persistent style recursive call shims that will call the persistent version and updates child pointers. Finally, the persistent version will return the values that the helper function returned plus the writable copy of the receiver. I will be able to apply this work in a cookie cutter fashion to all the code of the *node struct.

Sample Work:

You can see the start of this work on the first three methods of the node struct here https://github.com/keep94/btree/commit/8e32a6dbe9a87fddb6d2c3ac59ee503163c39b04

Conclusion:

The use of the copy-on-write set allows me to reuse the existing node objects with minimal changes to implement persistent btree. While copy-on-write set will cause additional work for GC during mutations for persistent btree, I believe it is a fine trade off for maintaining the elegance and correctness of the existing btree code. Moreover, using a copy-on-write set when mutating persistent btrees will not affect the operation of existing btrees. The only changes to the code path of the existing btree code will be extra function calls for the helper functions and the shims for doing the recursion. Function calls are cheap these days, so I believe it will have minimal impact on performance for the existing code.

Please let me know whether or not this work is right for this google/btree repo. Thank you in advance.

keep94 commented 7 years ago

I have implemented what I proposed above and have written tests. However, instead of using shims, I decided to let nil copy-on-write sets allow requests for writable versions of nodes, in which case it just returns the node unchanged. In the ephemeral case, asking for a writable version of a node is the same as assigning a node pointer to itself.

This small change allowed me to leave the call sites of the recursive calls unchanged obviating the need for shims and greatly simplifying the changes I made. Without the shims, the size of the call stack is about the same as it was before.

I ran the benchmarks and compared with the master branch. Although I took great effort to disturb the existing code as little as possible, the benchmarks for insert and delete run 3 to 4 percent slower with my changes than before. That is about 675 ns/op on my small mac mini vs 650 ns/op. I am running go 1.1.2.

I attribute the slight slow down to the overhead of asking for writable copies of nodes. Although this is essentially a no-op in the ephemeral case, the function call to do it does cost something and the call is made as often as other mutating calls.

The 3 to 4% slow down may be a fair trade off considering that I am reusing most of the existing code. Code reuse is a good thing as it is easier to maintain. Let me know what you think.

https://github.com/keep94/btree/tree/persistent2

keep94 commented 7 years ago

API proposal:

type ImmutableBTree struct { }

// NewImmutable creates an empty tree with given degree. func NewImmutable(degree int) *ImmutableBTree

// ImmutableBTree has all the read-only methods that BTree has like Get(). // But it has no mutating methods such as Delete()

type Builder struct { }

// NewBuilder creates a new builder starting from tr. // The new builder has the same degree as tr. func NewBuilder(tr ImmutableBTree) Builder

// Builder has all the methods that BTree has including all the read methods plus...

// Set sets this builder to tr giving this builder the same degree as tr. func (b Builder) Set(tr ImmutableBTree) *Builder

// Build builds a tree with the same items and degree as this builder. func (b Builder) Build() ImmutableBTree

gconnell commented 7 years ago

Why not just do this:

type ImmutableBTree interface {
  ... all read methods of btree...
}

To make IBT:

func MakeImmutableBTree() {
  b := NewBTree()
  ... add stuff ...
  return ImmutableBTree(b)
}
keep94 commented 7 years ago

Likely btree will get new read methods. If we add new methods to the interface won't we break anyone that already implemented it?

On Friday, 7 October 2016, Graeme Connell notifications@github.com wrote:

Why not just do this:

type ImmutableBTree interface { ... all read methods of btree... }

To make IBT:

func MakeImmutableBTree() { b := NewBTree() ... add stuff ... return ImmutableBTree(b) }

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/google/btree/issues/10#issuecomment-252286629, or mute the thread https://github.com/notifications/unsubscribe-auth/ABrhcnO27kaAC9pvBq_a3H-B5XuXWNYbks5qxmg9gaJpZM4KDhCD .

gconnell commented 7 years ago

I'm happy to have the interface as part of the main package, if you want. We can put a comment in there that says "Implement this interface at your own risk, it'll track btree.BTree's read methods"

On Fri, Oct 7, 2016 at 9:49 AM, keep94 notifications@github.com wrote:

Likely btree will get new read methods. If we add new methods to the interface won't we break anyone that already implemented it?

On Friday, 7 October 2016, Graeme Connell notifications@github.com wrote:

Why not just do this:

type ImmutableBTree interface { ... all read methods of btree... }

To make IBT:

func MakeImmutableBTree() { b := NewBTree() ... add stuff ... return ImmutableBTree(b) }

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/google/btree/issues/10#issuecomment-252286629, or mute the thread https://github.com/notifications/unsubscribe- auth/ABrhcnO27kaAC9pvBq_a3H-B5XuXWNYbks5qxmg9gaJpZM4KDhCD .

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/google/btree/issues/10#issuecomment-252288767, or mute the thread https://github.com/notifications/unsubscribe-auth/ABJ9tm6eOufqYkQtRg-BEqYQp6pu2cozks5qxmolgaJpZM4KDhCD .

keep94 commented 7 years ago

Regarding your proposal of adding the interface, what becomes of the builder type?

On Friday, 7 October 2016, Graeme Connell notifications@github.com wrote:

I'm happy to have the interface as part of the main package, if you want. We can put a comment in there that says "Implement this interface at your own risk, it'll track btree.BTree's read methods"

On Fri, Oct 7, 2016 at 9:49 AM, keep94 <notifications@github.com javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:

Likely btree will get new read methods. If we add new methods to the interface won't we break anyone that already implemented it?

On Friday, 7 October 2016, Graeme Connell <notifications@github.com javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:

Why not just do this:

type ImmutableBTree interface { ... all read methods of btree... }

To make IBT:

func MakeImmutableBTree() { b := NewBTree() ... add stuff ... return ImmutableBTree(b) }

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/google/btree/issues/10#issuecomment-252286629, or mute the thread https://github.com/notifications/unsubscribe- auth/ABrhcnO27kaAC9pvBq_a3H-B5XuXWNYbks5qxmg9gaJpZM4KDhCD .

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/google/btree/issues/10#issuecomment-252288767, or mute the thread https://github.com/notifications/unsubscribe-auth/ABJ9tm6eOufqYkQtRg- BEqYQp6pu2cozks5qxmolgaJpZM4KDhCD .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/google/btree/issues/10#issuecomment-252289278, or mute the thread https://github.com/notifications/unsubscribe-auth/ABrhcr1MchHMK-XR7yupa0l1CT0qPWSLks5qxmqYgaJpZM4KDhCD .

keep94 commented 7 years ago

If I understand correctly, you suggest I wrap a normal btree instance in an immutable interface. Like unmodifiable list in Java, such a btree could still change as a goroutine could call write methods on the underlying btree pointer. What I need is an immutable btree that is guaranteed not to change.

With the interface I'd still have to do a full defensive copy of the underlying btree to make any changes or else I modify immutable btree. My changes will be small in comparison to size of tree. With a builder that uses copy on write. I only need copy log n nodes rather than whole tree to make modifications. With a truly immutable btree being read by multiple goroutines I can make changes using a local builder without acquiring a lock. The result is a brand new immutable btree that includes the changes while leaving the original unchanged.

On Friday, 7 October 2016, Travis Keep keep94@gmail.com wrote:

Regarding your proposal of adding the interface, what becomes of the builder type?

On Friday, 7 October 2016, Graeme Connell <notifications@github.com javascript:_e(%7B%7D,'cvml','notifications@github.com');> wrote:

I'm happy to have the interface as part of the main package, if you want. We can put a comment in there that says "Implement this interface at your own risk, it'll track btree.BTree's read methods"

On Fri, Oct 7, 2016 at 9:49 AM, keep94 notifications@github.com wrote:

Likely btree will get new read methods. If we add new methods to the interface won't we break anyone that already implemented it?

On Friday, 7 October 2016, Graeme Connell notifications@github.com wrote:

Why not just do this:

type ImmutableBTree interface { ... all read methods of btree... }

To make IBT:

func MakeImmutableBTree() { b := NewBTree() ... add stuff ... return ImmutableBTree(b) }

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/google/btree/issues/10#issuecomment-252286629, or mute the thread https://github.com/notifications/unsubscribe- auth/ABrhcnO27kaAC9pvBq_a3H-B5XuXWNYbks5qxmg9gaJpZM4KDhCD .

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/google/btree/issues/10#issuecomment-252288767, or mute the thread https://github.com/notifications/unsubscribe-auth/ ABJ9tm6eOufqYkQtRg-BEqYQp6pu2cozks5qxmolgaJpZM4KDhCD .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/google/btree/issues/10#issuecomment-252289278, or mute the thread https://github.com/notifications/unsubscribe-auth/ABrhcr1MchHMK-XR7yupa0l1CT0qPWSLks5qxmqYgaJpZM4KDhCD .

keep94 commented 7 years ago

If the extra API load of having the builder concerns you, I did think of a way that we can safely fold the Build and Set methods into BTree and do away with the Builder class completely, but the tradeoff is that the Build method will run in sub-linear time, average case, instead of constant time.

If we change the Build method so that it does deep copies of nodes that are reachable only by the BTree and do shallow copies of shared nodes, then Build becomes a truly read-only method and can be safely folded into BTree without confusion. The downside is that the Build method becomes much slower. For a BTree built from scratch where all nodes are reachable only from the btree, Build would run in O(n) time and be the exact same as a deep copy. Even with this modified build method, the BTree would still have to do copy-on-write when changing shared nodes or else it would mutate other ImmutableBTrees.

So, in conclusion folding Build and Set into the BTree class itself would reduce API load by getting rid of the Builder class, but getting a modified copy of an ImmutableBTree would take twice as long as having a builder class, best case, since in addition to doing copy-on-write as modifications happen, the Build method would have to do defensive copying of the same modified nodes yet again. In the worst case, Build would take O(n) time

keep94 commented 7 years ago

If it sounds good to you, I am willing to get rid of the Builder class. The API would look like this. Let me know what you think.

type BTree struct { }

// all the usual methods plus

// Build builds an immutable version of this tree func (b BTree) Build() ImmutableBTree { }

// Set changes this instance to have the same items and degree as tree. func (b BTree) Set(tree ImmutableBTree) *BTree { }

type ImmutableBTree struct { }

// ImmutableBTree has all the read-only methods of BTree.

chaitanya9186 commented 7 years ago

Is there any plan to implement persistent btrees?

qwertie commented 5 years ago

I implemented a "semi-persistent" in-memory B+ tree: npm install sorted-btree

It can be used in a mutable or immutable fashion (the immutable fashion is tree.with(key,value)/tree.without(key)), as you choose.