google / btree

BTree provides a simple, ordered, in-memory data structure for Go programs.
Apache License 2.0
3.9k stars 414 forks source link

Q: how do I delete a range? #18

Open glycerine opened 7 years ago

glycerine commented 7 years ago

Suppose I have a simple set of intervals whose end time (endx) is stored in a btree. Lets say the interval represents the time a user's compute job completed, so I store the userid and a jobid with it too. I'll use the jobid (assume they are all unique) to break ties so that all records stick around.

Then I want to delete all those records with time <= some end time. Here's what I try:

package main

import (
    "fmt"

    "github.com/google/btree"
)

type job struct {
    endx  int64
    user  string
    jobid string
}

func (a *job) Less(than btree.Item) bool {
    b := than.(*job)
    if a.endx == b.endx {
        // simply distinguish the order,
        // so we get to keep both in the tree.
        return a.jobid < b.jobid
    }
    return a.endx < b.endx
}

func main() {
    tree := btree.New(2)

    tree.ReplaceOrInsert(&job{endx: 1, jobid: "hij"})
    tree.ReplaceOrInsert(&job{endx: 1, jobid: "abc"})
    tree.ReplaceOrInsert(&job{endx: 1, jobid: "def"})

    tree.ReplaceOrInsert(&job{endx: 2, jobid: "hij"})
    tree.ReplaceOrInsert(&job{endx: 2, jobid: "abc"})
    tree.ReplaceOrInsert(&job{endx: 2, jobid: "def"})

    tree.ReplaceOrInsert(&job{endx: 3, jobid: "hij"})
    tree.ReplaceOrInsert(&job{endx: 3, jobid: "abc"})
    tree.ReplaceOrInsert(&job{endx: 3, jobid: "def"})

    fmt.Printf("len = %v\n", tree.Len())

    // delete through the 2s
    tree.AscendGreaterOrEqual(&job{endx: 0}, func(item btree.Item) bool {
        j := item.(*job)
        if j.endx <= 2 {
            fmt.Printf("delete pass deletes %#v\n", item)
            tree.Delete(j)
            return true
        }
        fmt.Printf("delete pass ignores %#v\n", item)
        return false
    })

    fmt.Printf("\n\n tree after deleting everything with "+
        "endx <=2, is size %v; should be size 3\n",
        tree.Len())
    tree.AscendGreaterOrEqual(&job{endx: 0}, func(item btree.Item) bool {

        fmt.Printf("%#v\n", item)
        return true
    })

}

here's what I get. This strange result presents in the upstream petar/gollrb as well.


go run ex1.go                                                                                      
len = 9                                                                                            
delete pass deletes &main.job{endx:1, user:"", jobid:"abc"}                                        
delete pass deletes &main.job{endx:1, user:"", jobid:"hij"}                                        
delete pass deletes &main.job{endx:2, user:"", jobid:"abc"}                                        
delete pass ignores &main.job{endx:3, user:"", jobid:"abc"}                                        

 tree after deleting everything with endx <=2, is size 6; should be size 3                         
&main.job{endx:1, user:"", jobid:"def"}                                                            
&main.job{endx:2, user:"", jobid:"def"}                                                            
&main.job{endx:2, user:"", jobid:"hij"}                                                            
&main.job{endx:3, user:"", jobid:"abc"}                                                            
&main.job{endx:3, user:"", jobid:"def"}                                                            
&main.job{endx:3, user:"", jobid:"hij"}                                                            
glycerine commented 7 years ago

For the sake of generating API ideas, consider how this works in a another red-black tree package. I can advance the iterator before deleting behind me:

package main
import (
    "fmt"
    "github.com/glycerine/rbtree"
)

type job struct {
    endx  int64
    user  string
    jobid string
}

func newTree() *rbtree.Tree {
    return rbtree.NewTree(
        func(a1, b2 rbtree.Item) int {
            a := a1.(*job)
            b := b2.(*job)
            if a.endx == b.endx {
                switch {
                case a.jobid == b.jobid:
                    return 0
                case a.jobid < b.jobid:
                    return -1
                case a.jobid > b.jobid:
                    return 1
                }
            }
            return int(a.endx - b.endx)
        })
}

func main() {
    tree := newTree()

    tree.Insert(&job{endx: 1, jobid: "hij"})
    tree.Insert(&job{endx: 1, jobid: "abc"})
    tree.Insert(&job{endx: 1, jobid: "def"})

    tree.Insert(&job{endx: 2, jobid: "hij"})
    tree.Insert(&job{endx: 2, jobid: "abc"})
    tree.Insert(&job{endx: 2, jobid: "def"})

    tree.Insert(&job{endx: 3, jobid: "hij"})
    tree.Insert(&job{endx: 3, jobid: "abc"})
    tree.Insert(&job{endx: 3, jobid: "def"})

    show(tree)

    // delete through the 2s

    for it := tree.Min(); !it.Limit(); {
        item := it.Item().(*job)
        if item.endx <= 2 {
            fmt.Printf("delete pass deletes %#v\n", item)

            next := it.Next()
            tree.DeleteWithIterator(it)
            it = next
        } else {
            fmt.Printf("delete pass ignores %#v\n", item)
            break // we can stop scanning now.
        }
    }
    show(tree)
}

func show(tree *rbtree.Tree) {
    fmt.Printf("showing tree: len = %v\n", tree.Len())
    for it := tree.Min(); !it.Limit(); it = it.Next() {
        item := it.Item().(*job)
        fmt.Printf("%#v\n", item)
    }

}

output:

showing tree: len = 9                                                                              
&main.job{endx:1, user:"", jobid:"abc"}                                                            
&main.job{endx:1, user:"", jobid:"def"}                                                            
&main.job{endx:1, user:"", jobid:"hij"}                                                            
&main.job{endx:2, user:"", jobid:"abc"}                                                            
&main.job{endx:2, user:"", jobid:"def"}                                                            
&main.job{endx:2, user:"", jobid:"hij"}                                                            
&main.job{endx:3, user:"", jobid:"abc"}                                                            
&main.job{endx:3, user:"", jobid:"def"}                                                            
&main.job{endx:3, user:"", jobid:"hij"}                                                            
delete pass deletes &main.job{endx:1, user:"", jobid:"abc"}                                        
delete pass deletes &main.job{endx:1, user:"", jobid:"def"}                                        
delete pass deletes &main.job{endx:1, user:"", jobid:"hij"}                                        
delete pass deletes &main.job{endx:2, user:"", jobid:"abc"}                                        
delete pass deletes &main.job{endx:2, user:"", jobid:"def"}                                        
delete pass deletes &main.job{endx:2, user:"", jobid:"hij"}                                        
delete pass ignores &main.job{endx:3, user:"", jobid:"abc"}                                        
showing tree: len = 3                                                                              
&main.job{endx:3, user:"", jobid:"abc"}                                                            
&main.job{endx:3, user:"", jobid:"def"}                                                            
&main.job{endx:3, user:"", jobid:"hij"}      
gconnell commented 7 years ago

Great question, and sorry I didn't see this sooner.

As you've found, currently our iterator functions aren't resilient to mutations during iteration. We might work on this later (and happy to accept pull requests ;), but to get things going right away, you could do this:

todelete := []btree.Item{}
tree.Ascend(..., func(x btree.Item) bool {
  todelete = append(todelete, x)
  return true
})
for _, x := range todelete {
  tree.Delete(x)
}

It's not as efficient as it could be, and it requires extra storage, but it also actually works right now :)

glycerine commented 7 years ago

For efficiency, the DeleteMin() method turned out to work well. This is isn't a general solution if you want to delete a range that starts in the middle of you elements, but it worked for the case above. As in the example above, where you want to delete all elements belong some time threshold, with a callback for each deletion, I'm doing:

func deleteThrough(t *btree, x int64, callme func(goner *mjob, through int64)) {                              

    for {                                                                                                       
        min := t.Min()                                                                                          
        if min == nil {                                                                                         
            return                                                                                              
        }                                                                                                       
        cur := min.(*mjob)                                                                                      
        if cur.endx > x {                                                                                       
            return                                                                                              
        }                                                                                                                                                            
        t.DeleteMin()                                                                                           
        if callme != nil {                                                                                      
            callme(cur, x)                                                                                      
        }                                                                                                       
    }                                                                                                           
}  

At least I think (hope) this is efficient. If its not, please let me know! :-) Typically a tree will keep a pointer to its minimum element... but if its log(n) to find the smallest element, this might be a problem...

glycerine commented 7 years ago

Actually its worse than I expected.

https://github.com/google/btree/blob/master/btree.go#L153

makes it look like every delete is O(n) in the size of the tree n, as it does a full copy of the entire tree.

This is really, really, not what I would expect from a balanced binary tree implementation. Deleting the entire tree becomes an O(n*n) operation.

This is important. I'll make a separate issue.

gconnell commented 7 years ago

Nope, your original supposition was correct: it's O(logn) for a delete.

https://github.com/google/btree/blob/master/btree.go#L153 is the code called to delete an item from a single node, which is O(degree) of the btree (items will be max 'degree' in length, one itemset per node).

Your code is nicely efficient, but could be slightly moreso if you expect to delete more than 2 entries in your loop. In that case, you could blindly issue the delete, then put it back when you get to the one you care about.

var item btree.Item
for item = t.DeleteMin(); item.(*mjob).endx <= x {}
t.Insert(item)

Calling Min, which is O(logn), then Delete, which is O(logn) effectively doubles your processing time, unless you expect to only delete zero or one element each time.

cstockton commented 6 years ago

I ran into having to clear the tree myself today, created #20 which if it's not accepted it's small enough to copy pasta if it will help you.