rosedblabs / rosedb

Lightweight, fast and reliable key/value storage engine based on Bitcask.
https://rosedblabs.github.io
Apache License 2.0

Iterating over all KVs by their insertion order #271

Closed amityahav closed 1 year ago

amityahav commented 1 year ago

Following this, where I explained that there may be a lot of random I/O: I suggest adding functionality to iterate over all the KVs in the DB more efficiently. My proposal is to modify the B-tree implementation of the in-memory index so that all KV pairs are chained in a doubly linked list. The list is built at insertion time: each newly inserted KV pair is linked to the previously inserted one (also handling the case of overwrites). We also keep two pointers, to the head and the tail of the list, so the user can iterate in ascending/descending insertion order. With that in place, reads from the WAL files during iteration become mostly sequential, which makes better use of the kernel's page cache and of the application-level WAL blockCache.
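
A minimal sketch of the idea, assuming the index wraps google/btree; all type, field, and method names below are illustrative, not rosedb's actual internals:

    package main

    import "github.com/google/btree"

    // record is an index entry: the key, the position of the value in the
    // WAL, and prev/next links chaining records in insertion order.
    type record struct {
        key        []byte
        pos        int64 // placeholder for a WAL position
        prev, next *record
    }

    func (r *record) Less(than btree.Item) bool {
        return string(r.key) < string(than.(*record).key)
    }

    // orderedIndex wraps the btree and keeps the head/tail of the
    // insertion-order list.
    type orderedIndex struct {
        tree       *btree.BTree
        head, tail *record
    }

    func newOrderedIndex() *orderedIndex {
        return &orderedIndex{tree: btree.New(32)}
    }

    // Put inserts or overwrites a key. An overwritten record is unlinked
    // from its old list position and re-linked at the tail, so the list
    // always reflects the latest insertion order.
    func (idx *orderedIndex) Put(key []byte, pos int64) {
        rec := &record{key: key, pos: pos}
        if old := idx.tree.ReplaceOrInsert(rec); old != nil {
            idx.unlink(old.(*record))
        }
        rec.prev = idx.tail
        if idx.tail != nil {
            idx.tail.next = rec
        } else {
            idx.head = rec
        }
        idx.tail = rec
    }

    func (idx *orderedIndex) unlink(r *record) {
        if r.prev != nil {
            r.prev.next = r.next
        } else {
            idx.head = r.next
        }
        if r.next != nil {
            r.next.prev = r.prev
        } else {
            idx.tail = r.prev
        }
    }

    // AscendByInsertion walks records oldest-first, so WAL positions come
    // out roughly in append order and value reads are close to sequential.
    func (idx *orderedIndex) AscendByInsertion(fn func(key []byte, pos int64) bool) {
        for r := idx.head; r != nil; r = r.next {
            if !fn(r.key, r.pos) {
                return
            }
        }
    }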

roseduan commented 1 year ago

We are using the google/btree package, and I think it would be a little difficult to change the btree code itself.

#265 only asks for iterating over all keys, so we can just call Ascend to get all the keys directly. That will be efficient, since they are all in memory.
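
As an illustration of why a keys-only scan is cheap, here is a minimal sketch against an in-memory google/btree index; the item type is hypothetical, not rosedb's actual internals:

    import "github.com/google/btree"

    // indexItem is a hypothetical in-memory index entry: the key plus the
    // record's position in the WAL.
    type indexItem struct {
        key []byte
        pos int64
    }

    func (i *indexItem) Less(than btree.Item) bool {
        return string(i.key) < string(than.(*indexItem).key)
    }

    // allKeys walks only the in-memory btree; no WAL reads happen, which is
    // why iterating keys alone is fast.
    func allKeys(tree *btree.BTree) [][]byte {
        keys := make([][]byte, 0, tree.Len())
        tree.Ascend(func(i btree.Item) bool {
            keys = append(keys, i.(*indexItem).key)
            return true
        })
        return keys
    }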

amityahav commented 1 year ago

Currently the Ascend function also reads values from the WAL, but if you omit that it's indeed fast. As for the use case where you actually do want to iterate over all KV pairs, maybe it's worth the effort to optimize its performance. It will probably require an in-house implementation of the B-tree, or at least a fork of Google's. Let me know what you think.

roseduan commented 1 year ago

Yes, if we want to iterate all k/v pairs in the db, it will read the WAL to get the data.

Umm. But I am not sure whether someone needs it; maybe we can add it depending on the actual demand.

amityahav commented 1 year ago

I think I'll come up with a POC for that anyway, just to prove to myself that it's indeed faster.

amityahav commented 1 year ago

@roseduan a use case where I find this feature useful is the merge operation: instead of iterating over all of the WAL files, you can iterate only over the valid values efficiently, which would reduce the merge time when the WAL files are large.

roseduan commented 1 year ago

> @roseduan a use case where I find this feature useful is the merge operation: instead of iterating over all of the WAL files, you can iterate only over the valid values efficiently, which would reduce the merge time when the WAL files are large.

Oh, why? If we iterate the btree to get all values, it will block reads and writes.

amityahav commented 1 year ago

Yes, you are right, I totally missed that.

amityahav commented 1 year ago

But also, if you merge while writes are happening concurrently, you only merge a snapshot in time, and there is new data that is not part of the merge process and needs to be appended to the new data files. How do you handle that?

roseduan commented 1 year ago

> But also, if you merge while writes are happening concurrently, you only merge a snapshot in time, and there is new data that is not part of the merge process and needs to be appended to the new data files. How do you handle that?

The data newly written while merging will be treated as normal WAL records, and it will be loaded from the WAL to rebuild the index.

e.g. I merge segs 1 2 3 4 and generate hint files 1 and 2, and the new seg files written while merging are 5 and 6.

So when restarting, build the index from hint files 1 and 2 first, then load all data from 5 and 6 and build the index as normal.
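
A sketch of that restart order, with hypothetical loader helpers standing in for rosedb's actual load path:

    // loadHintFile and loadSegment are hypothetical stand-ins for rosedb's
    // real loaders: a hint file holds key -> WAL-position entries for merged
    // data, a segment is replayed record by record.
    func loadHintFile(path string, idx map[string]int64) error { return nil }
    func loadSegment(path string, idx map[string]int64) error  { return nil }

    // rebuildIndex follows the order described above: hint files for the
    // merged segments first, then the segments written while the merge ran,
    // so records written during the merge shadow the merged entries.
    func rebuildIndex(hintFiles, newSegments []string, idx map[string]int64) error {
        for _, f := range hintFiles { // e.g. hint files 1 and 2 from merging segs 1-4
            if err := loadHintFile(f, idx); err != nil {
                return err
            }
        }
        for _, f := range newSegments { // e.g. segs 5 and 6 created during the merge
            if err := loadSegment(f, idx); err != nil {
                return err
            }
        }
        return nil
    }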

amityahav commented 1 year ago

So I've implemented a working B-tree that respects insertion order for arbitrary keys; it was done by adding a small amount of code to the wrapper class, not to Google's implementation. Then I created a simple benchmark comparing the existing Ascend against the new DescendByInsertion, with 600k KV pairs in the DB. In most of my tests DescendByInsertion was at least 5x faster with the blockCache enabled; with it disabled, the performance is about the same. Let me know if you would be interested in such a feature. @roseduan

package rosedb_test

import (
    "fmt"
    "testing"
    "time"

    "github.com/rosedblabs/rosedb/v2"
    "github.com/rosedblabs/rosedb/v2/utils"
)

// Note: the import paths assume the rosedb v2 module layout, and
// utils.CacheHits / utils.CacheMiss and db.DescendByInsertion exist only on
// the POC branch, not in the upstream rosedb API.
func Benchmark_Descend(b *testing.B) {
    options := rosedb.DefaultOptions
    options.BlockCache = 32 * 1024 * 10
    options.DirPath = "./tmp/rosedb"

    db, err := rosedb.Open(options)
    if err != nil {
        panic(err)
    }

    // Seed the DB once (commented out after the first run so later runs
    // reuse the populated directory):
    //for i := 0; i < 600000; i++ {
    //  key := utils.RandomKey(10)
    //  value := utils.RandomValue(1024)
    //  _ = db.Put(key, value)
    //}

    utils.CacheHits = 0
    utils.CacheMiss = 0
    b.Run("descend", func(b *testing.B) {
        b.ResetTimer()
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
            utils.CacheHits = 0
            utils.CacheMiss = 0
            t := time.Now()
            _ = db.DescendByInsertion(func(k []byte, v []byte) error {
                return nil
            })
            fmt.Printf("completed in %d ms\n",
                time.Since(t).Milliseconds(),
            )
            fmt.Printf("cache hits: %d\n cache misses: %d\n", utils.CacheHits, utils.CacheMiss)
        }
    })
    db.Close()

    // Results:
    // Benchmark_Descend/descend
    // completed in 1095 ms
    // cache hits: 598998
    // cache misses: 20211
    // Benchmark_Descend/descend-10    1    1095058791 ns/op    2131705808 B/op    3059916 allocs/op

    // Reopen the DB for the ascend run.
    db, err = rosedb.Open(options)
    if err != nil {
        panic(err)
    }
    defer db.Close()
    utils.CacheHits = 0
    utils.CacheMiss = 0
    b.Run("ascend", func(b *testing.B) {
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
            t := time.Now()
            db.Ascend(func(k []byte, v []byte) (bool, error) {
                return true, nil
            })
            fmt.Printf("completed in %d ms\n",
                time.Since(t).Milliseconds(),
            )
            fmt.Printf("cache hits: %d\n cache misses: %d\n\n", utils.CacheHits, utils.CacheMiss)
            utils.CacheHits = 0
            utils.CacheMiss = 0
        }
    })

    // Results:
    // Benchmark_Descend/ascend
    // completed in 5139 ms
    // cache hits: 295
    // cache misses: 618914
    //
    // Benchmark_Descend/ascend-10    1    5139481083 ns/op    21792695688 B/op    4271704 allocs/op
}

roseduan commented 1 year ago

DescendByInsertion means the keys are sorted in insertion order, not lexicographically? If so, I am not sure whether anyone needs the feature.

amityahav commented 1 year ago

It's by latest insertion, but it's basically a faster way to iterate over all keys and values in general.

roseduan commented 1 year ago

Thanks. In my experience, most users probably want to iterate over keys in lexicographical order; we can consider this when someone needs it.

Thanks anyway.