guycipher / k4

High-performance, open-source, durable, transactional embedded storage engine designed for low latency and optimized read and write efficiency.
https://pkg.go.dev/github.com/guycipher/k4/v2
BSD 3-Clause "New" or "Revised" License

Possible issues #25

Closed marvin-j97 closed 11 hours ago

marvin-j97 commented 2 weeks ago

I found this on Reddit but I decided to post here because my comment got too long.

Disclaimer: I'm making https://github.com/fjall-rs/lsm-tree, so I have some experience with LSM-trees.

I don't really know Go well, so some of this may be wrong, but reading through a bit of the code I think there are some questionable things:

  1. The tombstone is a fixed value, so one could break the implementation by writing a "$tombstone" value manually. Also this makes tombstones unnecessarily large.
  2. I don't understand the "granular page locking" stuff - you don't need locking on SSTables... they are immutable. I don't understand the idea of the "pager" and "pages" at all... you don't seem to use block-based SSTables, that's why I can't find index blocks, but that probably makes compression pretty ineffective compared to RocksDB. Are pages cached, if so, what's the cache size? Is it set correctly for RocksDB as well?
  3. The Cuckoo filter seems to return null if it returns false... but Cuckoo filters are still a probabilistic data structure, or are you making sure it has a FPR of 0.0? Otherwise it may return a wrong result. How much memory does the Cuckoo filter use, considering you have to index every key?
  4. I don't see any notion of "levels" in the tree. K4 seems to support range queries, but there seems to be a single level, so you are doing size-tiered compaction. How many SSTables are usually kept around? Looking at the compaction, there's a timer, which I don't think makes sense. When you have a burst of write traffic you will end up with more SSTables and have to wait, possibly 1 hour, for compaction to fix up your read performance. Compactions should operate on the current state of SSTables in the LSM-tree.
  5. The flushes and compactions don't seem to fsync their data before appending the new SSTables.
  6. Transactions are really just write batches?
  7. The conclusion in the wiki is obviously written by an LLM, not to mention there are wrong statements (in point 4) like K4 having atomic transactions (as if RocksDB doesn't do that) and being able to recover SSTables from the WAL... which makes no sense? You recover memtables from a WAL (as you seem to do as well).
  8. Also I think having to call RecoverFromWAL manually is a recipe for disaster. The readme says "If you have a populated WAL file in the data directory but no data files aka sstables", but I don't understand that. What's the connection between having no SSTables and recovering a WAL? You always recover a WAL to restore the memtable(s), no matter if you have SSTables or not.
  9. Benchmarks use HDD, which probably hides inefficiencies in the read path; also RocksDB is typically run on SSDs.
  10. By default, RocksDB flushes writes to OS buffers... does K4 do that as well? If not, it should be disabled for RocksDB (manual_wal_flush = true) to make the comparison more fair.
guycipher commented 2 weeks ago

Hey @marvin-j97 good morning!! Thank you for this extensive write up.

I've seen your project. Good work and keep it up.

The tombstone is a fixed value, so one could break the implementation by writing a "$tombstone" value manually. Also this makes tombstones unnecessarily large.

They are a fixed value, yes. A minor change can be made to reject the tombstone value for a key coming in through the public methods. I've thought about this, and thank you for bringing it back up.
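
Roughly, the guard would look like this (an illustrative sketch with made-up names, not the actual patch):

package main

import (
    "bytes"
    "errors"
    "fmt"
)

// Hypothetical sentinel; the real constant lives inside the engine.
var tombstone = []byte("$tombstone")

var ErrReservedValue = errors.New("value collides with internal tombstone marker")

// put sketches rejecting the reserved value before it ever reaches the memtable.
func put(key, value []byte) error {
    if bytes.Equal(value, tombstone) {
        return ErrReservedValue
    }
    // ... append to WAL, insert into memtable, etc.
    return nil
}

func main() {
    fmt.Println(put([]byte("k"), []byte("$tombstone"))) // rejected
}

A per-entry deleted flag in the record header could also shrink tombstones compared to carrying the full sentinel string.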

I don't understand the "granular page locking" stuff - you don't need locking on SSTables... they are immutable. I don't understand the idea of the "pager" and "pages" at all... you don't seem to use block-based SSTables, that's why I can't find index blocks, but that probably makes compression pretty ineffective compared to RocksDB. Are pages cached, if so, what's the cache size? Is it set correctly for RocksDB as well?

On read operations many threads can access the Get method; yes, the memtable is locked, but SSTables are locked at the page level on reads, for granular access. There is no page caching.

The Cuckoo filter seems to return null if it returns false... but Cuckoo filters are still a probabilistic data structure, or are you making sure it has a FPR of 0.0? Otherwise it may return a wrong result. How much memory does it use, considering you have to index every key?

I've tried to make it as memory efficient as I could whilst keeping it usable as a page index for key-value pairs with prefixes. I am not making sure it has an FPR of 0; the FPR with this cuckoo filter is determined roughly by the load factor. The idea behind it was to provide a lower FPR than the previous bloom filter implementation and to offer an index solution for key-value pages.
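
For context, the usual back-of-the-envelope from the cuckoo filter paper (Fan et al., 2014) ties the false positive rate to the fingerprint width and bucket size, and the space cost to the load factor; a quick sketch:

package main

import (
    "fmt"
    "math"
)

func main() {
    const b = 4.0      // entries per bucket, a typical choice
    const alpha = 0.95 // achievable load factor for b = 4

    for _, f := range []float64{8, 12, 16} { // fingerprint width in bits
        fpr := 2 * b / math.Pow(2, f) // upper-bound FPR ≈ 2b / 2^f
        bitsPerKey := f / alpha       // space per stored key ≈ f / α bits
        fmt.Printf("f=%2.0f bits  FPR ≈ %.3f%%  space ≈ %.1f bits/key\n", f, fpr*100, bitsPerKey)
    }
}

So at practical fingerprint sizes the FPR is small but never zero, which means a positive answer from the filter still has to be verified against the actual page data.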

I don't see any notion of "levels" in the tree. K4 seems to support range queries, but there seems to be a single level, so you are doing size-tiered compaction. How many SSTables are usually kept around? Looking at the compaction, there's a timer, which I don't think makes sense. When you have a burst of write traffic you will end up with more SSTables and have to wait, possibly 1 hour, for compaction to fix up your read performance. Compactions should operate on the current state of SSTables in the LSM-tree.

The memtable is level 0, whilst every other SSTable is another level in K4. If compaction occurs very frequently, there is always just 1 SSTable.

When you have a burst of write traffic you will end up with more SSTables and have to wait, possibly 1 hour, for the compaction

That's why I made it so you can set an interval, or use EscalateCompaction, which is more efficient for me.
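
For comparison, the state-driven trigger being described would look roughly like this (a generic sketch with made-up types, not K4 code): compaction fires as soon as the live table count crosses a threshold, rather than waiting for the next interval.

package main

import (
    "fmt"
    "sync"
)

// engine is a toy stand-in; field and method names are illustrative only.
type engine struct {
    mu       sync.Mutex
    sstables []string // paths of live SSTables
}

const maxSSTables = 8 // trigger threshold; real engines tune this per level

// onFlush registers a freshly written SSTable and compacts immediately
// if the table count crossed the threshold, instead of on a timer.
func (e *engine) onFlush(path string) {
    e.mu.Lock()
    e.sstables = append(e.sstables, path)
    needCompact := len(e.sstables) > maxSSTables
    e.mu.Unlock()

    if needCompact {
        e.compact()
    }
}

func (e *engine) compact() {
    // merge tables, drop shadowed and tombstoned entries, swap in the result...
    fmt.Println("compaction triggered by table count, not by the clock")
}

func main() {
    e := &engine{}
    for i := 0; i < 10; i++ {
        e.onFlush(fmt.Sprintf("sstable_%d.sst", i))
    }
}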

The flushes and compactions don't seem to fsync their data before appending the new SSTables.

Transactions are really just write batches?

They do; the underlying pager syncs data on a set interval, escalates every minute, and syncs on pager closure.
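
For reference, the standard belt-and-braces way to publish a new SSTable durably, independent of any timed sync, is write-to-temp, fsync, rename, then fsync the directory; a generic sketch (not the K4 pager):

package main

import (
    "log"
    "os"
    "path/filepath"
)

// writeSSTable durably publishes data as an SSTable: temp file -> fsync -> rename -> directory fsync.
func writeSSTable(dir, name string, data []byte) error {
    tmp := filepath.Join(dir, name+".tmp")
    f, err := os.Create(tmp)
    if err != nil {
        return err
    }
    if _, err := f.Write(data); err != nil {
        f.Close()
        return err
    }
    if err := f.Sync(); err != nil { // data reaches stable storage before the file is exposed
        f.Close()
        return err
    }
    if err := f.Close(); err != nil {
        return err
    }
    if err := os.Rename(tmp, filepath.Join(dir, name)); err != nil {
        return err
    }
    d, err := os.Open(dir) // fsync the directory so the rename itself survives a crash
    if err != nil {
        return err
    }
    defer d.Close()
    return d.Sync()
}

func main() {
    if err := writeSSTable(os.TempDir(), "sstable_0.sst", []byte("demo")); err != nil {
        log.Fatal(err)
    }
}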

The conclusion in the wiki is obviously written by an LLM, not to mention there are wrong statements (in point 4) like K4 having atomic transactions (as if RocksDB doesn't do that) and being able to recover SSTables from the WAL... which makes no sense? You recover memtables from a WAL (as you seem to do as well).

Huh? No. Firstly, this isn't a clone of RocksDB. Everything is designed from my own ideas, with reference to a log-structured merge-tree. Yes, you can batch operations in a transaction and commit them atomically to the memtable.

If you have NO SSTables and NO data, you can recover from your WAL file to regenerate your K4 instance as it was.

Also I think having to call RecoverFromWAL manually is a recipe for disaster. The readme says "If you have a populated WAL file in the data directory but no data files aka sstables", but I don't understand that. What's the connection between having no SSTables and recovering a WAL? You always recover a WAL to restore the memtable(s), no matter if you have SSTables or not.

How can you not understand that? If you have a WAL file that is populated, even from another system, you can recover your K4 instance as it was before.
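
Mechanically, recovering from a WAL just means replaying the logged operations, in order, onto an empty memtable; a toy sketch (the record layout here is made up, not K4's on-disk format):

package main

import "fmt"

// record is a toy WAL entry; the real on-disk layout differs.
type record struct {
    op         byte // 0 = put, 1 = delete
    key, value []byte
}

// replay re-applies logged operations, in order, onto an empty in-memory table.
func replay(wal []record) map[string][]byte {
    memtable := make(map[string][]byte)
    for _, r := range wal {
        switch r.op {
        case 0:
            memtable[string(r.key)] = r.value
        case 1:
            delete(memtable, string(r.key))
        }
    }
    return memtable
}

func main() {
    wal := []record{
        {0, []byte("a"), []byte("1")},
        {0, []byte("b"), []byte("2")},
        {1, []byte("a"), nil},
    }
    fmt.Println(replay(wal)) // map[b:2]
}

The replay works the same whether the log was produced locally or copied over from another machine, which is the portability point here.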

Benchmarks use HDD, which probably hides inefficiencies in the read path; also RocksDB is typically run on SSDs. By default, RocksDB flushes writes to OS buffers... does K4 do that as well? If not, it should be disabled for RocksDB (manual_wal_flush = true) to make the comparison more fair.

I did; look at the latest post. I benched on a Google Cloud N2 with a 250 GB SSD, and K4 performed amazingly whereas RocksDB did not at similar configurations. I had to stop the test for RocksDB as 10 million key-values was making it poop the bed.

I have to also add that there is nothing wrong with trying to be innovative and creative with the design choices here. If I copied RocksDB, what would make K4 special?

guycipher commented 2 weeks ago

I starred your project. Keep it up. Storage engines are no small feat.

guycipher commented 2 weeks ago

https://github.com/fjall-rs/fjall is good stuff as well. I saw it out there and gave it some love too.

guycipher commented 2 weeks ago

If you have any more questions do let me know. I am getting a coffee now, will work on the tombstone patch. Cheers!

marvin-j97 commented 2 weeks ago

I just tried it in a new Go project:

package main

import (
    "fmt"
    "github.com/guycipher/k4/v2"
    "log"
)

const (
    DB_PATH = "testdb"
)

func main() {
    numOps := 10000000

    db, err := k4.Open(DB_PATH, (1024*1024)*128, 3600, false, false)
    if err != nil {
        log.Fatalf("Error opening K4 database: %v", err)
    }
    defer db.Close()

    key := make([]byte, 20)
    value := make([]byte, 20)

    // Benchmark Put
    for i := 0; i < numOps; i++ {
        key = []byte(fmt.Sprintf("key%d", i))
        value = []byte(fmt.Sprintf("value%d", i))
        if err := db.Put(key, value, nil); err != nil {
            log.Fatalf("Error putting key: %v", err)
        }
    }
}

is my code.

It ended up writing a 40 GB WAL, and 3 SSTables with 18 GB, 18 GB and 7 GB respectively, and taking about 100s. That's roughly a space amplification of 200. It also used more than 6 GB of RAM.

erd --human --hidden testdb

 7.0 GiB ┌─ sstable_2.sst
16.8 GiB ├─ sstable_1.sst
17.1 GiB ├─ sstable_0.sst
40.5 GiB ├─ .wal
81.5 GiB testdb
time ./main

real    1m43,356s
user    6m24,043s
sys 1m19,227s

My system is i9 11900k, Samsung 990 EVO.

guycipher commented 2 weeks ago

Thank you! Excellent run on great hardware. Something I've been seeing more frequently is that the size of the SSTables is huge compared to the configured threshold. With the 128 MB flush threshold in your example there should have been 340-ish SSTables (the ~41 GiB of table data divided by 128 MB per flush), not 3. That's a bug, most definitely. When you write 6 GB memory usage, is that through the write operations? That's not too bad for the amount of key-values, in my mind; it could be optimized for less usage, I'd imagine. The WAL is sorta a given, it will be large. Each page is 1024*4 bytes and could be smaller for sure.

marvin-j97 commented 2 weeks ago

When you write 6gb memory usage, that’s through the write operations?

It's about 4 GB, after 10 seconds the writes are "finished" and it goes into the 90 seconds of ???, and goes to about 6 GB. Without having checked, I would expect RocksDB to use maybe 100-200 MB here.

The WAL is sorta a given

I have written just 400 MB of data. That's why every database rotates & truncates the WAL at some point (normally when doing memtable flushes). It's not feasible to keep a WAL that large.
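
The usual hand-off is roughly this (a generic sketch with made-up helper names, not fjall or K4 code): seal the active WAL segment, flush the memtable it covers to an SSTable, and only then delete the sealed segment, since its contents are now durable elsewhere.

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// rotateWAL sketches the flush-time hand-off between the log and the SSTables.
func rotateWAL(dir string, seq int, flushMemtable func() error) error {
    sealed := filepath.Join(dir, fmt.Sprintf("%06d.wal", seq))
    next := filepath.Join(dir, fmt.Sprintf("%06d.wal", seq+1))

    f, err := os.Create(next) // new segment receives all writes from now on
    if err != nil {
        return err
    }
    f.Close()

    if err := flushMemtable(); err != nil { // SSTable written and fsynced here
        return err
    }
    return os.Remove(sealed) // safe: everything in it now lives in an SSTable
}

func main() {
    dir, _ := os.MkdirTemp("", "waldemo")
    os.WriteFile(filepath.Join(dir, "000001.wal"), []byte("pending writes"), 0o644)
    fmt.Println("rotated:", rotateWAL(dir, 1, func() error { return nil }) == nil)
}

This keeps the WAL bounded by roughly the size of the unflushed memtable instead of the whole write history.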

guycipher commented 2 weeks ago

On close we flush the memtable and wait for operations to finish; that includes synchronization and background operations. In the case stated it shouldn't take that long. It's taking that long because of how large the SSTables are; if they followed the threshold, closure would be faster. I am working on this bug currently. 18 GB SSTables are way off the threshold, which is calculated in the skip list as new entries are added.

Hm, interesting. I could truncate the WAL after each flush; that could work. Good idea!

guycipher commented 2 weeks ago

Honestly, @marvin-j97, I'm very appreciative, eh. Very. Thank you for taking the time to do this. I'll be sure to give you a shoutout next release with these patches.

guycipher commented 2 weeks ago

Just noting this for the SSTable flush size issue.

func (n *Node) Size() int {
    // previous accounting, kept for reference:
    //size := len(n.key) + len(n.value) + len(n.forward)*int(unsafe.Sizeof(uintptr(0)))
    //if n.ttl != nil {
    //  size += int(unsafe.Sizeof(*n.ttl))
    //}
    //return size

    // fixed struct overhead (slice headers, pointers, TTL field) plus the
    // variable-length key and value bytes
    return int(unsafe.Sizeof(*n)) + len(n.key) + len(n.value)
}

This gives a more consistent size for the skiplist nodes, and thus a better flush guarantee in my tests. Essentially, the problem of too few SSTables is, from my current tests, that the skiplist size isn't being reported against the threshold properly, and even then the same tests produce different sizes on every run. This works better. I am testing more and will commit shortly.

guycipher commented 2 weeks ago

For the WAL file: because I want to assure you can back up your entire instance from a WAL, it's hard to implement truncation. Yes, there "could" be SSTables, but what if there are not? I am thinking about this. Currently, the best bet to bring down the WAL file size is making the pager page size for the entire system 1024. This is fine, as encoded values under that size will keep the files small, but I need to bench it, as the pager will have to create and link overflow pages and gather them on reads as well.

guycipher commented 2 weeks ago

I'm gonna change the pager page size to 1024. Honestly, the current size is vastly inefficient: encoded operations and encoded key-values rarely (it depends on the application) hit that size, and if they do, the pager will create and link overflow pages; on read it gathers the page and its overflows, so I don't see why this would be inefficient. I think it would shrink the overall file sizes across the board.
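
The overflow chaining would look roughly like this toy sketch (in-memory stand-ins, not the real pager format): a record larger than a page's payload spills across a linked chain of pages and is reassembled on read.

package main

import "fmt"

const pageSize = 1024
const headerSize = 16 // e.g. record length plus a next-overflow pointer
const payload = pageSize - headerSize

// page is a toy in-memory stand-in for a fixed-size on-disk page with an overflow link.
type page struct {
    data []byte
    next *page // nil when the record ends in this page
}

// write splits one encoded record across a chain of fixed-size pages.
func write(record []byte) *page {
    head := &page{}
    cur := head
    for len(record) > payload {
        cur.data = record[:payload]
        record = record[payload:]
        cur.next = &page{}
        cur = cur.next
    }
    cur.data = record
    return head
}

// read walks the chain and reassembles the record.
func read(p *page) []byte {
    var out []byte
    for ; p != nil; p = p.next {
        out = append(out, p.data...)
    }
    return out
}

func main() {
    rec := make([]byte, 2500)          // spans 3 pages at 1008 usable bytes each
    fmt.Println(len(read(write(rec)))) // 2500
}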

I will also be adjusting the header size to 16 bytes as opposed to 128. An int64 in Go takes up 8 bytes, I believe.

guycipher commented 2 weeks ago

One thing I noticed is that Go's gob package, which encodes structures like operation and keyvalue, is not optimized for space; this is causing these structures to grow significantly in size after encoding. To counter this, protobuf, msgpack, something like that could be used, or a custom solution.
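
A quick way to see the overhead is to gob-encode a tiny key/value struct and compare it with a plain length-prefixed layout (a standalone snippet, not K4's actual structures):

package main

import (
    "bytes"
    "encoding/binary"
    "encoding/gob"
    "fmt"
    "log"
)

// KeyValue is a stand-in for the engine's encoded structure.
type KeyValue struct {
    Key   []byte
    Value []byte
}

func main() {
    kv := KeyValue{Key: []byte("key42"), Value: []byte("value42")}

    // gob: self-describing, so the stream carries a type description as well as the data.
    var g bytes.Buffer
    if err := gob.NewEncoder(&g).Encode(kv); err != nil {
        log.Fatal(err)
    }
    fmt.Println("gob bytes:   ", g.Len())

    // manual layout: 4-byte key length, key, 4-byte value length, value.
    var m bytes.Buffer
    binary.Write(&m, binary.LittleEndian, uint32(len(kv.Key)))
    m.Write(kv.Key)
    binary.Write(&m, binary.LittleEndian, uint32(len(kv.Value)))
    m.Write(kv.Value)
    fmt.Println("manual bytes:", m.Len()) // 20 for this pair
}

gob also re-sends the type descriptor whenever a fresh encoder is used, so if every record gets its own encoder that cost is paid per record; protobuf, msgpack, or a hand-rolled layout like the one above avoids it.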

guycipher commented 2 weeks ago

In 2.1.7 I tested 1 million key-value pairs with your code. I got back a memtable of 95.3 MB. After the flush to disk the file was 1 GB. This is definitely not the pager; I am seeing these encoded structures span multiple pages.

guycipher commented 2 weeks ago

Even basic encoding and decoding methods that don't use gob, though not fully optimized, are still increasing each encoding by a factor of approximately 371. It's hard to get this down, as these structures are complex.

package main

import (
    "fmt"
    "github.com/guycipher/k4/v2"
    "log"
)

const (
    DB_PATH = "testdb"
)

func main() {
    numOps := 1000000

    db, err := k4.Open(DB_PATH, (1024*1024)*5, 3600, true, false)
    if err != nil {
        log.Fatalf("Error opening K4 database: %v", err)
    }
    defer db.Close()

    key := make([]byte, 20)
    value := make([]byte, 20)

    // Benchmark Put
    for i := 0; i < numOps; i++ {
        key = []byte(fmt.Sprintf("key%d", i))
        value = []byte(fmt.Sprintf("value%d", i))
        if err := db.Put(key, value, nil); err != nil {
            log.Fatalf("Error putting key: %v", err)
        }
    }
}

real    0m12.808s
user    0m10.124s
sys     0m6.882s

This creates output that is more acceptable: with the (1024*1024)*5 flush threshold the files come out at roughly 56 MB each. So toying with the flush size and seeing the outcome is pretty important.

This doesn't help the case of the WAL file, though. It will still be rather large.

For 1 million key-value pairs the WAL is almost 1 GB, and the SSTables added up are almost 1 GB as well.

In the case of the WAL again: we want to be able to recover K4, or rebuild it on another server, from the WAL, so truncating it won't work. It's possible to add some sort of checkpointing, but I'm not sure yet.

marvin-j97 commented 2 weeks ago

In the case of the WAL again: we want to be able to recover K4, or rebuild it on another server, from the WAL, so truncating it won't work. It's possible to add some sort of checkpointing, but I'm not sure yet.

That's what the SSTables are for. To rebuild an LSM-tree you send over the SSTables and the current WAL fragment(s).

guycipher commented 2 weeks ago

Understood. So essentially on every flush off the queue (in K4 terms) you'd truncate the WAL, since those entries are now safe in an SSTable. So, curious: when you reopen an instance you replay the WAL to populate the memtable, and essentially it cleans itself up once it truncates and the data transitions to an SSTable? I gotta wrap my head around that. Very interesting stuff. I'm obviously always worried about the actual SSTables. I understand the reasoning here; in K4 we are storing double the data pretty much, which is sorta bloat.