khonsulabs / bonsaidb

A developer-friendly document database that grows with you, written in Rust
https://bonsaidb.io/
Apache License 2.0

Project Status #262

Closed - D1plo1d closed this issue 1 year ago

D1plo1d commented 1 year ago

Hey, just stumbled across BonsaiDB and it looks really neat! The last commit was in August, so I just wanted to check before I get too far into it: is this project still being maintained?

ecton commented 1 year ago

Hi, thank you for asking! It is still in active development, but its progress has definitely slowed. There is a pending file format redesign, and I plan on offering at least migration tools to help anyone currently using BonsaiDb move to the new file format.

Since I haven't really documented this on GitHub anywhere, here's a rough timeline of what happened:

Here's the list of what needs to be done for this new storage layer to be integrated:

I hope this assuages any fears about whether BonsaiDb is still being worked on. But I also completely understand if this lack of certainty regarding its performance deters people from trying it out.

At the end of the day, each time I think of building a project with Rust, I would still reach for BonsaiDb even in its current state. It's that basic love of what I've built that will keep me going on this project for a long time.

D1plo1d commented 1 year ago

First off, thank you for your comprehensive reply - that was all the information I'd hoped for and more :smile:

I hope this assuages any fears about whether BonsaiDb is still being worked on.

It does - there's still a small concern about the bus factor in the back of my head, but it's exciting to hear all of this continued thought has been put into the project. Honestly, it sounds like a massive undertaking for a single developer - I'm glad you were able to notice the burnout and take a break when you needed it.

Having re-developed my application in a few different databases in search of a suitable embedded DB, I'm quite excited by the high level struct-centric interfaces, ACID transactions, and versioned map/reduce architecture of BonsaiDB. I still have to validate that BonsaiDB can run on the extremely memory-constrained embedded platform I'm aiming to support, but regardless I am excited to hear that this project is continuing to receive attention - it feels on the right track in many important respects.
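
For context, the struct-centric interface I'm referring to looks roughly like this (a sketch based on BonsaiDb's local examples; the `Message` collection and file name are just illustrative, and exact module paths may differ between versions):

```rust
use std::time::SystemTime;

use bonsaidb::{
    core::schema::{Collection, SerializedCollection},
    local::{
        config::{Builder, StorageConfiguration},
        Database,
    },
};
use serde::{Deserialize, Serialize};

// A collection is just a serializable struct with a derive.
#[derive(Debug, Serialize, Deserialize, Collection)]
#[collection(name = "messages")]
struct Message {
    timestamp: SystemTime,
    contents: String,
}

fn main() {
    // Open (or create) a local, file-backed database for this collection.
    let db = Database::open::<Message>(StorageConfiguration::new("example.bonsaidb"))
        .expect("failed to open database");

    // Inserts go through BonsaiDb's ACID transaction log.
    Message {
        timestamp: SystemTime::now(),
        contents: String::from("sensor reading"),
    }
    .push_into(&db)
    .expect("failed to insert document");
}
```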

With respect to Sediment, I wanted to ask if you had any thoughts on ReDB, which I think came out after your work on Sediment began. I haven't seen it mentioned anywhere in your posts yet, so I wanted to make sure it was at least on your radar.

ecton commented 1 year ago

It does - there's still a small concern about the bus factor in the back of my head, but it's exciting to hear all of this continued thought has been put into the project. Honestly, it sounds like a massive undertaking for a single developer - I'm glad you were able to notice the burnout and take a break when you needed it.

I definitely would like to improve on the bus factor! There are a few people who have been dabbling at contributing to BonsaiDb, and I'm hoping that after development around my new format stabilizes, I might be able to attract more people to those projects as well. It's pretty understandable for someone to not want to learn a mountain of code when its future is uncertain.

I still have to validate that BonsaiDB can run on the extremely memory-constrained embedded platform I'm aiming to support

One contributor was able to get BonsaiDb running on a Raspberry Pi. I don't have much embedded hardware yet, but it's something I've been wanting to tinker with more in the coming years. The main limitation for running BonsaiDb in embedded environments is its reliance on std. If you run into any specific problems, please don't hesitate to open an issue!

With respect to Sediment, I wanted to ask if you had any thoughts on ReDB, which I think came out after your work on Sediment began. I haven't seen it mentioned anywhere in your posts yet, so I wanted to make sure it was at least on your radar.

It did come out in the midst of my experiments, and it looks like a great project. One serious thought I still have is whether Nebari should exist, or whether BonsaiDb should just use another database format. The two main arguments for pursuing my new format are:

I feel like Sediment has a unique offering that's worth exploring, and I'm hopeful that I finally have the right combination of strategies to get the performance I'm looking for. If I don't, I'll definitely be considering using another format again.

I hope your experiments are successful at getting BonsaiDb running on that hardware!

D1plo1d commented 1 year ago

That annotated B+ Tree is a really neat innovation, certainly not something I'd stumbled onto before.

Thanks for explaining all that. I did a little highly unscientific testing. BonsaiDb with an empty database allocated about 2MB of RAM (I used jemalloc to collect allocation stats), which should fit nicely in my system's 64MB of RAM.
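
For anyone curious, the allocation numbers above came from jemalloc's statistics. A minimal version of that measurement looks something like this (assuming the jemallocator and jemalloc-ctl crates; the surrounding workload is elided):

```rust
use jemalloc_ctl::{epoch, stats};

// Route all allocations through jemalloc so its stats cover the whole process.
#[global_allocator]
static GLOBAL: jemallocator::Jemalloc = jemallocator::Jemalloc;

fn allocated_bytes() -> usize {
    // jemalloc caches its statistics; advancing the epoch refreshes them.
    epoch::advance().expect("failed to advance jemalloc epoch");
    stats::allocated::read().expect("failed to read allocation stats")
}

fn main() {
    println!("allocated before: {} bytes", allocated_bytes());
    // ... open the BonsaiDb database and run the insert workload here ...
    println!("allocated after:  {} bytes", allocated_bytes());
}
```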

It appears that memory usage scales in proportion to the database size; after a few hundred thousand inserts I had about 11MB of memory allocated. Should memory usage stop growing at some point, or is it linear with the number of db entries (e.g. due to some in-memory B-Tree or something)?

I empirically observed the slowness you talked about in your blog post, but I won't know how much that will affect me until I get some real hardware to test on - if I remember correctly the system runs at 400MHz, so it should be many times slower than my laptop, but that might still be fine for my very limited throughput (around 10 writes per second to record some sensor data).

ecton commented 1 year ago

Oops, I just added issue #263 to allow configuring the size of the internal cache. It currently will expand to 2,000 entries, each of which can hold up to 160KB. Unless you're writing large payloads, you can assume the maximum cache size should end up being roughly 2,000 * average document size. Most of the other things that BonsaiDb keeps in memory are small and shouldn't grow based on the data size.
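
To put rough numbers on that estimate (purely back-of-the-envelope, using only the figures above):

```rust
// Back-of-the-envelope cache bounds using the numbers above.
const CACHE_ENTRIES: usize = 2_000;
const MAX_ENTRY_BYTES: usize = 160 * 1024; // ~160KB upper bound per cached entry

fn main() {
    // Absolute worst case: every cached entry at the maximum size (~312MB).
    println!("worst case: {} KB", CACHE_ENTRIES * MAX_ENTRY_BYTES / 1024);
    // Typical small documents (~1KB each) keep the cache around 2MB.
    println!("~1KB docs:  {} KB", CACHE_ENTRIES * 1024 / 1024);
}
```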

The new format I'm designing will have some increased memory usage to keep track of various on-disk state, but it should still be comfortably usable in a low-memory environment.

I'm very happy to hear that you were able to get it working! One other thing to note about speed is if you're testing on macOS: Currently BonsaiDb issues an fcntl(F_FULLFSYNC) because that's what Rust's File::sync_data() does under the hood. This is absolutely correct behavior for true ACID compliance, but it's known to be slow compared to fsync on Linux. This is made even worse by BonsaiDb currently requiring two fsyncs.

Syncing data is still slow on Linux, but it's markedly slower on macOS. Ultimately, ~10 writes per second should never have any problems with BonsaiDb, assuming the underlying fsync operation succeeds in a reasonable amount of time. I'm very hopeful you won't have any problems on real hardware.
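
If you want to gauge how much the sync cost matters on your target hardware before committing, a quick standalone test of Rust's File::sync_data() is probably the simplest way. This is just a rough sketch of such a test, not how BonsaiDb batches its writes:

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let mut file = OpenOptions::new()
        .create(true)
        .write(true)
        .open("sync-test.bin")?;

    let payload = vec![0u8; 4096];
    let iterations: u32 = 100;

    let start = Instant::now();
    for _ in 0..iterations {
        file.write_all(&payload)?;
        // On macOS this issues fcntl(F_FULLFSYNC); on Linux, fdatasync.
        file.sync_data()?;
    }

    println!("average sync_data(): {:?}", start.elapsed() / iterations);
    Ok(())
}
```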

D1plo1d commented 1 year ago

Oops, I just added issue https://github.com/khonsulabs/bonsaidb/issues/263 to allow configuring the size of the internal cache. It currently will expand to 2,000 entries, each of which can hold up to 160KB. Unless you're writing large payloads, you can assume the maximum cache size should end up being roughly 2,000 * average document size. Most of the other things that BonsaiDb keeps in memory are small and shouldn't grow based on the data size.

Interesting, I was about to change the hard-coded cache size locally but in re-running my test program to get a baseline I found that the memory usage had dropped back to 3MB.

It looks like my full dataset of hundreds of thousands of entries is still there, so it seems that the memory increase I was experiencing may be a memory leak related to large numbers of inserts (in the hundreds of thousands). This is fine for my usage as I can just periodically restart my server (it would take almost a year to do as many inserts as my example did), but I thought it worth reporting anyway just as a heads-up.

Edit: Wait, I just realized this may simply be due to having a cold cache. I will try loading 2,000 entries and see if the problem returns.

Edit 2: I've added some cache warming that gets the first 2,000 entries - the memory usage remains the same. So it appears my initial assessment that this is not a cache size problem may be correct.

ecton commented 1 year ago

Interesting. I am not aware of any memory leaks. On the main branch, I worked on inserting truly massive sets of data and did not notice memory leaks. If you end up having any further observations, please let me know!

D1plo1d commented 1 year ago

Will do!