khonsulabs / bonsaidb

A developer-friendly document database that grows with you, written in Rust
https://bonsaidb.io/
Apache License 2.0

Refactor Documents and Views to better utilize Nebari #250

Open ecton opened 2 years ago

ecton commented 2 years ago

Closes #76. Closes #225.

The primary goal of this PR is to improve the speed of view indexing (See #251 for more info) by tackling #76 in such a way that it can be executed safely without fsync.

Now that work has been done, the goals are slightly different:

Document Storage:

Documents are no longer serialized in a wrapper document type. Instead, the documents tree is now a versioned tree with an embedded index that stores the document's hash. The Revision's id is now the versioned tree's sequence_id.

This means that instead of simply pulling a document out of the database and deserializing it, we must pull the value and index out for a key and combine them with the key to create our document.
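The reconstruction step described above can be sketched roughly as follows. All type and field names here (`EmbeddedIndex`, `Revision`, `Document`, `hash_contents`) are hypothetical stand-ins for illustration, and the hash is a simple stand-in rather than the actual digest used:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the embedded index stored per key:
/// just the document's content hash.
struct EmbeddedIndex {
    hash: u64,
}

/// Hypothetical revision: the versioned tree's sequence id plus the hash.
struct Revision {
    sequence_id: u64,
    hash: u64,
}

struct Document {
    id: u64,
    revision: Revision,
    contents: Vec<u8>,
}

/// Stand-in hash; a real implementation would use a proper digest.
fn hash_contents(contents: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    contents.hash(&mut hasher);
    hasher.finish()
}

/// Combine the key, the versioned tree's sequence id, the embedded index,
/// and the stored value into a full document.
fn reconstruct(key: u64, sequence_id: u64, index: &EmbeddedIndex, value: Vec<u8>) -> Document {
    Document {
        id: key,
        revision: Revision {
            sequence_id,
            hash: index.hash,
        },
        contents: value,
    }
}
```

The point of the shape is that no serialized wrapper document exists anymore; everything needed to build the `Document` comes from the key, the value, and the tree's embedded index.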

The other major change is introduced by the constraints of working within Nebari's modification system. Because we don't have access to the index for a key we're about to set, most of the logic for creating the OperationResult has been moved outside of the CompareSwap operation.
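A minimal sketch of the constraint, assuming a simplified compare-and-swap shape (the `KeyOperation` and `compare_swap` names here are illustrative, not Nebari's actual API): the callback only sees the current value, never the embedded index the tree will compute, so the result for the caller has to be assembled after the operation completes.

```rust
/// Illustrative operation returned by a compare-and-swap callback.
enum KeyOperation {
    Set(Vec<u8>),
    Skip,
}

/// Simplified compare-and-swap: the callback receives only the current
/// value for the key. Any index for the new value is computed by the
/// tree afterward, so result construction must happen outside this call.
fn compare_swap<F>(current: Option<Vec<u8>>, callback: F) -> Option<Vec<u8>>
where
    F: FnOnce(Option<Vec<u8>>) -> KeyOperation,
{
    match callback(current) {
        KeyOperation::Set(new_value) => Some(new_value),
        KeyOperation::Skip => None,
    }
}
```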

View Storage:

Views have been refactored to store the reduced value in Nebari through an embedded index. Instead of storing the entire ViewEntry structure in the view, we now only store the serialized Vec<EntryMapping>. The major change here is that Nebari will now reduce the stored index via the new ViewIndexer. The changes haven't been made yet for reduce/reduce_grouped to use Nebari's native reduce function -- but that is the inspiration for these changes.

When retrieving a view entry, we reconstruct the ViewEntry using the stored index to maintain compatibility with the existing code that worked with the ViewEntry structure.
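A rough sketch of that reconstruction, with simplified stand-in shapes (`EntryMapping`, `ViewIndex`, `ViewEntry`, and `reconstruct_entry` are illustrative approximations, not the exact BonsaiDB types):

```rust
/// Simplified mapping from a source document to an emitted value.
struct EntryMapping {
    source: u64,
    value: Vec<u8>,
}

/// Stand-in for the embedded index, which now holds the reduced value
/// for the key instead of it living inside a serialized ViewEntry.
struct ViewIndex {
    reduced: Vec<u8>,
}

/// The legacy structure the existing code expects.
struct ViewEntry {
    key: Vec<u8>,
    mappings: Vec<EntryMapping>,
    reduced_value: Vec<u8>,
}

/// Rebuild a ViewEntry from the stored mappings plus the embedded index,
/// keeping the old call sites unchanged.
fn reconstruct_entry(key: Vec<u8>, mappings: Vec<EntryMapping>, index: &ViewIndex) -> ViewEntry {
    ViewEntry {
        key,
        mappings,
        reduced_value: index.reduced.clone(),
    }
}
```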

There are a lot of remaining tasks:

ecton commented 2 years ago

I've been starting work on a new file format that is my best theorycraft at something that could sit beneath Nebari -- https://github.com/khonsulabs/sediment. At its core is the idea that while an fsync is happening, other transactions can proceed with updating the database and then be batch-synced to confirm. Each thread's fsync would still take roughly the normal time for a sync, but transactions could now be batched together.
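The batching idea can be sketched in a single-threaded simulation (the `BatchCommitter` type is purely illustrative; it just counts simulated fsyncs to show that one sync confirms a whole batch of transactions):

```rust
/// Minimal sketch of group commit: transactions queue up while a sync is
/// in flight, and a single (simulated) fsync confirms the whole batch.
struct BatchCommitter {
    pending: Vec<u64>,
    syncs: u64,
    confirmed: Vec<u64>,
}

impl BatchCommitter {
    fn new() -> Self {
        Self {
            pending: Vec::new(),
            syncs: 0,
            confirmed: Vec::new(),
        }
    }

    /// A transaction finishes its writes and joins the pending batch
    /// rather than issuing its own fsync.
    fn submit(&mut self, tx_id: u64) {
        self.pending.push(tx_id);
    }

    /// One fsync (simulated by the counter) durably confirms every
    /// pending transaction at once.
    fn sync(&mut self) {
        self.syncs += 1;
        self.confirmed.append(&mut self.pending);
    }
}
```

In a real implementation the submitting threads would block (or receive a notification) until the batch containing their transaction has been synced; the sketch only shows the one-sync-per-batch accounting.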

That core idea is actually somewhat compatible with the append-only format, except that only one writer can modify the tree at any given moment. I attempted to bring this idea into Nebari today without the new project, but I ran into another issue that Sediment wouldn't suffer from: multi-file synchronization.

The reason my work today didn't accomplish much is that each tree file is still being synced for each write. I don't have a good way to batch these operations at the moment, but that is one of the things Sediment aims to solve. I may come up with an idea in the meantime and try again -- but the more I think about Sediment, the more hopeful I am that it can be significantly better than an append-only format, so I probably still want to get there anyway.