spacejam / sled

the champagne of beta embedded databases
Apache License 2.0
8.05k stars 383 forks source link

PageCache raw_buf's and Invertible Transforms #325

Open davidrusu opened 6 years ago

davidrusu commented 6 years ago

Hey there, I have a need for encryption at rest and I was hoping I could use sled for my storage layer.

I was looking through the code base for areas where it would make sense to wedge encryption in and it seemed like it should be right around where the compression is done since compressing an encrypted message is not something you'd want to do.

We could augment the LogReader::read_message and IoBufs::encapsulate methods to check for a config.use_encryption and then do the encryption, but between this encryption and compression, it looks like a pattern is beginning, ie. both compression and encryption are an invertible transforms acting on the raw message buffer to be written to the log.

I'm curious what your thoughts would be on exposing some InvertibleTransform trait e.g:

trait InvertibleTransform {
    fn forward(&[u8]) -> Vec<u8>;
    fn backward(&[u8]) -> Vec<u8>;
}

and replace the use_compression field in Config with a transforms field:

struct Config {
    ...
    transforms: Vec<dyn InvertibleTransform>
    ...
}

then for the forward pass, IoBufs::encapsulate would do something like:

let mut buf = raw_buf;
for transform in transforms.iter() {
    let buf = transform.forward(&buf)
}

and for the backwards pass, LogReader::read_message would iterate the transforms starting with the last transform and move backwards through the list:

let mut buf = raw_buf;
for transform in transforms.iter().rev() {
    let buf = transform.backward(&buf)
}

This would allow us to move the compression out of the PageCache crate if that's desirable (I noticed a few #[cfg(feature = "zstd")] macros which can be removed), and even allow users to plug in their own compression algorithms that may work better for their application e.g. Word-level compression if they are working with a bunch of text.

Also, more important for my use-case, encryption can be handled entirely outside of the database, this is important for cases where you want to build a distributed database on top of sled, distributed encryption with a same secret keys needs tight control over how a block cipher nonce is generated as a reused nonce can often mean catastrophic failure and a leaked secret key.

I'm curious to hear your thoughts on this change, I would be happy to do the work since I need an encrypted datastore for my own project, is this something you would be open to?

spacejam commented 6 years ago

Hey @davidrusu! This sounds really interesting, and I'm definitely a fan of the way that this lets the pagecache be much more flexible and supportive of interesting workloads like yours.

I like the idea of removing conditional compilation in principle, but I'm not sure about whether we should rip out zstd right now. I like that users can easily say they want compression, and I'd prefer if that functionality remained for the time time being, but maybe it should be removed at a later time.

I'm definitely open to configurable vectors of invertible transformers, and you can break the current API around compression, but please retain the ability for users to easily specify compression while using the zstd build flag in another way if you do break it. Let me know if you'd like any guidance on the implementation! It seems like you have a good idea to begin with :)

davidrusu commented 6 years ago

Great, I'll get to work on an initial implementation that you can comment on!

davidrusu commented 6 years ago

Update: I've been distracted with a refactor/rewrite on another project, I'll be focusing on this work next.

spacejam commented 6 years ago

@davidrusu take your time! I'm doing some pretty heavy-duty architecture rethinking in order to support transactions, watch semantics, and efficient incremental checkpointing more cleanly. So, this whole house could be about to get mixed up a bit! But the code that touches this functionality should not be too effected by all this stuff flying around, so I don't think it will cause too many merge conflicts or anything for you. If it does I'll help deal with them!

StevenDoesStuffs commented 4 years ago

I was thinking maybe instead of taking a path, the config or Db could take some struct that can read, write, and seek instead? That way it can be completely abstracted away from this library.