mozilla / rkv

A simple, humane, typed key-value storage solution.
https://crates.io/crates/rkv
Apache License 2.0
310 stars 53 forks source link

Potential performance issue: using serde's generic sequence (de)serialization instead of serde's `bytes` support #204

Open thomcc opened 4 years ago

thomcc commented 4 years ago

I had been intending at one point to get y'all a PR for this as a way to dip my toes into rkv, which is a cool project. That said, recent events have made this unlikely, so I at least wanted to let you know of the issue.

Loading or persisting an instance of the Safe backend ultimately involves reading/writing instances of https://github.com/mozilla/rkv/blob/master/src/backend/impl_safe/snapshot.rs#L26-L36 and some other stuff to disk using bincode.

There's a performance issue here: In general, in serde serializing Vec<u8>, Box<[u8]> and &[u8] is surprisingly inefficient. This is because it can't specialize these for u8, and has to serialize the whole sequence generically. Serde actually has support for doing the efficient thing here, but you have to opt-in by accepting various visit_bytes methods on deserializers.

For example, by default serde will individually deserialize every byte like this:

fn visit_seq<V: SeqAccess<'de>>(self, mut visitor: V) -> Result<Buffer, V::Error> {
    let mut v = Vec::with_capacity(cautious::size_hint(visitor.size_hint());
    while let Some(b) = visitor.next_element()? {
        v.push(b);
    }
    Ok(Buffer(v.into()))
}

But they bytes visitors accept the whole chunk in one go, like this:

fn visit_bytes<E: de::Error>(self, v: &[u8]) -> Result<Buffer, E> {
    Ok(Buffer(v.into()))
}
fn visit_byte_buf<E: de::Error>(self, v: Vec<u8>) -> Result<Buffer, E> {
    Ok(Buffer(v.into()))
}

Which deserializes all bytes at once. Note that there's a helpful crate that provides wrappers that do this in part for you: https://github.com/serde-rs/bytes . It uses Vec<u8> for its owned type, and so if the distinction matters to you, you'll need to copy its impl and use Box<[u8]> instead. Its not much code..

It should be fully compatible with what you're doing now, only faster, but obviously you should test. My experience is that changing bincode code to use serde_bytes does not seem to impact the serialized format at all.