RUST-1132 Implement `DeserializeSeed` for owned and borrowed raw documents

isabelatkinson commented 9 months ago

This PR adds a new SeededVisitor type to deserialize into raw documents/arrays using a single buffer. There are no tests added as the existing corpus/serde tests already provide coverage for deserializing into these types. I did update the test runners in this library and in the driver (for which I will make a separate PR) to deserialize test JSON into raw documents for good measure, which, in addition to adding a bit more testing of this implementation, should be a small performance improvement for spec tests.

Some very basic benchmarking shows a 50% speed improvement when deserializing a JSON object with 80 layers of object nesting into a RawDocumentBuf. I will follow up with some more thorough benchmarking.

EDIT: Further benchmarking against the deep_bson.json and flat_bson.json files generated for the BSON microbenchmarks consistently shows that this new implementation is about three times faster than the existing implementation when deserializing those files into a RawDocumentBuf.

isabelatkinson commented 9 months ago

I've done some investigation into merging the visitors (which I think is a good idea), but I've run into a few problems that make this more complicated than expected. The crux of the issue is that the RawBsonVisitor and SeededVisitor are doing slightly different things: the former can take any value that can be turned into a RawBsonRef variant and represent it as such, but the latter (as currently designed) is only intended to visit maps and sequences at the top-level and treats any other type as a nested value. This difference can be bridged in most cases, but borrowed values present a challenge because of the metadata required in their byte representation within a raw document.

Take, for example, an attempt to deserialize a borrowed string into RawBsonRef. The RawBsonVisitor implementation has no problem doing this because it just creates a RawBsonRef::String with a reference to the string upon the call to visit_borrowed_str. However, if the SeededVisitor is used instead, a call to visit_borrowed_str will append length bytes and a null-terminator byte, which requires converting the string into an owned value.

This is definitely not an intractable problem. The SeededVisitor could differentiate between top-level and nested values for variants that can borrow data (as it currently does for binary) and return bytes/byte slices as such, and then the byte parser that interprets the bytes as RawBson/RawBsonRef could have branching logic based on whether the given value is top-level or nested within a document. My instinct here, though, is that implementing all of that special logic (plus the indirection we'd be adding by needing to go through bytes) isn't much of an improvement on the duplicated visitor situation that exists in this PR as-is. I'm also generally hesitant to add more logic in the BSON library that constructs or picks apart bytes for the sake of correctness and maintainability.

Wanted to sanity check here before sinking another day or two into this work, which has already taken me longer than expected 🙂 @abr-egn WDYT about not unifying the visitors right now and possibly doing so in the future as part of a larger cleanup of this library's internals?

abr-egn commented 9 months ago

Wanted to sanity check here before sinking another day or two into this work, which has already taken me longer than expected 🙂 @abr-egn WDYT about not unifying the visitors right now and possibly doing so in the future as part of a larger cleanup of this library's internals?

That makes sense to me. Thank you for taking the time to investigate this path!

mongodb / bson-rust

RUST-1132 Implement `DeserializeSeed` for owned and borrowed raw documents #433