Closed isabelatkinson closed 8 months ago
I've done some investigation into merging the visitors (which I think is a good idea), but I've run into a few problems that make this more complicated than expected. The crux of the issue is that the RawBsonVisitor
and SeededVisitor
are doing slightly different things: the former can take any value that can be turned into a RawBsonRef
variant and represent it as such, but the latter (as currently designed) is only intended to visit maps and sequences at the top-level and treats any other type as a nested value. This difference can be bridged in most cases, but borrowed values present a challenge because of the metadata required in their byte representation within a raw document.
Take, for example, an attempt to deserialize a borrowed string into RawBsonRef
. The RawBsonVisitor
implementation has no problem doing this because it just creates a RawBsonRef::String
with a reference to the string upon the call to visit_borrowed_str
. However, if the SeededVisitor
is used instead, a call to visit_borrowed_str
will append length bytes and a null-terminator byte, which requires converting the string into an owned value.
This is definitely not an intractable problem. The SeededVisitor
could differentiate between top-level and nested values for variants that can borrow data (as it currently does for binary) and return bytes/byte slices as such, and then the byte parser that interprets the bytes as RawBson
/RawBsonRef
could have branching logic based on whether the given value is top-level or nested within a document. My instinct here, though, is that implementing all of that special logic (plus the indirection we'd be adding by needing to go through bytes) isn't much of an improvement on the duplicated visitor situation that exists in this PR as-is. I'm also generally hesitant to add more logic in the BSON library that constructs or picks apart bytes for the sake of correctness and maintainability.
Wanted to sanity check here before sinking another day or two into this work, which has already taken me longer than expected 🙂 @abr-egn WDYT about not unifying the visitors right now and possibly doing so in the future as part of a larger cleanup of this library's internals?
Wanted to sanity check here before sinking another day or two into this work, which has already taken me longer than expected 🙂 @abr-egn WDYT about not unifying the visitors right now and possibly doing so in the future as part of a larger cleanup of this library's internals?
That makes sense to me. Thank you for taking the time to investigate this path!
This PR adds a new
SeededVisitor
type to deserialize into raw documents/arrays using a single buffer. There are no tests added as the existing corpus/serde tests already provide coverage for deserializing into these types. I did update the test runners in this library and in the driver (for which I will make a separate PR) to deserialize test JSON into raw documents for good measure, which, in addition to adding a bit more testing of this implementation, should be a small performance improvement for spec tests.Some very basic benchmarking shows a 50% speed improvement when deserializing a JSON object with 80 layers of object nesting into a
RawDocumentBuf
. I will follow up with some more thorough benchmarking.EDIT: Further benchmarking against the
deep_bson.json
andflat_bson.json
files generated for the BSON microbenchmarks consistently shows that this new implementation is about three times faster than the existing implementation when deserializing those files into aRawDocumentBuf
.