sdleffler / qp-trie-rs

An idiomatic and fast QP-trie implementation in pure Rust.
Mozilla Public License 2.0
94 stars 24 forks source link

Introduce opt-in serialization through Serde #6

Closed tapeinosyne closed 7 years ago

tapeinosyne commented 7 years ago

Now that Rust has enshrined Serde 1.0 as the canonical serialization infrastructure, it's quite important for data structure crates to use it, as Serde support can seldom be implemented downstream.

This PR enables opt-in derivation of the Serialize and Deserialize traits through the serde feature flag. (The feature is named according to the interoperability guidelines from the language nursery.) The patch itself is nothing more than a few derive declarations hidden behind cfg attributes and sprinkled on relevant structs.

Although derived impls should be sound, it may still be desirable to test them; doing so would require tossing in a concrete serializer in the form of an optional dev dependency. I would be happy to amend the patch with such a test if the additional dependency is fine by you.

(On a unrelated note – and I apologize for interfering with repository housekeeping – there appears to be trailing whitespace in a few files. You may want to remove it, as trailing whitespace behaves poorly across different git and editor configurations, and often muddles diffs.)

sdleffler commented 7 years ago

Thanks for the PR! It'd be great if you could add some tests, I'm fine with the new dev dependency. I do have an additional concern, which is that QP-tries are fundamentally key-value map structures, and to my knowledge the Serde implementations for HashMap and BTreeMap don't expose the inner structure of the tree, instead serializing all keys/values into a list of pairs. Deserialization then takes the form of reading in a new map and inserting all the pairs. In addition, due to the way the QP-trie works internally, modifying the internal bitsets used for the sparse array format in their serialized form could cause a panic or worse the wrong results to be returned from get operations. If you'd be interested in implementing this, it should be trivial - there's a page on it here in their docs.

I was unaware of the trailing whitespace - it's definitely unintentional! Thanks for letting me know.

tapeinosyne commented 7 years ago

(Apologies for the delay.)

I switched the patch to pairwise serialization and added some tests. I was initially unpersuaded that this would prove necessary – partly because I had already written the tests – but after some thought I did find a case where the derived implementation could fail – or, rather, cause concrete serializers to fail.

The failure is somewhat orthogonal to the potential issues you describe. To address those:

@sdleffler […] modifying the internal bitsets used for the sparse array format in their serialized form could cause a panic or worse the wrong results to be returned from get operations.

No doubt, but a serializer that modifies data for storage and fails to recover the original value upon deserialization is, well, terribly behaved. It's not a serializer, it's a poltergeist!

the Serde implementations for HashMap and BTreeMap don't expose the inner structure of the tree, instead serializing all keys/values into a list of pairs […]

Although it is true that the standard library HashMap and BTreeMap are serialized by pair, to my knowledge that is not because they are associative rather than sequential, but because of visibility and structure: the internals are private to std, and not only are their layouts significantly more complex – what with buckets, tables, uninitialized fields, and so on – but they also rely on raw pointers, which are not part of the Serde data model.

My reading of the qp_trie code was admittedly superficial, but what structure I gleaned felt much simpler by comparison. Nevertheless, it is there that the issue lurks.

The types derived for would have comprised:

None require anything beyond derive. However, in our case, "sums and products of the above" means recursive structures. A trie with many branches can recurse deeply, and serializers are known to go knee-hapsed when met by deep recursion.

Now, consider a QP-trie containing keys with an increasingly long shared prefix:

k_1 = &[0];
k_2 = &[0, 0];
k_3 = &[0, 0, 0];
…
k_i = &[0; i];

Here, the keys cause enough branching that the end result will resemble not so much a trie as a thornbush. Already at i = 32, serde_json hits a recursion limit on deserialization, and no doubt other serializers could be similarly affected. (The recursion limit enforced by serde_json, at 128, is conservative but not unreasonable.)

So, a more indirect approach is indeed safer, if a bit defensive. The amended patch uses Serde's own serialize_map as an intermediary, and includes a test covering the failure described here.

To bring this to a conclusion, there remain a few matters to consider before merging:

sdleffler commented 7 years ago

Looks great! It'd be nice to have bincode in there as well - I wouldn't be opposed to having both serde_json and bincode tests. If you add the #[cfg(feature = "serde")] in there, though, I'll merge this now.

tapeinosyne commented 7 years ago

I'll include bincode directly. Would you be amenable to use bincode and Vec<u8> in quickcheck tests instead of serde_json and String, to cover a wider range of inputs? Also, would you prefer a less idiosyncratic name than "max_outdegree"?

sdleffler commented 7 years ago

How about "branching factor"? And yes, I would absolutely be amenable to wider test coverage - the wider the better. :P

tapeinosyne commented 7 years ago

Done. (On the tangent of test coverage, many of the existing tests only concern API boundaries and don't need visible internals. It might be nice to move them to a test folder, since lib.rs is getting rather crowded.)

sdleffler commented 7 years ago

That sounds like a good idea - I'll create a new issue w.r. that. Thanks again!