jltsiren / simple-sds

Simple succinct data structures (in Rust)
MIT License

Would you be willing to add support for `serde` compatibility? #12

Open theJasonFan opened 2 years ago

theJasonFan commented 2 years ago

Hi @jltsiren,

First thank you for implementing simple-sds!

Would you be willing to accept a PR to add compatibility with serde? I understand that compatibility with data formats such as JSON (via serde_json) would not make much sense, but bincode offers quite compact serialized representations. My thought here would be to use the #[serde(with = ...)] variant attributes and implement the appropriate interfaces so that simple-sds data structures work with derive in any struct that contains them.

I'm working on a project that uses simple-sds bit/int/raw vectors and have been using the `with` annotations + bincode to serialize data structures. For rank/select supported bit vectors, the thought would be to serialize only the bits and rebuild rank/select support at deserialization time.
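To make the bits-only idea concrete, here is a minimal std-only sketch. `RankedBits`, `words_for`, and the byte layout (a little-endian length followed by the words) are all hypothetical and are not simple-sds types or formats. The point is only that the rank index is derived data: it is skipped when writing and recomputed when reading.

```rust
/// Hypothetical rank-supported bit vector (not a simple-sds type).
#[derive(Debug, PartialEq)]
pub struct RankedBits {
    len: usize,            // number of valid bits
    words: Vec<u64>,       // the bits, 64 per word, lowest bit first
    block_ranks: Vec<u64>, // derived index: ones before each word, plus the total
}

/// Number of 64-bit words needed to hold `len` bits.
fn words_for(len: usize) -> usize {
    len / 64 + usize::from(len % 64 != 0)
}

impl RankedBits {
    /// Builds the structure from raw bits and computes the rank index.
    pub fn from_words(len: usize, words: Vec<u64>) -> Self {
        assert_eq!(words.len(), words_for(len), "word count must match len");
        let mut block_ranks = Vec::with_capacity(words.len() + 1);
        let mut ones = 0u64;
        for w in &words {
            block_ranks.push(ones);
            ones += u64::from(w.count_ones());
        }
        block_ranks.push(ones);
        RankedBits { len, words, block_ranks }
    }

    /// Number of set bits in positions `0..i`.
    pub fn rank(&self, i: usize) -> u64 {
        assert!(i <= self.len, "position out of range");
        let (word, offset) = (i / 64, i % 64);
        let base = self.block_ranks[word];
        if offset == 0 {
            return base;
        }
        let mask = (1u64 << offset) - 1;
        base + u64::from((self.words[word] & mask).count_ones())
    }

    /// Writes the length and the bits only; the rank index is left out.
    pub fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(8 + 8 * self.words.len());
        out.extend_from_slice(&(self.len as u64).to_le_bytes());
        for w in &self.words {
            out.extend_from_slice(&w.to_le_bytes());
        }
        out
    }

    /// Reads the bits back and rebuilds the rank index; `None` if malformed.
    pub fn from_bytes(bytes: &[u8]) -> Option<Self> {
        if bytes.len() < 8 {
            return None;
        }
        let (header, body) = bytes.split_at(8);
        let mut buf = [0u8; 8];
        buf.copy_from_slice(header);
        let len = u64::from_le_bytes(buf) as usize;
        if body.len() != words_for(len).checked_mul(8)? {
            return None;
        }
        let words = body
            .chunks_exact(8)
            .map(|chunk| {
                let mut b = [0u8; 8];
                b.copy_from_slice(chunk);
                u64::from_le_bytes(b)
            })
            .collect();
        Some(Self::from_words(len, words))
    }
}
```

With serde, the same two functions would presumably sit behind a `with` module or a manual `Serialize`/`Deserialize` impl. The trade-off is that loading costs a pass over the bits instead of a plain read.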

Thank you again for your work.

jltsiren commented 2 years ago

I'm not sure about this.

My own needs for serialization are roughly these:

As far as I understand, Serde follows a different philosophy. I've had bad experiences with adding features I don't use myself. In the long term, such features tend to stop working properly, especially if they involve conditional compilation. At the same time, they make the code more difficult to change and maintain.

Can Serde support be added without too much maintenance burden?

theJasonFan commented 2 years ago

Thank you for the quick response. A couple thoughts:

> Can Serde support be added without too much maintenance burden?

AFAIK, serde compatibility can be added to RawVector, BitVector, and IntVector with simple #[derive(Serialize, Deserialize)] annotations where the structs are defined. The only code added would be the annotations, and the actual serialization formats and protocols are offloaded to the serde ecosystem. I can take a closer look to see whether there are pain points with rank/select support.

> Fast and space-efficient serialization for multi-gigabyte structures

This would have to be benchmarked, but we have been using serde + bincode to serialize and deserialize multi-gigabyte sds vectors without issue; the serialized vectors are the same size as they are in memory.

> Interoperability... file formats do not change... simple memory-mapped files...

Off the top of my head, I do not think serde compatibility via #[derive(Serialize, Deserialize)] annotations would address these needs. However, my thought here is to add serde compatibility alongside the serialization formats you have implemented.

My overall thought is that adding serde compatibility alongside your serialization APIs makes development easy for downstream users like myself. If made compatible with serde, adding serialization/deserialization functionality to any struct that contains a `simple_sds::{RawVector, BitVector, IntVector}` would just be a one-line `#[derive(Serialize, Deserialize)]` annotation on the struct definition.

jltsiren commented 2 years ago

I was thinking about the maintenance burden if the in-memory data structures change. That would not be a breaking change from my point of view if the interfaces and the serialization formats remain the same. If you derive Serde serialization, any changes in in-memory structures would break compatibility with old files. I guess the question is how much effort it would take to implement serialization manually and always serialize the same data as when using my interface.

theJasonFan commented 2 years ago

Ah yes, that would be quite non-trivial --- I understand your concern now.

One possible, but inelegant, solution would be to implement "stable" structs that mirror the "in-memory" structs (which could change) and that implement serde::Serialize. The "stable" structs would be guaranteed (by convention) to not change, or to change extremely infrequently. Maintenance would then involve maintaining From trait implementations to map between the "stable" and "in-memory" structs.
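Roughly what I have in mind, as a std-only sketch. `PackedInts` and `StableInts` are made-up names and are not the actual simple-sds layouts. I believe serde's `from`/`into` container attributes could route a derived impl through the stable struct, but that part is left as a comment here and is untested.

```rust
/// Hypothetical in-memory layout; free to change between releases.
#[derive(Debug, Clone, PartialEq)]
pub struct PackedInts {
    width: usize,
    len: usize,
    words: Vec<u64>,
    mask: u64, // cached value derived from `width`; never serialized
}

impl PackedInts {
    /// Builds the in-memory struct and fills in the cached mask.
    pub fn from_parts(width: usize, len: usize, words: Vec<u64>) -> Self {
        let mask = if width >= 64 { u64::MAX } else { (1u64 << width) - 1 };
        PackedInts { width, len, words, mask }
    }

    pub fn mask(&self) -> u64 {
        self.mask
    }
}

/// Hypothetical stable mirror: the only shape a serializer would ever see.
/// Assumption: with serde, `#[derive(Serialize, Deserialize)]` would go on
/// this struct, and `#[serde(from = "StableInts", into = "StableInts")]`
/// on `PackedInts`.
#[derive(Debug, Clone, PartialEq)]
pub struct StableInts {
    pub width: u64,
    pub len: u64,
    pub words: Vec<u64>,
}

// These two impls are the maintenance cost: they must be updated whenever
// the in-memory layout changes, while `StableInts` stays fixed.
impl From<PackedInts> for StableInts {
    fn from(v: PackedInts) -> Self {
        StableInts { width: v.width as u64, len: v.len as u64, words: v.words }
    }
}

impl From<StableInts> for PackedInts {
    fn from(s: StableInts) -> Self {
        // Derived fields are recomputed rather than read from the file.
        PackedInts::from_parts(s.width as usize, s.len as usize, s.words)
    }
}
```

Adding or dropping a cached field like `mask` then only touches `from_parts` and the two `From` impls, not the serialized shape.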

Then again, serde::Serialize only exposes a data model and an interface that other serialization libraries drive. These libraries in the serde ecosystem could very well introduce backwards-incompatible changes in how compatible structs are serialized.