luketpeterson / fuzzy_rocks

A persistent datastore backed by RocksDB with fuzzy key lookup using an arbitrary distance function accelerated by the SymSpell algorithm
Apache License 2.0
11 stars 3 forks source link

Cannot deserialize values due to bincode constraints #3

Closed jannisbecker closed 2 months ago

jannisbecker commented 7 months ago

Hi there. According to https://github.com/bincode-org/bincode/issues/548 and https://docs.rs/bincode/2.0.0-rc.3/bincode/serde/index.html#known-issues, bincode doesn't support several of serde's derive macros, some of them have pretty common use. Unfortunately, this makes me unable to use fuzzy_rocks with my data.

Is there a way to use a different serde data format that does support serde metadata (and hence all of serde's functionality?)

luketpeterson commented 7 months ago

Hello fellow Tokyo resident. One option is to use some indirection - ie. use a ValueT in the table that can be deserialized with Bincode (like a guid), which is then the key to access the data you want in another table or datastore, encoded differently. Admittedly that's not ideal.

Do you have a preference on an alternative format? I found this: https://github.com/djkoloski/rust_serialization_benchmark And it looks like postcard turns in respectable performance & size, and has good support. I didn't look closely enough to see if it support the metadata you need.

I don't think I'll have time to look into this for a couple weeks, but if you raise a PR, I will merge it. (Just make it a Cargo feature, so people have the option of keeping compatibility with their bincode-formatted tables)

BTW: do you ever go to the Tokyo Rust meetups? https://www.meetup.com/tokyo-rust-meetup/

jannisbecker commented 7 months ago

Hi! Unfortunately I dont live in Tokyo at this time, had to go back due to expiring visa 😭 I can't say I have a preference, I'm not too versed with the different formats either, but I would fork the code and try out one or two. Porting to a different format shouldn't be all that hard anyway, right?

A friend recommended me messagepack in particular, I'll test it out and see if it works (fast enough as well). Since it's a json-like representation, it would support all the serde features like json does, too.

I'll check out the rust meetups once I'm back in Japan, thanks!

jannisbecker commented 7 months ago

Would you prefer if I make a complete replacement of all uses of bincode for messagepack or is it enough if I just swap the de/ser parts for where the ValueT is involved? I think the letter is what I'll do for testing for now, and if it works well, we can port the rest too (but that's just for cleanness and removing one dependency's sake)

luketpeterson commented 7 months ago

Porting to a different format shouldn't be all that hard anyway, right?

I don't expect it will be bad.

There are a few places where I actually peek into the raw bytes, rather than doing a full deserialization. Basically everywhere I call anything in https://github.com/luketpeterson/fuzzy_rocks/blob/main/src/bincode_helpers.rs That being said, I don't remember how much of a difference it actually makes to the bottom line, so you might just run the deserializer for simplicity. When I wrote this 2 years ago my use case was incredibly performance-sensitive, but if you just need something reasonable then I don't think it's totally necessary to go to these lengths.

A friend recommended me messagepack

That sounds like a reasonable choice to me if you think it satisfies the other needs.

According to the benchmarks, both msgpacker and rmp-serde are within a factor of 1.5 of Bincode, and a factor of 2 (except at serialization) of the best formats in the benchmark. And the format seems to be well supported and documented.

As to which MessagePack crate to use, the benchmarks say the performance is in the same league, with msgpacker winning by a tiny bit on deserialization. But the difference isn't huge on any benchmark. So I think it comes down to the fact that rmp-serde is integrated with serde (and thus depends on it) and msgpacker isn't (and therefore doesn't). I don't have an opinion myself. People complain serde is slow to compile, but it's tiny compared to the time taken to compile RocksDB.

Would you prefer if I make a complete replacement of all uses of bincode for messagepack or is it enough if I just swap the de/ser parts for where the ValueT is involved?

My feeling is that it probably makes sense to use one format or the other. Like you said keeping the dependencies minimal. But it's not a strong opinion on my part, just a feeling.

Thanks.

luketpeterson commented 7 months ago

Looking a little closer, a lot of people just use the rmp crate directly, without depending on serde at all. So perhaps this is the best-supported option with 12M+ downloads.

luketpeterson commented 5 months ago

@jannisbecker With https://crates.io/crates/fuzzy_rocks/0.3.0 I added the ability to provide a custom coder object, and I implemented a coder for MessagePack.

In practice, BinCode is still 80% faster and the compressed database is only 10% bigger, so I left it as the default, but you can use the msgpack feature to enable MessagePack encoding. Just be careful not to build a table with one encoding and search it with another.

Let me know if this works or if you have any other ideas to improve it.