simd-lite / simd-json

Rust port of simdjson
Apache License 2.0
1.11k stars 82 forks source link

Consider using IndexMap: json order is random after a 33 keys in map #378

Open ritchie46 opened 2 months ago

ritchie46 commented 2 months ago

This python snippet wouldn't round trip if simd-json was used:

import json

obj = {c: None for c in "abcdefghijklmnopqrstuvwxyz1234567"}
s = json.dumps(obj)
assert json.loads(s) == obj

This is because Object uses a hashmap which doesn't preserve the order of the json string keys. This can be resolved by using I would like to consider using indexmap, which has excellent performance, AND guarantees index order.

This issue was discovered in: https://github.com/pola-rs/polars/issues/17439

If this is accepted, I am willing to make a PR.

Licenser commented 2 months ago

Hi ritchie,

by specification json objects are unordered sets but that also doesn't mean it has to be unordered. the order for the first 32 keys is an implementation detail. We use halfbrown that uses a simple vector for up to 32 elements which makes small objects, which are frequent with json, extremely fast and small (memory wise).

That's just for background :). Generally if we can make it optional and preserve the current behaviour I'm in, giving people more options is nice!

A feature to watch out for is known-key, it is quite powerful when using lookups of the same key repeatedly and it'd be good to preserve it

I could see a few ways to make this work:

1) a feature flag, probably the easiest but there is some chance of conflicts with other flags 2) a feature flag in halfbrown, same as above but it limits the interface change to the library for hashmaps that simd-json already uses 3) make the map type a generic - this will require more work and generally I like it but OTOH I'm not sure how well that would play out or how much work it would be 4) same as abovbe but inside halfbrown

ritchie46 commented 2 months ago

Ah, given that you also maintain halfbrown, maybe doing that upstream is nicest. Would a halfbrown based on IndexMap be worth it?

For me a feature flag would be fine. So that would be option 2. :eyes:

Licenser commented 2 months ago

perfect, sounds like we got a plan then :)!

Also #377 might be of interest for you, it keeps order as long as no mutations are done already (and it should "just work" with index map as well)

ritchie46 commented 2 months ago

Great! Thanks for being receptive to this. :)

For #377, I need a little bit more context. :sweat_smile: What does it do? What is the semantic difference between a borrowed value and a tape?

Licenser commented 2 months ago

the borrowed value is a fully mutable DOM equivalent to serde_json::Value; the tape value is a flat structire (a single vec) that represents the JSON document, since it has no nesting and doesn't require extra allocations, inserts, etc it is significantly faster.

the downside is that the tape is not easily mutable; the lazy value bridges the gab between the two :)