jcrist / msgspec

A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML
https://jcristharif.com/msgspec/
BSD 3-Clause "New" or "Revised" License
2.01k stars 59 forks source link

encode sort_keys argument #609

Closed evgenii-moriakhin closed 5 months ago

evgenii-moriakhin commented 6 months ago

Question

Hi I couldn't find any documentation on open\closed issues either

Is there any possibility to sort keys during encode operation? Like json.dumps(..., sort_keys=True)

jcrist commented 6 months ago

No sort_keys option currently exists. Can you say more about your use case for one?

dhirschfeld commented 6 months ago

I assume it's to make the output more human-readable.

My view is that msgspec output is designed to be machine-readable and in my own use-cases I use rich.print_json to pretty-print JSON (msgspec output), and optionally sort-keys, for human consumption e.g. output to console.

mishamsk commented 6 months ago

I can provide a use-case: we do a lot of snapshot testing, comparing json serialized AST tree’s. Since we frequently change the tree structure and often have to manually review diffs, we sort the keys before dumping the output. It is pretty easy to sort the keys after the fact, but obviously slower. Although, to be fair, not a big deal since it is only important in testing.

evgenii-moriakhin commented 6 months ago

My purpose for using key sorting, and why I asked, was to make a smooth transition in my work project from the standard Flask json serializer (based on the built-in json library) to msgspec. The Flask developers have made the default key sorting - https://github.com/pallets/flask/blob/main/src/flask/json/provider.py#L149

and I would like all our API users and testers not to notice the transition

but I don't find this feature mandatory. It would be great only as compatibility with json/ujson/orjson that have key sorting

noudin-ledger commented 6 months ago

I have exactly the same case as @evgenii-moriakhin, I'm trying to integrate msgspec into Flask applications. It is also useful for caching, enforcing the key orders can allow to have a quicker cache hit and improve the hit/miss ratio

jcrist commented 5 months ago

Update - I have a working implementation of this, only need to figure out a nice spelling.

In summary - we expose 3 levels of sorting:

  1. No Sorting. The default. Structured objects (structs/dataclasses/...) encode in the order of their defined fields, unordered collections (dicts, sets, ...) encode in the order of their current in-memory representation.
  2. Sort unordered collections alone. This mode ensures reproducible results across invocations by ordering items in semantically unordered collections (dicts and sets). This compromises a bit on speed, but ensures consistent output. This can be used for caching or hashing the output purposes since it's consistent, but not all keys are ordered for human-readable purposes.
  3. Fully sorted. Besides sorting dicts and sets, we also sort the field orders of all structured objects. This compromises more on speed, but ensures the output JSON has fields sorted alphabetically for all JSON object types.

The question is - what to name this kwarg and options?

For kwarg name, I don't want to use sort_keys since we're sorting more than keys, but do want "sort" to appear in the name. Right now I'm waffling between sort and sort_mode.

For values, I think I want False and True to correspond to 1 & 2 respectively above. I'm not sure what to call 3. "all"? "full"? "canonical"?

The reason I want True to not map to 3 (sort everything) is that I suspect most users will just pass in True, and for most use cases a consistent (but not necessarily fully sorted) output will suffice. Trying to make the most performant sorting option also the one people use. For prior art, this also matches what go's existing json library does.

# Example of potential kwarg naming and semantics
msgspec.json.encode(obj)  # defaults to no sorting
msgspec.json.encode(obj, sort=False)  # no sorting
msgspec.json.encode(obj, sort=True)  # only sorts dicts and sets, Structs remain in field order
msgspec.json.encode(obj, sort="canonical")  # sorts sets, dicts, and any object keys

Thoughts?

evgenii-moriakhin commented 5 months ago

I don't like the ability to pass both a boolean type and a string; from an API friendliness standpoint, it seems quite ambiguous.

At the same time, what alternatives can be suggested? The only things that come to mind are Enums or option flags like in https://github.com/ijl/orjson#option.

Neither of these two approaches is KISS-like, but they are more explicit (and more meaningful naming can be chosen), which, in my opinion, corresponds to the Zen of Python.

It is also worth considering whether there are plans for other serialization options in msgspec in the future? (Again, see example https://github.com/ijl/orjson#option).

In this case, the flags option seems the most advantageous, as it allows for expressing explicitness, meaningful naming, and the ability to extend serialization options in the future. This configuration approach with bitwise flags is also used in other popular libraries used in Python (not just orjson) - for example, libvirt.

mishamsk commented 5 months ago

I’d vote strongly in favor of an enum. Afaik this is much more widespread [in Python ecosystem] and explicit than binary flags like orjson uses. Flags & bits are more domain of lower level language like C. Unless you plan to support dict/set and structured objects sorting independently. Then flag it is.

P.S. curious why sorting structured types is a bigger performance impact than dicts/sets. Thought that for structured types it is a “compile” (encoder build) time impact, while dicts/sets it is an actual item sort.

jcrist commented 5 months ago

I personally don't like using bitwise flags as they don't feel pythonic, and am against the usage of enums (or enums alone) to avoid the need for additional imports just to set a config. There's lots of precedence in python for usage of a fixed set of literals to configure an option (hence the existence of typing.Literal), and we're already using that in msgspec in several places. See the uuid_format here for example.

P.S. curious why sorting structured types is a bigger performance impact than dicts/sets. Thought that for structured types it is a “compile” (encoder build) time impact, while dicts/sets it is an actual item sort.

The implementation I have currently doesn't do the sorting at compile-time because I didn't want to bother with it yet and assumed any code paths requiring sorting would be less performance oriented than the default no-sorting path. We could optimize this path further with structured types (easiest for msgspec.Struct types, a bit of a refactor for attrs/dataclasses needed before handling it there). That said, my assumption here is that structured types are best represented in the fixed order of their fields, and sorting is really only needed for unordered collections like dict/set. Making the most intuitive config option only affect these types felt right to me. It also matches what golang does (structs are encoded in field order, maps are sorted before encoding).

evgenii-moriakhin commented 5 months ago

There's lots of precedence in python for usage of a fixed set of literals to configure an option (hence the existence of typing.Literal), and we're already using that in msgspec in several places. See the uuid_format here for example.

However, you suggested using boolean types. Are there any reasons to use True/False for values instead of string literals? It could be useful if the argument was called sort_keys, but the name is being changed anyway.

jcrist commented 5 months ago

Sure, but boolean values are valid in literal types as well. This would be typed something like:

def encode(..., sort: Literal[True, False, "canonical"]=False): ...

I'm not attached to using True/False if there are some short string literals that make more sense. The implementation is essentially done, just need to do the hardest part, naming things :).

mishamsk commented 5 months ago

@jcrist

The implementation I have currently doesn't do the sorting at compile-time because I didn't want to bother with it yet and assumed any code paths requiring sorting would be less performance oriented than the default no-sorting path.

Got it, thanks for clarification. As I’ve posted, in my use case it is not performance critical indeed.

Rgd enums vs. literals. If you want zero imports, I would then vote for two independent kwargs, e.g. sort_collections and sort_dataclass_like (or maybe even sort_structs).

Not sure if a union of boolean & string is pythonic. Since you are not evolving an existing API, but creating a new one from scratch, there is nothing that forces you to mix types like this. What if this is stored as an app config - deserializing a config that may be a string and boolean may be ambiguous itself (depending on the format of course)

dhirschfeld commented 5 months ago

Would having two arguments work?

def encode(..., sort: bool, sort_options: Literal["collections", "all"]='collections'): ...

I'm not sure about the naming of the sort_options arguments but canonical doesn't seem quite right either (though maybe there are no good options for a descriptive name for unordered collections and struct fields!)


Edit: Just saw the above post 🤦 I guess it's slightly different in that there is only one sort argument and the second argument controls how the sorting is performed.

If you wanted control to sort struct fields independently of collections the sort_options argument could be easily extended to Literal["collections", "fields", "all"]

evgenii-moriakhin commented 5 months ago

I think splitting the sorting capability into separate options is excessive, and overall I like the approach with string literals without boolean values

sort: Literal["no", "exclude_structs", "canonical"] (as an example, the correct naming is what we are discussing right now).

it solves the problems of explicit and meaningful naming

jcrist commented 5 months ago

What if this is stored as an app config - deserializing a config that may be a string and boolean may be ambiguous itself (depending on the format of course)

I don't find this argument convincing - if we cared about this we'd only ever pass in strings as config variables to any function. A limited type system in some legacy config format shouldn't limit the options we can use in python.

I think splitting the sorting capability into separate options is excessive

I also would really prefer to keep this configured via one keyword argument. Beyond limiting the number of options a user needs to understand, there's no real use case for the case of sort_dataclass_like=True, sort_collections=False. The use cases are:

A few additional kwarg ideas:

# def encode(..., order: Literal[None, "deterministic", "sorted"] = None):
# could also call the kwarg `ordering`
encode(obj, order=None)   # could also call this "unordered", but I prefer None
encode(obj, order="deterministic")
encode(obj, order="sorted")

# def encode(..., sort_level: Literal[0, 1, 2] = 0):
# treats sorting as a knob to turn up. I don't really like this one.
encode(obj, sort_level=0)
encode(obj, sort_level=1)
encode(obj, sort_level=2)

Right now I'm leaning towards the former. I agree it reads better than the sort kwarg I originally proposed above, and more explicitly describes the use cases for each mode. Also, by not explicitly describing what's being sorted in each case it leaves it open to add additional normalization steps as needed for different protocols to achieve a deterministic output.

mishamsk commented 5 months ago
# def encode(..., order: Literal[None, "deterministic", "sorted"] = None):
# could also call the kwarg `ordering`
encode(obj, order=None)   # could also call this "unordered", but I prefer None
encode(obj, order="deterministic")
encode(obj, order="sorted")

I like this one