jcrist / msgspec

A fast serialization and validation library, with builtin support for JSON, MessagePack, YAML, and TOML
https://jcristharif.com/msgspec/
BSD 3-Clause "New" or "Revised" License

Extending the built-in type set with (a) tagged union(s) isn't supported? #140

Closed goodboy closed 1 year ago

goodboy commented 2 years ago

My use case: handling an IPC stream of arbitrary object messages, specifically with msgpack. I desire to use Structs for custom object serializations that can be passed between memory boundaries.


My presumptions were originally:


Conclusions

Based on below thread:

This took me a (little) while to figure out because the docs didn't have an example for this use case, but if you want to create a Decoder that will handle a Union of tagged structs and it will still also process the standard built-in type set, you need to specify the subset of the std types that don't conflict with Struct as per @jcrist's comment in the section starting with:

This is not possible, for the same reason as presented above. msgspec forbids ambiguity.

So Decoder(Any | MyStructType) will not work.

I had to dig into the source code to figure this out and it's probably worth documenting this case for users?


Alternative solutions:

It seems there is no built-in way to handle an arbitrary serialization encode-stream that you wish to decode into the default set as well as be able to decode embedded tagged Struct types.

But, you can do a couple other things inside custom codec routines to try and accomplish this:

jcrist commented 2 years ago

I'm sorry, I'm not sure I understand this issue? What are you trying to do here?

but if you want to create a Decoder that will handle a Union of tagged structs and it will still also process the standard built-in type set, you need to do something like

Note that Decoder(Any | anything) is equal to Decoder(Any) (which is the same as Decoder() or the default msgspec.json.decode(...)) - only the default types will be decoded.

In [1]: import msgspec

In [2]: class Point(msgspec.Struct):
   ...:     x: int
   ...:     y: int
   ...: 

In [3]: msg = b'{"x": 1, "y": 2}'

In [4]: msgspec.json.decode(msg, type=Point)  # returns a Point object
Out[4]: Point(x=1, y=2)

In [5]: from typing import Union, Any

In [6]: msgspec.json.decode(msg, type=Union[Point, Any])  # returns a raw dict, same as the default below
Out[6]: {'x': 1, 'y': 2}

In [7]: msgspec.json.decode(msg)
Out[7]: {'x': 1, 'y': 2}

goodboy commented 2 years ago

@jcrist sorry if i'm not being clear.

It's your first case if I'm not mistaken:

msgspec.json.decode(msg, type=Point)

This will not decode, for example, tuple (or any other default built-in python type) like the default version:

[ins] In [7]:  msgspec.json.Decoder(Point).decode(msgspec.json.encode((1, 2)))
---------------------------------------------------------------------------
DecodeError                               Traceback (most recent call last)
Input In [7], in <cell line: 1>()
----> 1 msgspec.json.Decoder(Point).decode(msgspec.json.encode((1, 2)))

DecodeError: Expected `object`, got `array`

However, if you do the union including list, then it works:

[ins] In [9]: from typing import Any

[nav] In [10]:  msgspec.json.Decoder(Point | list).decode(msgspec.json.encode((1, 2)))
Out[10]: [1, 2]

hopefully that's clearer in terms of what i was trying to describe 😂


UPDATE: The more explicit desire i have is detailed in responses below.

goodboy commented 2 years ago

Ahh I also see what you mean now, which I didn't anticipate:

[nav] In [13]: class Point(msgspec.Struct, tag=True):
          ...:     x: int
          ...:     y: int

[nav] In [14]:  msgspec.json.Decoder(Point | Any).decode(msgspec.json.encode(Point(1, 2)))
Out[14]: {'type': 'Point', 'x': 1, 'y': 2}

[nav] In [15]:  msgspec.json.Decoder(Point).decode(msgspec.json.encode(Point(1, 2)))
Out[15]: Point(x=1, y=2)

That actually is super non-useful to me; i would expect the Point decode to still work no? Is there no way to create a decoder that will decode built-ins as well as custom tagged (union) structs?

Like, does a struct always have to be the outer most decoding type?

I was actually going to create another issue saying something like embedded structs don't decode (like say a Struct inside a dict) but I'm seeing now that it's actually this limitation that's the real issue?

goodboy commented 2 years ago

I'm sorry, I'm not sure I understand this issue? What are you trying to do here?

More details of what I'm doing:

Maybe I'm having a dreadful misconception about all this 😂

goodboy commented 2 years ago

Ahh, so I see why this feels odd: it seems the limitation is really due to the use of typing.Union?

from typing import Union

import msgspec
from msgspec import Struct

class Point(Struct, tag=True):
    x: float
    arr: list[int]

msgspec.json.Decoder(
    Union[Point] | list
).decode(msgspec.json.encode(Point(1, [2])))

Works just fine, but if you try Union[Point] | list | set or Union[Point] | dict is where you run into problems.. TypeError raised by the Union, union 😂

What's more odd to me is that you can support Structs that contain dicts but not the other way around with tagged structs? Seems to me it should be possible to support a dict[str, Struct] where Struct is tagged?

goodboy commented 2 years ago

Ahh so one way to maybe do what I'd like is to use Raw inside of some top level "god" message type?

This python code i think replicates what I thought was going to be the default behavior with tagged union structs:

from contextlib import contextmanager as cm
from typing import Union, Any, Optional

from msgspec import Struct, Raw
from msgspec.msgpack import Decoder, Encoder

class Header(Struct, tag=True):
    uid: str
    msgtype: Optional[str] = None

class Msg(Struct, tag=True):
    header: Header
    payload: Raw

class Point(Struct, tag=True):
    x: float
    y: float

_root_dec = Decoder(Msg)
_root_enc = Encoder()

# sub-decoders for retrieving embedded
# payload data and decoding to a sender
# side defined (struct) type.
_decs:  dict[Optional[str], Decoder] = {
    None: Decoder(Any),
}

@cm
def init(msg_subtypes: list[list[type[Struct]]]):
    tags: list[str] = []
    decs: list[Decoder] = []

    for types in msg_subtypes:
        # create a tagged union decoder for this type set
        type_union = Union[tuple(types)]
        dec = Decoder(type_union)
        decs.append(dec)

        # register every tag of this union under its sub-decoder, using
        # the default tag_field of "type", whose value defaults to the
        # class name.
        for typ in types:
            _decs[typ.__name__] = dec
            tags.append(typ.__name__)

    # a context manager generator must yield exactly once, so yield
    # after all sub-decoders are registered rather than inside the loop
    try:
        yield decs
    finally:
        for tag in tags:
            _decs.pop(tag)

def decmsg(buf: bytes) -> Any:
    msg = _root_dec.decode(buf)
    tag_field = msg.header.msgtype
    dec = _decs[tag_field]
    return dec.decode(msg.payload)

def encmsg(payload: Any) -> bytes:

    tag_field = None

    plbytes = _root_enc.encode(payload)
    if b'type' in plbytes:
        assert isinstance(payload, Struct)
        tag_field = type(payload).__name__
        payload = Raw(plbytes)

    msg = Msg(Header('mymsg', tag_field), payload)
    return _root_enc.encode(msg)

if __name__ == '__main__':
    with init([[Point]]):

        # arbitrary struct payload case
        send = Point(0, 1)
        rx = decmsg(encmsg(send))
        assert send == rx

        # arbitrary dict payload case
        send = {'x': 0, 'y': 1}
        rx = decmsg(encmsg(send))
        assert send == rx

I guess for me the more ideal default would be that the/some standard decoder is capable of handling tagged union structs and the built-in type set (which I've probably emphasized ad nauseam at this point 😂). So for example I could still do my Msg.header business to explicitly limit which message types are allowed in my IPC protocol, but also be able to create a decoder that can (recursively) unwrap embedded structs when needed, instead of trying to do it myself in python.

But, as a (short term) solution I guess the above could be a way to get what I want?

The even more ideal case for me would be that you could embed tagged structs inside other std container data types (dict, list, etc.) and then as an option, a (default) tagged struct + built-ins decoder would be available to just take care of decoding everything automatically in some arbitrary serialization object-frame when needed.

goodboy commented 2 years ago

Heh, actually the more I think about this context-oriented msg type decoding policy, the more i like it. This kind of thing would play super well with structured concurrency.


msg: Msg

with open_msg_context(
    types=[IOTStatustMsg, CmdControlMsg],
    capability_uuid='sd0-a98sdf-9a0ssdf'
) as decoder:

    # this will simply log an error on non-enabled payload msg types
    payload = decoder.decode(msg)

jcrist commented 2 years ago

Sorry, there's a lot above, I'll try to respond to what I think are your current issues.

Like, does a struct always have to be the outer most decoding type?

What's more odd to me is that you can support Structs that contain dicts but not the other way around with tagged structs? Seems to me it should be possible to support a dict[str, Struct] where Struct is tagged?

This does work. All types are fully composable, there is no limitation in msgspec requiring structs be at the top level, or that structs can't be subtypes in containers. dict[str, SomeStructType] or dict[str, Union[Struct1, Struct2, ...]] fully work fine. If you have a reproducible example showing otherwise I'd be happy to take a look.

Works just fine, but if you try Union[Point] | list | set or Union[Point] | dict is where you run into problems.. TypeError raised by the Union

Side note - when posting comments referring to errors, it's helpful to include the full traceback so we're all on the same page. Right now I'm left guessing what you're seeing raising the type error.

First, there's no difference in encoding/decoding support between Unions of tagged structs and structs in general. Also, Union[SomeType] is always the same as just SomeType, no need for the extra union. So your simplified examples are:

import msgspec
from typing import Union

class Point(msgspec.Struct):
    x: int
    y: int

for typ in [Union[Point, list, set], Union[Point, dict], Union[int, list, dict]]:
    print(f"Trying a decoder for {typ}...")
    try:
        msgspec.json.Decoder(typ)
    except TypeError as exc:
        print(f"  Failed: {exc}")
    else:
        print("  Succeeded")

This outputs:

Trying a decoder for typing.Union[__main__.Point, list, set]...
  Failed: Type unions may not contain more than one array-like (list, set, tuple) type - type `typing.Union[__main__.Point, list, set]` is not supported
Trying a decoder for typing.Union[__main__.Point, dict]...
  Failed: Type unions may not contain both a Struct type and a dict type - type `typing.Union[__main__.Point, dict]` is not supported
Trying a decoder for typing.Union[int, list, dict]...
  Succeeded

Note that the error is coming from creating the msgspec.json.Decoder, not from creating the Union itself. The error messages are echoing the restrictions for unions that are described in the docs (https://jcristharif.com/msgspec/usage.html#union-optional).

In both cases the issue is that the union contains multiple Python types that all map to the same JSON type with no way to determine which one to decode into. Both list and set map to JSON arrays - we can't support both in a union since this would lead to ambiguity when decoding and a JSON array is encountered. Same for Point | dict - both python objects encode as JSON objects, there's no efficient way to determine which type to decode into. int | list | dict is fine, since each of these python types maps to a different JSON type. Tagged unions provide an efficient way to determine the type to decode at runtime, which is why only tagged structs can coexist within the same union.

I guess for me the more ideal default would be that the/some standard decoder is capable of handling tagged union structs and the built-in type set

This is not possible, for the same reason as presented above. msgspec forbids ambiguity. Say we try to support what you're asking, given the following schema:

import msgspec

from typing import Any

class Point(msgspec.Struct):
    x: int
    y: int

dec = msgspec.json.Decoder(Point | Any)  # right now this works, but ignores the struct completely since `Any` is present

Given a message like {"x": 1, "y": 2}, you might expect this to return a Point, since it matches the Point schema. But what if we get a message like {"x": 1, "y": 2.0}? Or {"x": 1}? These messages don't match Point, do we error? Or do we return a dict? What if the error is more nested down several sub-objects? Do we have to backtrack then decode into an Any type? And really, it's likely that the faulty messages are a type error in the sender, not a distinct new message that should be decoded separately. All of this is ambiguous, and can't be done efficiently, which is why msgspec forbids it.

I guess for me the more ideal default would be that the/some standard decoder is capable of handling tagged union structs and the built-in type set

Knowing nothing about what you're actually trying to achieve here, why not just define an extra Struct type in the union that can wrap Any.

import msgspec
from typing import Any, Union

class Msg(msgspec.Struct, tag=True):
    pass

class Msg1(Msg):
    x: int
    y: int

class Msg2(Msg):
    a: int
    b: int

class Custom(Msg):
    obj: Any

enc = msgspec.json.Encoder()
dec = msgspec.json.Decoder(Union[Msg1, Msg2, Custom])

def encode(obj: Any) -> bytes:
    if not isinstance(obj, Msg):
        obj = Custom(obj)
    return enc.encode(obj)

def decode(buf: bytes) -> Any:
    msg = dec.decode(buf)
    if isinstance(msg, Custom):
        return msg.obj
    return msg

buf_msg1 = encode(Msg1(1, 2))
print(buf_msg1)
print(decode(buf_msg1))

buf_custom = encode(["my", "custom", "message"])
print(buf_custom)
print(decode(buf_custom))

Output:

b'{"type":"Msg1","x":1,"y":2}'
Msg1(x=1, y=2)
b'{"type":"Custom","obj":["my","custom","message"]}'
['my', 'custom', 'message']

Note that the builtin message types (Msg1, Msg2) will only be decoded properly if they are top-level objects, since that's what the provided schema expects. Custom messages thus should only be composed of builtin types (dict/list/...) but can then be unambiguously handled. If you also want to handle e.g. lists of the above at the top level (or whatever) you could add that to the union provided to the decoder as well:

MsgTypes = Union[Msg1, Msg2, Custom]
# Decoder expects either one of the above msg types, or a list of the above msg types
decoder = msgspec.json.Decoder(Union[MsgTypes, list[MsgTypes]])

jcrist commented 2 years ago

In the future, large issue dumps like this that are rapidly updated are hard to follow as a maintainer. If you expect a concise and understanding response from me, please put in the effort to organize and present your thoughts in a cohesive manner. While the examples presented in this blogpost aren't 100% relevant for this repository, the general sentiment of "users should provide clear, concise, and reproducible examples of what their issue is" is relevant.

goodboy commented 2 years ago

If you expect a concise and understanding response from me, please put in the effort to organize and present your thoughts in a cohesive manner.

My apologies, I didn't know the root issue that I was seeing at the outset, which is why I've tried to update things as I've discovered more, both by using the lib and by seeing what's possible through tinkering.

Also a lot of this is just thinking out loud as a new user, my apologies if that's noisy, hopefully someone else will find it useful if they run into a similar issue.

The main issue I was confused by was mostly this (and i can move this to the top for reference if you want):

I do think making some examples of the case I'm describing would be super handy to have in the docs as maybe more of an advanced use case?

The general sentiment of "users should provide clear, concise, and reproducible examples of what their issue is" is relevant.

Totally, originally I thought this was a simple question and now I realize it's a lot more involved; I entirely mis-attributed the real problem to something entirely different, hence my original issue title being ill-informed 😂


In summary, my main issue was more or less addressed in your answer here, which is what I also concluded:

Note that the builtin message types (Msg1, Msg2) will only be decoded properly if they are top-level objects, since that's what the provided schema expects. Custom messages thus should only be composed of builtin types (dict/list/...) but can then be unambiguously handled. If you also want to handle e.g. lists of the above at the top level (or whatever) you could add that to the union provided to the decoder as well:

In other words you can't pass in an arbitrary dict[dict[dict, Struct]] and expect the embedded Struct to be decoded without defining the exact schema hierarchy (at least leading to the Struct field) ahead of time. So in some sense this also is similar to my question in #25.

Knowing nothing about what you're actually trying to achieve here, why not just define an extra Struct type in the union that can wrap Any.

So I think this is pretty similar to what i presented in the embedded Raw-payload example i put above. I was just originally looking for any tagged union struct, anywhere in the encoded data, to be automatically decoded no matter where it was situated in the composed data structure hierarchy.

So really, I guess what I am after now is some way to dynamically describe such schemas, maybe even during a struct-msg flow.

Again my apologies about this being noisy, not well formed, ill-attributed; I really just didn't know what the real problem was.

goodboy commented 2 years ago

@jcrist I updated the description to include the summary of everything, hopefully that makes all the noise, back and forth, more useful to onlookers 😎


To just finally summarize and answer all questions you left open for me:

Yes, this does work as long as you specify the schema ahead of time, but even still it's not clear to me how you would use some "top level" decoder to decode non-static-schema embedded Struct types. So you have to either know the schema or you have to create some dynamic decoding system as I showed in my longer example.


Agreed, I mis-attributed the error: msgspec.msgpack.Decoder(Struct | Any) works fine.


In both cases the issue is that the union contains multiple Python types that all map to the same JSON type with no way to determine which one to decode into. Both list and set map to JSON arrays - we can't support both in a union since this would lead to ambiguity when decoding and a JSON array is encountered. Same for Point | dict - both python objects encode as JSON objects, there's no efficient way to determine which type to decode into.

Agreed, but in the case of a tagged Struct this isn't true any more, right? You can check for the tag_field and decide whether it matches one in your struct registry, no?

Tagged unions provide an efficient way to determine the type to decode at runtime, which is why only tagged structs can coexist within the same union.

Ok so this sounds like what I'm asking for is supposed to work right?

This is not possible, for the same reason as presented above. msgspec forbids ambiguity. Say we try to support what you're asking, given the following schema:

But then you say it isn't and give an example with a non-tagged-Struct?

Knowing nothing about what you're actually trying to achieve here, why not just define an extra Struct type in the union that can wrap Any

Yes, this is more or less what I concluded except using Raw is more general and allows the default to be Any and using a header to specify custom/filtered struct types.

What if the error is more nested down several sub-objects? Do we have to backtrack then decode into an Any type? And really, it's likely that the faulty messages are a type error in the sender,

So i guess the problem here would be decode aliasing due to a tag field collision? I can see why you might want to just sidestep this entirely.

goodboy commented 2 years ago

What if the error is more nested down several sub-objects? Do we have to backtrack then decode into an Any type? And really, it's likely that the faulty messages are a type error in the sender,

So i guess the problem here would be decode aliasing due to a tag field collision? I can see why you might want to just sidestep this entirely.

Just as one final proposal: couldn't we presume that if you find a {"type": "CustomStruct"} it should be decoded to CustomStruct or error? And if it turns out the sender included a "type" (or whatever the tag_field is) key by mistake, then you just throw an equivalent error?

DecodeError(f"Can't decode {msgpack_obj} to type 'CustomStruct' did you send an object with '{tag_field}' set?")

And then the user will know either the serialized object is malformed or there is a collision they have to work around by changing the tag_field setting?
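The proposal above can be approximated in user code today. The following is a plain standard-library sketch, not a msgspec feature: it walks already-decoded builtin data and promotes any dict carrying a registered tag. REGISTRY, TAG_FIELD, TagDecodeError and promote are all hypothetical names, and a dataclass stands in for a tagged Struct (note that dataclasses do not validate field types).

```python
import json
from dataclasses import dataclass
from typing import Any

TAG_FIELD = "type"  # assumed tag key, mirroring the default tag_field


@dataclass
class Point:
    x: int
    y: int


# Hypothetical registry mapping tag values to classes.
REGISTRY = {"Point": Point}


class TagDecodeError(ValueError):
    """Raised when a tagged object does not fit its registered class."""


def promote(obj: Any) -> Any:
    """Recursively replace tagged dicts with instances from REGISTRY."""
    if isinstance(obj, list):
        return [promote(item) for item in obj]
    if isinstance(obj, dict):
        fields = {key: promote(value) for key, value in obj.items()}
        tag = fields.get(TAG_FIELD)
        if isinstance(tag, str) and tag in REGISTRY:
            fields.pop(TAG_FIELD)
            try:
                return REGISTRY[tag](**fields)
            except TypeError as exc:
                raise TagDecodeError(
                    f"Can't decode {obj!r} to type {tag!r}; did you send "
                    f"an object with {TAG_FIELD!r} set?"
                ) from exc
        return fields
    return obj


raw = '{"a": [{"type": "Point", "x": 1, "y": 2}], "b": {"x": 1}}'
decoded = promote(json.loads(raw))
print(decoded)
```

An untagged dict such as {"x": 1} is left alone, while a dict that carries a registered tag but the wrong fields raises TagDecodeError, which is the collision case discussed above. The cost is a second pass over the decoded data in Python.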