Closed goodboy closed 1 year ago
I'm sorry, I'm not sure I understand this issue? What are you trying to do here?
but if you want to create a Decoder that will handle a Union of tagged structs and it will still also process the standard built-in type set, you need to do something like
Note that Decoder(Any | anything) is equal to Decoder(Any) (which is the same as Decoder() or the default msgspec.json.decode(...)) - only the default types will be decoded.
In [1]: import msgspec
In [2]: class Point(msgspec.Struct):
   ...:     x: int
   ...:     y: int
   ...:
In [3]: msg = b'{"x": 1, "y": 2}'
In [4]: msgspec.json.decode(msg, type=Point) # returns a Point object
Out[4]: Point(x=1, y=2)
In [5]: from typing import Union, Any
In [6]: msgspec.json.decode(msg, type=Union[Point, Any]) # returns a raw dict, same as the default below
Out[6]: {'x': 1, 'y': 2}
In [7]: msgspec.json.decode(msg)
Out[7]: {'x': 1, 'y': 2}
@jcrist sorry if i'm not being clear.
It's your first case if I'm not mistaken:
msgspec.json.decode(msg, type=Point)
This will not decode, for example, tuple (or any other default built-in python type) like the default version:
[ins] In [7]: msgspec.json.Decoder(Point).decode(msgspec.json.encode((1, 2)))
---------------------------------------------------------------------------
DecodeError Traceback (most recent call last)
Input In [7], in <cell line: 1>()
----> 1 msgspec.json.Decoder(Point).decode(msgspec.json.encode((1, 2)))
DecodeError: Expected `object`, got `array`
However, if you do the union including list, then it works:
[ins] In [9]: from typing import Any
[nav] In [10]: msgspec.json.Decoder(Point | list).decode(msgspec.json.encode((1, 2)))
Out[10]: [1, 2]
hopefully that's clearer in terms of what i was trying to describe
UPDATE: The more explicit desire i have is detailed in responses below.
Ahh I also see what you mean now, which I didn't anticipate:
[nav] In [13]: class Point(msgspec.Struct, tag=True):
          ...:     x: int
          ...:     y: int
[nav] In [14]: msgspec.json.Decoder(Point | Any).decode(msgspec.json.encode(Point(1, 2)))
Out[14]: {'type': 'Point', 'x': 1, 'y': 2}
[nav] In [15]: msgspec.json.Decoder(Point).decode(msgspec.json.encode(Point(1, 2)))
Out[15]: Point(x=1, y=2)
That actually is super non-useful to me; i would expect the Point decode to still work no?
Is there no way to create a decoder that will decode built-ins as well as custom tagged (union) structs?
Like, does a struct always have to be the outer most decoding type?
I was actually going to create another issue saying something like embedded structs don't decode (like say a Struct inside a dict) but I'm seeing now that it's actually this limitation that's the real issue?
I'm sorry, I'm not sure I understand this issue? What are you trying to do here?
More details of what I'm doing:
Structs alongside other built-in types over a stream
Maybe I'm having a dreadful misconception about all this
Ahh so I think I see why this feels odd, it seems the limitation is really due to the use of typing.Union?
from typing import Union

import msgspec
from msgspec import Struct

class Point(Struct, tag=True):
    x: float
    arr: list[int]

msgspec.json.Decoder(
    Union[Point] | list
).decode(msgspec.json.encode(Point(1, [2])))
Works just fine, but if you try Union[Point] | list | set or Union[Point] | dict is where you run into problems.. TypeError raised by the Union, union
What's more odd to me is that you can support Structs that contain dicts but not the other way around with tagged structs? Seems to me it should be possible to support a dict[str, Struct] where Struct is tagged?
Ahh so one way to maybe do what I'd like is to use Raw inside of some top level "god" message type?
This python code i think replicates what I thought was going to be the default behavior with tagged union structs:
from contextlib import contextmanager as cm
from typing import Union, Any, Optional

from msgspec import Struct, Raw
from msgspec.msgpack import Decoder, Encoder


class Header(Struct, tag=True):
    uid: str
    msgtype: Optional[str] = None


class Msg(Struct, tag=True):
    header: Header
    payload: Raw


class Point(Struct, tag=True):
    x: float
    y: float


_root_dec = Decoder(Msg)
_root_enc = Encoder()

# sub-decoders for retrieving embedded
# payload data and decoding to a sender
# side defined (struct) type.
_decs: dict[Optional[str], Decoder] = {
    None: Decoder(Any),
}


@cm
def init(msg_subtypes: list[list[Struct]]):
    for types in msg_subtypes:
        first = types[0]

        # register using the default tag_field of "type"
        # which seems to map to the class "name".
        tags = [first.__name__]

        # create a tagged union decoder for this type set
        type_union = Union[first]
        for typ in types[1:]:
            type_union |= typ
            tags.append(typ.__name__)

        dec = Decoder(type_union)

        # register all tags for this union sub-decoder
        for tag in tags:
            _decs[tag] = dec

    try:
        yield dec
    finally:
        for tag in tags:
            _decs.pop(tag)


def decmsg(msg: Msg) -> Any:
    msg = _root_dec.decode(msg)
    tag_field = msg.header.msgtype
    dec = _decs[tag_field]
    return dec.decode(msg.payload)


def encmsg(payload: Any) -> Msg:
    tag_field = None

    plbytes = _root_enc.encode(payload)
    if b'type' in plbytes:
        assert isinstance(payload, Struct)
        tag_field = type(payload).__name__
        payload = Raw(plbytes)

    msg = Msg(Header('mymsg', tag_field), payload)
    return _root_enc.encode(msg)


if __name__ == '__main__':
    with init([[Point]]):

        # arbitrary struct payload case
        send = Point(0, 1)
        rx = decmsg(encmsg(send))
        assert send == rx

        # arbitrary dict payload case
        send = {'x': 0, 'y': 1}
        rx = decmsg(encmsg(send))
        assert send == rx
I guess for me the more ideal default would be that the/some standard decoder is capable of handling tagged union structs and the built-in type set (which I've probably emphasized ad nauseam at this point). So for example I could still do my Msg.header business to explicitly limit which message types are allowed in my IPC protocol, but also be able to create a decoder that can (recursively) unwrap embedded structs when needed, instead of trying to do it myself in python.
But, as a (short term) solution I guess the above could be a way to get what I want?
The even more ideal case for me would be that you could embed tagged structs inside other std container data types (dict, list, etc.) and then as an option, a (default) tagged struct + built-ins decoder would be available to just take care of decoding everything automatically in some arbitrary serialization object-frame when needed.
Heh, actually the more I think about this context-oriented msg type decoding policy, the more i like it. This kind of thing would play super well with structured concurrency.
msg: Msg

with open_msg_context(
    types=[IOTStatustMsg, CmdControlMsg],
    capability_uuid='sd0-a98sdf-9a0ssdf'
) as decoder:
    # this will simply log an error on non-enabled payload msg types
    payload = decoder.decode(msg)
Sorry, there's a lot above, I'll try to respond to what I think are your current issues.
Like, does a struct always have to be the outer most decoding type?
What's more odd to me is that you can support Structs that contain dicts but not the other way around with tagged structs? Seems to me it should be possible to support a dict[str, Struct] where Struct is tagged?
This does work. All types are fully composable, there is no limitation in msgspec requiring structs be at the top level, or that structs can't be subtypes in containers. dict[str, SomeStructType] or dict[str, Union[Struct1, Struct2, ...]] fully work fine. If you have a reproducible example showing otherwise I'd be happy to take a look.
Works just fine, but if you try Union[Point] | list | set or Union[Point] | dict is where you run into problems.. TypeError raised by the Union
Side note - when posting comments referring to errors, it's helpful to include the full traceback so we're all on the same page. Right now I'm left guessing what you're seeing raising the type error.
First, there's no difference in encoding/decoding support between Unions of tagged structs and structs in general. Also, Union[SomeType] is always the same as just SomeType, no need for the extra union. So your simplified examples are:
import msgspec
from typing import Union

class Point(msgspec.Struct):
    x: int
    y: int

for typ in [Union[Point, list, set], Union[Point, dict], Union[int, list, dict]]:
    print(f"Trying a decoder for {typ}...")
    try:
        msgspec.json.Decoder(typ)
    except TypeError as exc:
        print(f" Failed: {exc}")
    else:
        print(" Succeeded")
This outputs:
Trying a decoder for typing.Union[__main__.Point, list, set]...
Failed: Type unions may not contain more than one array-like (list, set, tuple) type - type `typing.Union[__main__.Point, list, set]` is not supported
Trying a decoder for typing.Union[__main__.Point, dict]...
Failed: Type unions may not contain both a Struct type and a dict type - type `typing.Union[__main__.Point, dict]` is not supported
Trying a decoder for typing.Union[int, list, dict]...
Succeeded
Note that the error is coming from creating the msgspec.json.Decoder, not from creating the Union itself. The error messages are echoing the restrictions for unions that are described in the docs (https://jcristharif.com/msgspec/usage.html#union-optional).
In both cases the issue is that the union contains multiple Python types that all map to the same JSON type with no way to determine which one to decode into. Both list and set map to JSON arrays - we can't support both in a union since this would lead to ambiguity when decoding and a JSON array is encountered. Same for Point | dict - both python objects encode as JSON objects, there's no efficient way to determine which type to decode into. int | list | dict is fine, since each of these python types maps to a different JSON type. Tagged unions provide an efficient way to determine the type to decode at runtime, which is why only tagged structs can coexist within the same union.
I guess for me the more ideal default would be that the/some standard decoder is capable of handling tagged union structs and the built-in type set
This is not possible, for the same reason as presented above. msgspec forbids ambiguity. Say we try to support what you're asking, given the following schema:
import msgspec
from typing import Any

class Point(msgspec.Struct):
    x: int
    y: int

dec = msgspec.json.Decoder(Point | Any)  # right now this works, but ignores the struct completely since `Any` is present
Given a message like {"x": 1, "y": 2}, you might expect this to return a Point, since it matches the Point schema. But what if we get a message like {"x": 1, "y": 2.0}? Or {"x": 1}? These messages don't match Point, do we error? Or do we return a dict? What if the error is more nested down several sub-objects? Do we have to backtrack then decode into an Any type? And really, it's likely that the faulty messages are a type error in the sender, not a distinct new message that should be decoded separately. All of this is ambiguous, and can't be done efficiently, which is why msgspec forbids it.
I guess for me the more ideal default would be that the/some standard decoder is capable of handling tagged union structs and the built-in type set
Knowing nothing about what you're actually trying to achieve here, why not just define an extra Struct type in the union that can wrap Any.
import msgspec
from typing import Any, Union

class Msg(msgspec.Struct, tag=True):
    pass

class Msg1(Msg):
    x: int
    y: int

class Msg2(Msg):
    a: int
    b: int

class Custom(Msg):
    obj: Any

enc = msgspec.json.Encoder()
dec = msgspec.json.Decoder(Union[Msg1, Msg2, Custom])

def encode(obj: Any) -> bytes:
    if not isinstance(obj, Msg):
        obj = Custom(obj)
    return enc.encode(obj)

def decode(buf: bytes) -> Any:
    msg = dec.decode(buf)
    if isinstance(msg, Custom):
        return msg.obj
    return msg

buf_msg1 = encode(Msg1(1, 2))
print(buf_msg1)
print(decode(buf_msg1))

buf_custom = encode(["my", "custom", "message"])
print(buf_custom)
print(decode(buf_custom))
Output:
b'{"type":"Msg1","x":1,"y":2}'
Msg1(x=1, y=2)
b'{"type":"Custom","obj":["my","custom","message"]}'
['my', 'custom', 'message']
Note that the builtin message types (Msg1, Msg2) will only be decoded properly if they are top-level objects, since that's what the provided schema expects. Custom messages thus should only be composed of builtin types (dict/list/...) but can then be unambiguously handled. If you also want to handle e.g. lists of the above at the top level (or whatever) you could add that to the union provided to the decoder as well:
MsgTypes = Union[Msg1, Msg2, Custom]
# Decoder expects either one of the above msg types, or a list of the above msg types
decoder = msgspec.json.Decoder(Union[MsgTypes, list[MsgTypes]])
In the future, large issue dumps like this that are rapidly updated are hard to follow as a maintainer. If you expect a concise and understanding response from me, please put in the effort to organize and present your thoughts in a cohesive manner. While the examples presented in this blogpost aren't 100% relevant for this repository, the general sentiment of "users should provide clear, concise, and reproducible examples of what their issue is" is relevant.
If you expect a concise and understanding response from me, please put in the effort to organize and present your thoughts in a cohesive manner.
My apologies, I didn't know the root issue that I was seeing at the outset, which is why I've tried to update things as I've discovered more, both from using the lib and from seeing what's possible through tinkering.
Also a lot of this is just thinking out loud as a new user, my apologies if that's noisy, hopefully someone else will find it useful if they run into a similar issue.
The main issue I was confused by was mostly this (and i can move this to the top for reference if you want):
msgspec.Structs
I do think making some examples of the case I'm describing would be super handy to have in the docs as maybe more of an advanced use case?
The general sentiment of "users should provide clear, concise, and reproducible examples of what their issue is" is relevant.
Totally, originally I thought this was a simple question and now I realize it's a lot more involved; I entirely mis-attributed the real problem to something entirely different, hence my original issue title being ill-informed
In summary, my main issue was more or less addressed in your answer here, which is what I also concluded:
Note that the builtin message types (Msg1, Msg2) will only be decoded properly if they are top-level objects, since that's what the provided schema expects. Custom messages thus should only be composed of builtin types (dict/list/...) but can then be unambiguously handled. If you also want to handle e.g. lists of the above at the top level (or whatever) you could add that to the union provided to the decoder as well:
In other words you can't pass in an arbitrary dict[dict[dict, Struct]] and expect the embedded Struct to be decoded without defining the exact schema hierarchy (at least leading to the Struct field) ahead of time. So in some sense this also is similar to my question in #25.
Knowing nothing about what you're actually trying to achieve here, why not just define an extra Struct type in the union that can wrap Any.
So I think this is pretty similar to what i presented in the embedded Raw-payload example i put above, I was just originally looking for any tagged union struct, anywhere in the encoded data, to be automatically decoded no matter where it was situated in the composed data structure hierarchy.
So really, I guess what I am after now is some way to dynamically describe such schemas, maybe even during a struct-msg flow.
Again my apologies about this being noisy, not well formed, ill-attributed; I really just didn't know what the real problem was.
@jcrist I updated the description to include the summary of everything, hopefully that makes all the noise, back and forth, more useful to onlookers
To just finally summarize and answer all questions you left open for me:
This does work. All types are fully composable, there is no limitation in msgspec requiring structs be at the top level, or that structs can't be subtypes in containers. dict[str, SomeStructType] or dict[str, Union[Struct1, Struct2, ...]] fully work fine. If you have a reproducible example showing otherwise I'd be happy to take a look.
Yes, this does work as long as you specify the schema ahead of time, but even still it's not clear to me how you would use some "top level" decoder to decode non-static-schema embedded Struct types. So you have to either know the schema or you have to create some dynamic decoding system as I showed in my longer example.
Note that the error is coming from creating the msgspec.json.Decoder, not from creating the Union itself.
Agreed, I mis-attributed the error: msgspec.msgpack.Decoder(Struct | Any) works fine.
In both cases the issue is that the union contains multiple Python types that all map to the same JSON type with no way to determine which one to decode into. Both list and set map to JSON arrays - we can't support both in a union since this would lead to ambiguity when decoding and a JSON array is encountered. Same for Point | dict - both python objects encode as JSON objects, there's no efficient way to determine which type to decode into.
Agreed, but with the case of tagged Struct this isn't true any more right because you can check for the tag_field and decide if it matches one in your struct registry no?
Tagged Unions provides an efficient way to determine the type to decode at runtime, which is why only tagged structs can coexist within the same union.
Ok so this sounds like what I'm asking for is supposed to work right?
This is not possible, for the same reason as presented above. msgspec forbids ambiguity. Say we try to support what you're asking, given the following schema:
But then you say it isn't and give an example with a non-tagged-Struct?
Knowing nothing about what you're actually trying to achieve here, why not just define an extra Struct type in the union that can wrap Any
Yes, this is more or less what I concluded except using Raw is more general and allows the default to be Any and using a header to specify custom/filtered struct types.
What if the error is more nested down several sub-objects? Do we have to backtrack then decode into an Any type? And really, it's likely that the faulty messages are a type error in the sender,
So i guess the problem here would be decode aliasing due to a tag field collision? I can see why you might want to just sidestep this entirely.
Just as one final proposal, couldn't we just presume if you find a {"type": "CustomStruct"} that it should be decoded to CustomStruct or error and if it turns out that was an error by the sender having a "type" (or wtv tag_field is) key, then you just throw an equivalent error?
DecodeError(f"Can't decode {msgpack_obj} to type 'CustomStruct' did you send an object with '{tag_field}' set?")
And then the user will know either the serialized object is malformed or there is a collision they have to work around by changing the tag_field setting?
My use case: handling an IPC stream of arbitrary object messages, specifically with msgpack. I desire to use Structs for custom object serializations that can be passed between memory boundaries.
My presumptions were originally:
msgpack bytes and tagged msgspec.Structs
{"type": "CustomStruct", "field0": "blah"}) and automatically know that the embedded msgpack object is one of our custom tagged structs and should be decoded as a CustomStruct.
Conclusions
Based on below thread:
Union
Decoder(Any | Struct) won't work even for top level Structs in the msgpack frame
This took me a (little) while to figure out because the docs didn't have an example for this use case, but if you want to create a Decoder that will handle a Union of tagged structs and it will still also process the standard built-in type set, you need to specify the subset of the std types that don't conflict with Struct as per @jcrist's comment in the section starting with:
So Decoder(Any | MyStructType) will not work.
I had to dig into the source code to figure this out and it's probably worth documenting this case for users?
Alternative solutions:
It seems there is no built-in way to handle an arbitrary serialization encode-stream that you wish to decode into the default set as well as be able to decode embedded tagged Struct types.
But, you can do a couple other things inside custom codec routines to try and accomplish this:
create a custom boxed Any struct type, as per @jcrist's comment under the section starting with:
consider creating a top-level boxing Msg type and then using msgspec.Raw and a custom decoder table to decode payload msgpack data as in my example below