rnag / dataclass-wizard

A simple, yet elegant, set of wizarding tools for interacting with Python dataclasses.
Other
174 stars 22 forks source link

[BUG] Incomplete Support for Union or | Type Deserialization in Common Python Types (e.g., `int`, `Enum`, `Literal`) #67

Open StripedMonkey opened 2 years ago

StripedMonkey commented 2 years ago

Description

Using the dataclass-wizard to deserialize a YAML file fails when trying to use string literals as field options. In my particular case, a value in the config can be set manually or derived automatically depending on user preference. In this instance, the only valid string value for x should be Auto. Any other value should fail to deserialize. However, since it appears that the parser is unaware of the possible equality of str/num/bytes to a Literal this fails.

What I Did

>>> from dataclass_wizard import YAMLWizard
>>> from typing import Literal
>>> from dataclasses import dataclass
>>> @dataclass
... class Example(YAMLWizard):
...     x: Literal["Auto"] | float
... 
>>> config_file = """
... x: Auto
... """
>>> Example.from_yaml(config_file)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/.venv/lib/python3.10/site-packages/dataclass_wizard/wizard_mixins.py", line 136, in from_yaml
    return fromdict(cls, o) if isinstance(o, dict) else fromlist(cls, o)
  File "~/.venv/lib/python3.10/site-packages/dataclass_wizard/loaders.py", line 536, in fromdict
    return load(d)
  File "~/.venv/lib/python3.10/site-packages/dataclass_wizard/loaders.py", line 619, in cls_fromdict
    cls_kwargs[field_name] = field_to_parser[field_name](
  File "~/.venv/lib/python3.10/site-packages/dataclass_wizard/parsers.py", line 250, in __call__
    raise ParseError(
dataclass_wizard.errors.ParseError: Failure parsing field `x` in class `Example`. Expected a type [typing.Literal['Auto'], <class 'float'>], got str.
  value: 'Auto'
  error: Object was not in any of Union types
  tag_key: '__tag__'
  json_object: '{"x": "Auto"}'
mia-n commented 1 year ago

I don't think this has anything to do with Literals - anything that needs to be deserialized in a union is not being deserialized.

>>> from dataclasses import dataclass
>>> from dataclass_wizard import JSONWizard
>>> @dataclass
... class B:
...     thing: str
... 
>>> 
>>> @dataclass
... class A(JSONWizard):
...     one_or_more: B | list[B]
... 
>>> 
>>> result = A.from_dict({
...     'one_or_more': {
...         'thing': 'value'
...     }
... })
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/site-packages/dataclass_wizard/loaders.py", line 536, in fromdict
    return load(d)
  File "/usr/local/lib/python3.10/site-packages/dataclass_wizard/loaders.py", line 619, in cls_fromdict
    cls_kwargs[field_name] = field_to_parser[field_name](
  File "/usr/local/lib/python3.10/site-packages/dataclass_wizard/parsers.py", line 250, in __call__
    raise ParseError(
dataclass_wizard.errors.ParseError: Failure parsing field `one_or_more` in class `A`. Expected a type [<class '__main__.B'>, <class 'list'>], got dict.
  value: {'thing': 'value'}
  error: Object was not in any of Union types
  tag_key: '__tag__'
  json_object: '{"one_or_more": {"thing": "value"}}'
>>> 
fakuivan commented 1 year ago

Got the same issue here. Mixing primitives with dataclasses in unions seems to be the problem:

from dataclass_wizard import fromdict
from dataclasses import dataclass

@dataclass
class MoreDetails:
    arg: str

@dataclass
class Config:
    # Unions with primitives and dataclasses
    entry: str | MoreDetails | int

# Works
fromdict(Config, {
    "entry": "a"
})

# Works too
fromdict(Config, {
    "entry": 1
})

# Fails, fails to interpret the dict as a dataclass
fromdict(Config, {
    "entry": {"arg": "str"}
})
Traceback (most recent call last):
  File "/mnt/c/Users/fakui/Seafile/Programming/projects/cc_crapton/./ha_modbus_mqtt/dataclasswz_bug.py", line 29, in <module>
    fromdict(Config2, {
  File "/home/fakui/.local/lib/python3.10/site-packages/dataclass_wizard/loaders.py", line 536, in fromdict
    return load(d)
  File "/home/fakui/.local/lib/python3.10/site-packages/dataclass_wizard/loaders.py", line 619, in cls_fromdict
    cls_kwargs[field_name] = field_to_parser[field_name](
  File "/home/fakui/.local/lib/python3.10/site-packages/dataclass_wizard/parsers.py", line 250, in __call__
    raise ParseError(
dataclass_wizard.errors.ParseError: Failure parsing field `entry` in class `Config2`. Expected a type [<class '__main__.MoreDetails'>, <class '__main__.Config'>], got dict.
  value: {'arg': '1'}
  error: Object was not in any of Union types
  tag_key: '__tag__'
  json_object: '{"entry": {"arg": "1"}}'

dataclass-wizard version: 0.22.2

StripedMonkey commented 9 months ago

It appears you might be unfamiliar with the GitHub workflow. In order to propose your changes you must fork this repository and submit those changes to your fork. Once you have done so there will appear a button on your fork that asks if you want to open a PR.

TL;Dr you need to fork, submit your changes to the fork, then create a PR. It's not normal for you to have the ability to commit to a repository you are not a maintainer of.

mikeweltevrede commented 9 months ago

@StripedMonkey thanks for the quick reply! Indeed I was unfamiliar with this way of working. I forked the repo and created #111.

rnag commented 1 week ago

I was incorrect in thinking #111 would fix the issue. After looking into this a bit deeper, it seems I figured out the issue, but it's a tough fix.

The problem is that if you annotate like Literal['abc'] | float, currently the parser is expecting, literally a string of abc or a float value like 12345.

The failure is happening because you get float value '12345' in string from YAML, not a float at all.

Here's a sample code to demonstrate:

from dataclasses import dataclass
from typing import Literal

from dataclass_wizard import JSONWizard

@dataclass
class MyClass(JSONWizard):
    literal_or_float: Literal['abc'] | float

# OK
c = MyClass.from_dict({"literal_or_float": "abc"})
print(repr(c))

# OK
c = MyClass.from_dict({"literal_or_float": 1.23})
print(repr(c))

# NOK (Raises error)
c = MyClass.from_dict({"literal_or_float": "1.23"})

This results in error on the last line:

dataclass_wizard.errors.ParseError: Failure parsing field `literal_or_float` in class `MyClass`. Expected a type [typing.Literal['abc'], <class 'float'>], got str.
  value: '1.23'
  error: Object was not in any of Union types
  tag_key: '__tag__'
  json_object: '{"literal_or_float": "1.23"}'

As you can see, it works for first 2 cases, but it fails because a string value of "1.23" is being attempt to be deserialized. This normally works if field was annotated as just a float, because there's only one type annotation, and it just calls the function float to convert any incoming value to that type, which works nicely in single-type case.

BUT, when you add a Union with another type, like say a Literal["ABC"], then it gets more complicated to handle.

I want to address this in a way that doesn't impact performance (too much), so I'll have to think about this. The thing is if a string is coming in as "1.23" and a type for a field is like Literal['abc'] | float, the simplest and possibly most efficient approach would to -- instead of checking for direct match against type annotation(s) in the Union -- to just try/catch conversion for each Union argument, and then handle the error at the end.

I'll have to look into this some more, but I'm re-opening this issue because it's a known/current problem.

rnag commented 1 week ago

Got the same issue here. Mixing primitives with dataclasses in unions seems to be the problem:

from dataclass_wizard import fromdict
from dataclasses import dataclass

@dataclass
class MoreDetails:
    arg: str

@dataclass
class Config:
    # Unions with primitives and dataclasses
    entry: str | MoreDetails | int

# Works
fromdict(Config, {
    "entry": "a"
})

# Works too
fromdict(Config, {
    "entry": 1
})

# Fails, fails to interpret the dict as a dataclass
fromdict(Config, {
    "entry": {"arg": "str"}
})
Traceback (most recent call last):
  File "/mnt/c/Users/fakui/Seafile/Programming/projects/cc_crapton/./ha_modbus_mqtt/dataclasswz_bug.py", line 29, in <module>
    fromdict(Config2, {
  File "/home/fakui/.local/lib/python3.10/site-packages/dataclass_wizard/loaders.py", line 536, in fromdict
    return load(d)
  File "/home/fakui/.local/lib/python3.10/site-packages/dataclass_wizard/loaders.py", line 619, in cls_fromdict
    cls_kwargs[field_name] = field_to_parser[field_name](
  File "/home/fakui/.local/lib/python3.10/site-packages/dataclass_wizard/parsers.py", line 250, in __call__
    raise ParseError(
dataclass_wizard.errors.ParseError: Failure parsing field `entry` in class `Config2`. Expected a type [<class '__main__.MoreDetails'>, <class '__main__.Config'>], got dict.
  value: {'arg': '1'}
  error: Object was not in any of Union types
  tag_key: '__tag__'
  json_object: '{"entry": {"arg": "1"}}'

dataclass-wizard version: 0.22.2

@fakuivan That is actually not the root issue. Dataclass in Union Types is possible with the Dataclass Wizard, it just requires correct usage of tag field as shown below:

from dataclasses import dataclass
from typing import Literal

from dataclass_wizard import JSONWizard

@dataclass
class MyClass(JSONWizard):
    class Meta(JSONWizard.Meta):
        tag_key = 'type'
        auto_assign_tags = True

    literal_or_float: Literal['abc'] | float | 'AClass'

@dataclass
class AClass(JSONWizard):
    something: str

c = MyClass.from_dict({"literal_or_float": {"type": "AClass", "something": "3"}})
print(repr(c))
# OK
c = MyClass.from_dict({"literal_or_float": "abc"})
print(repr(c))

# OK
c = MyClass.from_dict({"literal_or_float": 1.23})
print(repr(c))

# NOK (Raises error)
# c = MyClass.from_dict({"literal_or_float": "1.23"})
# print(repr(c))

@mia-n Similar approach can be used with the example you had. That code runs fine with some modification.

The root issue is that Dataclass Wizard does not attempt a "naive" approach when it comes to Union types -- it does not try to parse a value as each of Union types in a round-robin fashion. Some libraries like pydantic do this, however Dataclass Wizard does not, mainly for performance reasons.

That is why "tag" key is required when dataclasses are in Unions. Since each Union type is not attempted in a "try-catch" fashion, the "tag" key helps identify exactly which dataclass to parse as. This is more performant, and also helps in cases when two dataclasses in Union share the same fields and types, for example.

rnag commented 1 week ago

Adding my thoughts here in the form of pseudo-code. I think one approach can be to create a class UnionChecker or similar in loaders.py or a new module. It would be modeled similar to LoadMixIn and have a method for each common type that would be in a Union or | in a type annotation, such as int, float, enum, datetime, ... and so on.

Again, pseudo-code of how that could look:

class UnionChecker:

  def parse_int_if_valid(o: int | str | float) -> int | None:
    if type(o) is int: return o
    if type(o) is float: return int(round(o))
    if type(o) is str: return int(o) if o.removeprefix('-').isdigit() else None
    # catch-all return statement
    return None

  def parse_enum_if_valid(o: Any) -> Enum | None:
    if type(o) is Enum Class: return o
    return enum_class(o) if o is a valid value for Enum Class else None
    # catch-all return statement
    return None

Then, in parsers.py where we call Parser.__contains__ in UnionParser.__call__:

class UnionParser:
    ...
    def __call__(self, o: Any) -> Optional[T]:
        ...
        for parser in self.parsers:
   >        if o in parser:
                return parser(o)

We can replace that check with something like:

for parser in parsers:
  possible_value = parse_to_type_if_valid(o, base_type)
  if possible_value is not None:
    return possible_value

This should give a good compromise b/w efficiency and also fix the parsing support for Union, which is currently not fully supported in this library, and which I would like to address.

Some Benchmarks

Also, just in case someone tells me I'm worried about micro-optimizations which is possible, I made a small benchmark to test validating vs blind parsing for common types like Enum and int:

from enum import Enum
from timeit import timeit

class MyEnum(Enum):
    VAL_1 = 'ONE PLEASE'
    VAL_2 = 'TWO IS GREAT'
    VAL_3 = "THREE'S AWESOME"
    VAL_4 = 97

def is_enum1(o):
    try:
        _ = MyEnum(o)
        return True
    except ValueError:
        return False

def is_enum2(o):
    return o in MyEnum._value2member_map_

def is_int1(o):
    try:
        _ = int(o)
        return True
    except ValueError:
        return False

def is_int2(o: str):
    return o.removeprefix('-').isdigit()

def is_float1(o):
    try:
        _ = float(o)
        return True
    except ValueError:
        return False

def is_float2(o: str):
    return o.removeprefix('-').replace('.', '', 1).isdigit()

n = 1_000

print(timeit('is_enum1("TWO IS GREAT")', globals=globals(), number=n))
print(timeit('is_enum2("TWO IS GREAT")', globals=globals(), number=n))
print(timeit('is_float1("54.321")', globals=globals(), number=n))
print(timeit('is_float2("54.321")', globals=globals(), number=n))
print(timeit("is_int1('123')", globals=globals(), number=n))
print(timeit("is_int2('123')", globals=globals(), number=n))
# 0.00019675004296004772
# 4.5042019337415695e-05
# 6.279093213379383e-05
# 8.941697888076305e-05
# 8.983299994724803e-05
# 4.8417001380585134e-05
print()
print(timeit('is_enum1("OOPS!!")', globals=globals(), number=n))
print(timeit('is_enum2("OOPS!!")', globals=globals(), number=n))
print(timeit('is_float1("OOPS!!")', globals=globals(), number=n))
print(timeit('is_float2("OOPS!!")', globals=globals(), number=n))
print(timeit("is_int1('1.abc')", globals=globals(), number=n))
print(timeit("is_int2('1.abc')", globals=globals(), number=n))
# 0.0014168331399559975
# 3.700004890561104e-05
# 0.00046787504106760025
# 6.270897574722767e-05
# 0.000634250001894543
# 4.758400245918892e-05

assert is_enum1("OOPS!!") is is_enum2("OOPS!!") is False
assert is_enum1("TWO IS GREAT") is is_enum2("TWO IS GREAT") is True
assert is_float1("OOPS!!") is is_float2("OOPS!!") is False
assert is_float1("54.321") is is_float2("54.321") is True
assert is_int1('123') is is_int2('123') is True
assert is_int1('1.abc') is is_int2('1.abc') is False

Results show that validating values for types like Enum and int is an overall better approach. There are some types like float, for which it might be better to try parsing using float constructor, but overall validating the value/type first seems a solid approach to have.

Hope this helps - I currently don't have time to implement this myself, which is why I'm jotting my thoughts down, in case anyone wants to take a stab at it. I'm also going to update the issue with "help wanted" label. For now, I'm going to focus on addressing other issues with the Dataclass Wizard library, but I'll keep a tab on this issue and will hopefully be back to look at it with fresh eyes sometime soon.