python / cpython

The Python programming language
https://www.python.org
Other
62.51k stars 30.01k forks source link

Subclassed dataclass default is parent value when child forgets to type annotate. #123269

Open jstjohn opened 1 month ago

jstjohn commented 1 month ago

Bug report

Bug description:

This is either a bug, or at least very confusing.

The biggest bug is highlighted in the test below. Note how if you forget to add a type hint to your overriding attribute, you end up with the parent value still rather than the child value. On the other hand if you include any type hint, you get the child value as expected.

from dataclasses import dataclass

@dataclass
class DBaseWithType:
    arg: int = 5

@dataclass
class DChildNoType(DBaseWithType):
    arg = 7

@dataclass
class DChildWithType(DBaseWithType):
    arg:int = 7

@dataclass
class DChildWithArbitraryType(DBaseWithType):
    arg:set = 7

assert DChildWithType().arg == 7, "Passes, this is updated to 7"
assert DChildWithArbitraryType().arg == 7, "Passes, this is updated to 7"
assert DChildNoType().arg == 7, "Fails, this is still 5"

If on the other hand you do not annotate the parent type, repr is still broken, but at least you get the child atribute out in either case.

from dataclasses import dataclass

@dataclass
class DBaseNoType:
    arg = 5

@dataclass
class DChildNoType(DBaseNoType):
    arg = 7

@dataclass
class DChildWithType(DBaseNoType):
    arg:int = 7

@dataclass
class DChildWithArbitraryType(DBaseNoType):
    arg:set = 7

assert DChildWithType().arg == 7, "Passes, this is updated to 7"
assert DChildWithArbitraryType().arg == 7, "Passes, this is updated to 7"
assert DChildNoType().arg == 7, "Passes, this is updated to 7"
assert repr(DChildWithArbitraryType()) == "DChildWithArbitraryType(arg=7)"
assert repr(DChildWithType()) == "DChildWithType(arg=7)"
assert repr(DChildNoType()) == "DChildNoType(arg=7)", "Fails, this shows up as DChildNoType() and doesn't display the updated arg, on the other hand neither does DBaseNoType()"

CPython versions tested on:

3.10, 3.11

Operating systems tested on:

Linux, macOS

sobolevn commented 1 month ago

Quoting docs:

The @dataclass decorator examines the class to find fields. A field is defined as a class variable that has a type annotation.

https://docs.python.org/3/library/dataclasses.html#dataclasses.dataclass

@dataclass
class DChildNoType(DBaseWithType):
    arg = 7

is not a dataclass field, but is a class attribute.

jstjohn commented 1 month ago

I see. In general this is kind of tricky. I know enough python to think that type hints are just hints, so when they have an actual impact on functionality like this I'm surprised. I will be honest though, I did not carefully read the docs on the @dataclass decorator. I do know that I am not the only person who was very surprised by this behavior though.

jstjohn commented 1 month ago

Also from a user of type-hints perspective, it's common to declare a type on the first instance of a variable, and then not declare it on subsequent instances. So I think me being on autopilot there was part of what lead me to leave the type hint off of subsequent child dataclasses. Since this is such surprising behavior, I would argue that it should be dealt with in some way. Maybe a warning (or error?) if a user includes a class attribute that shares the same name as a field when doing a dataclass annotation? Is there a good use case that would make this warning/error approach a bad idea? It would have saved me quite a bit of time as a developer, and is the only instance I'm currently aware of where type hints have significant functional consequences on the runtime behavior.

jstjohn commented 1 month ago

@sobolevn what are your thoughts on this? I think it would be safer/more helpful to not allow a non-field attribute to override a field in a dataclass because of this behavior. An error or warning message would have been very helpful to me. Something that checks to see if a user is re-using the same variable name as a field that was defined on the parent dataclass. Is there a reason to allow a user to do this? I'm not even sure how you would access the attribute you assigned since it seems to get overridden by the parent classes field in the introduced init function. It leads to an inverted inheritance pattern in practice.

jstjohn commented 1 month ago

There are already errors that get raised when you make some mistakes related to this. For example if you try to assign a field to an attribute without a type you get an error. Would it be hard to also check to see if the parent class has a field that exists and matches the name of a non-field attribute when wrapping a new class? Not sure if the wrapper would have access to that level of information but it would be really nice...

In [1]: from dataclasses import dataclass, field

In [2]: @dataclass
   ...: class C0:
   ...:     arg = field(default=2)
   ...: 
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[2], line 2
      1 @dataclass
----> 2 class C0:
      3     arg = field(default=2)

File /usr/lib/python3.10/dataclasses.py:1184, in dataclass(cls, init, repr, eq, order, unsafe_hash, frozen, match_args, kw_only, slots)
   1181     return wrap
   1183 # We're called as @dataclass without parens.
-> 1184 return wrap(cls)

File /usr/lib/python3.10/dataclasses.py:1175, in dataclass.<locals>.wrap(cls)
   1174 def wrap(cls):
-> 1175     return _process_class(cls, init, repr, eq, order, unsafe_hash,
   1176                           frozen, match_args, kw_only, slots)

File /usr/lib/python3.10/dataclasses.py:979, in _process_class(cls, init, repr, eq, order, unsafe_hash, frozen, match_args, kw_only, slots)
    977 for name, value in cls.__dict__.items():
    978     if isinstance(value, Field) and not name in cls_annotations:
--> 979         raise TypeError(f'{name!r} is a field but has no type annotation')
    981 # Check rules that apply if we are derived from any dataclasses.
    982 if has_dataclass_bases:
    983     # Raise an exception if any of our bases are frozen, but we're not.

TypeError: 'arg' is a field but has no type annotation
sobolevn commented 1 month ago

I am still not sure about this. Sounds like a warning might be an idea worth trying, but on the other hand it might trigger on a lot of valid existing code. cc @ericvsmith and @carljm

jstjohn commented 1 month ago

I'm curious if you are aware of any specific valid code that uses the pattern of having a non-field attribute in a child dataclass that matches the name of a field attribute in the parent dataclass @sobolevn? It seems like the new attribute you declare in the child class ends up just getting masked by the parent in a way that it becomes inaccessible. It's like a no-op at best but I would guess confusing to most users. I can't think of a use case for this pattern yet.

If an error caused failures in valid code, I would guess they are all instances of the code only happening to be valid because the child class either copies the same attribute value from the parent so that the masked attribute is the same, or the attribute that's masked is not used (or is equally valid for both options for the value). But yeah I am far from an expert in all the things going on behind the scenes so don't put too much weight on this opinion!

ericvsmith commented 1 month ago

How would you even test for this? I don't think the slowdown in dataclass creation would be worth the check.

carljm commented 1 month ago

Just to be clear, the unannotated class attribute doesn't disappear into thin air, it's just a normal class attribute, not handled as a field by dataclasses (so it doesn't change the default value of the field inherited from the parent dataclass). It's accessible in the same way any other class attribute would be, by accessing it on the class:

>>> from dataclasses import dataclass
>>> @dataclass
... class DBaseWithType:
...     arg: int = 5
...
>>> @dataclass
... class DChildNoType(DBaseWithType):
...     arg = 7
...
>>> DChildNoType.arg
7
ericvsmith commented 1 month ago

Also, since this is currently valid code, I don't think adding a warning or making it an error is appropriate. I'll grant that it's probably not used very often, but it's entirely legal.

jstjohn commented 1 month ago

@carljm interesting! While your example shows that the attribute has been added to your child class, it does get masked as soon as you init. This is confusing behavior until you get into the guts of how things are implemented, what the definition of a field vs an attribute is, etc. It does not make for a very easy experience for a user.

In [1]: from dataclasses import dataclass

In [2]: @dataclass
   ...: class DBaseWithType:
   ...:     arg: int = 5
   ...: 

In [3]: @dataclass
   ...: class DChildNoType(DBaseWithType):
   ...:     arg = 7
   ...: 

In [4]: DChildNoType.arg
Out[4]: 7

In [5]: DChildNoType.arg == DChildNoType().arg
Out[5]: False

@ericvsmith While you're right that the above is syntactically valid python code, so is assigning a field to an untyped attribute as far as I'm concerned, which you already raise an exception for (see several messages earlier). I'm not arguing that you should not have this exception, to be clear, it's helpful! What I'm arguing for is another exception that would be in the same category of an error that improves usability for this particular package. I'm not sure how you could/should fix this, but I do think this is a pretty gnarly issue. Here's a PEP saying that "Python will remain a dynamically typed language, and the authors have no desire to ever make type hints mandatory, even by convention." https://peps.python.org/pep-0484/#non-goals

justanotherscriptkiddienerd commented 1 month ago

True tho

ericvsmith commented 1 month ago

@jstjohn: annotations (aka type hints) are absolutely mandatory for dataclasses. PEPs are historical documents, and are only valid at a point in time.

jstjohn commented 1 month ago

@ericvsmith sure. My point was more about how arguing for technically valid Python code isn't necessarily the right way to go in this case, due to the peculiarities of data classes.

Question: what if you went with your statement all the way then about how type annotations are absolutely manditory in this case: print a warning message whenever a data class wrapper encounters a non-annotated attribute with a pointer to how the code is probably not doing what they think it is.

If you could find a way to specifically figure out if an attribute shares the name of a parent field, that would be best, but a warning when unannotated attributes are encountered would still be very useful, especially if there's a nice way to turn the warning off if a user knows what they're doing?

The warning could include something about how they are not processed by the data class wrapper into init, and will be overridden at init time by any fields that share that name, if such a field exists in the parent class hierarchy?