dataclass copy constructor doesn't recognize default_factory

hsolbrig commented 2 years ago

Bug report The dataclass copy constructor doesn't recognize the default_factory.

@dataclass
class Orig2:
    a: int
    b: int = 1
    c: List[int] = field(default_factory=list)

Copy2 = dataclass(Orig2)

Causes the following error`

Traceback (most recent call last):
  File "/Users/hsolbrig/git/hsolbrig/dataclasstest/src/dataclasscopytest.py", line 53, in <module>
    Copy2 = dataclass(Orig2)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/dataclasses.py", line 1185, in dataclass
    return wrap(cls)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/dataclasses.py", line 1176, in wrap
    return _process_class(cls, init, repr, eq, order, unsafe_hash,
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/dataclasses.py", line 1025, in _process_class
    _init_fn(all_init_fields,
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/dataclasses.py", line 546, in _init_fn
    raise TypeError(f'non-default argument {f.name!r} '
TypeError: non-default argument 'c' follows default argument

Your environment This bug has been shown to exist in versions 3.9 and 3.10. The test above is python 3.10.4.

Using the line numbers in python 3.10.4, this can be fixed with the following change to dataclasses.py line 958:

    for f in cls_fields:
        fields[f.name] = f

        # If the class attribute (which is the default value for this
        # field) exists and is of type 'Field', replace it with the
        # real default.  This is so that normal class introspection
        # sees a real default value, not a Field.
        if isinstance(getattr(cls, f.name, None), Field):
            if f.default is MISSING and f.default_factory is MISSING:      # <-- Add second test
                # If there's no default, delete the class attribute.
                # This happens if we specify field(repr=False), for
                # example (that is, we specified a field object, but
                # no default value).  Also if we're using a default
                # factory.  The class attribute should not be set at
                # all in the post-processed class.
                delattr(cls, f.name)
            else:
                setattr(cls, f.name, f.default_factory if f.default is MISSING else f.default)   # <-- add conditional

hsolbrig commented 2 years ago

Sample test code can be found at https://github.com/hsolbrig/dataclasstest/blob/main/src/dataclasscopytest.py

ericvsmith commented 2 years ago

This isn't a "copy constructor", this is "apply the @dataclass decorator to the same class twice". I think it makes more sense to raise an exception if you try to do this, rather than have it succeed.

Can I ask what your use case is?

ericvsmith commented 2 years ago

The more I think about this, the more I think the right thing to do is prevent dataclass from being called twice on the same class. For example, what would:

@dataclass(repr=False)
@dataclass(repr=True)
class C:
    x: int

result in?

hsolbrig commented 2 years ago

Heavens - please don't do this! This feature is currently being utilized by the pydantic community to allow dataclasses and pydantic to co-exist. It is a wonderful bit of code that keeps us (linkml) from having to choose between the two approaches.

For details, scroll down in https://pydantic-docs.helpmanual.io/usage/dataclasses/ to the section labeled "Convert stdlib dataclasses into pydantic dataclasses". Its use can be found at https://github.com/samuelcolvin/pydantic/blob/master/pydantic/dataclasses.py#L159

I may have used the wrong terminology by calling it a "copy constructor", but if this suggested change is objectionable, please leave the code as it is and we'll work with the pydantic community to figure out an alternative.

ericvsmith commented 2 years ago

I'm not going to do anything rash!

What would pydantic expect from:

@dataclass(repr=False)
@dataclass(repr=True)
class C:
    x: int

Or does it always expect that if called twice on the same class that all of the parameters to dataclass would be the same?

hsolbrig commented 2 years ago

Let me see whether I can get the pydantic authors involved.

WRT the above question, I actually have no idea what to expect. As the old joke goes, Patient: "Doc, it hurts when I do this!" Doc: "Then don't do it!" I've posted a heads up here -- hopefully someone on that side can clock in.

hsolbrig commented 2 years ago

Just mulling it over, my recollection is that repr=False would be superfulous in this case, as the presence of a __repr__ function is the same as repr=False, no?

samuelcolvin commented 2 years ago

Hi, pydantic author here.

Thanks so much @hsolbrig for bringing this up.

Ignoring pydantic for a moment, I completely agree that it would make sense for

@dataclass(repr=False)
@dataclass(repr=True)
class C:
    x: int

to be illegal. Or to put it differently - an explicit error would be better than undefined behaviour, wrong behaviour or a less clear error later on.

The reason pydantic cares about this is because people want to convert a standard dataclass to a "pydantic dataclass" - e.g. a dataclass with machinery to perform full type validation. But we want the "pydantic dataclass" to be as close as possible to a vanilla standard dataclass.

E.g. we want:

import dataclasses
import pydantic

@dataclasses.dataclass
class Foobar:
    a: str
    b: list[int] = dataclasses.field(default_factory=lambda: [1, 2, 3])

Foobar2 = pydantic.dataclasses.dataclass(Foobar)
f2 = Foobar2(a="a")
print(f)

To work. Currently it doesn't because the new b has no default_factory.

I'd be happy/willing to rewrite how pydantic.dataclasses.dataclass works, but this is our 3rd or even 4th rewrite of pydantic dataclasses which suggests a documented, approved and std lib tested way to do what we're trying to do would be useful.

Is that something you'd consider?

hsolbrig commented 2 years ago

Update on the proposed patch: the last line should read:

                setattr(cls, f.name, f.default_factory() if f.default is MISSING else f.default)   # <-- add conditional

Executing the default_factory vs. just adding it. Will be running more tests to see whether this introduces any other problems

ericvsmith commented 2 years ago

Does pydantic always use the same parameters every time it calls dataclass on the same class?

ericvsmith commented 2 years ago

@hsolbrig: Could you please provide a PR or at least a diff?

hsolbrig commented 2 years ago

Will do -- the fix proposal is going to be a bit more complex than I'd first thought so it will probably be a day or two before I'm sure that everything works.

hsolbrig commented 2 years ago

PR submitted - happy to chat about it, although we will probably want to include @samuelcolvin as well.

carljm commented 2 years ago

I suspect that the preferable behavior might be to prevent dataclass being called twice on the same class, but to provide a supported way to "look before you leap" and check if a class has had the dataclass decorator run on it already, i.e. something like the is_direct_dataclass added in #92406, though I think the implementation should be simpler and just check for "__dataclass_fields__" in cls.__dict__. Pydantic could also just do this check directly with no new API, if "__dataclass_fields__ will only be in the __dict__ of a class that has had the decorator run on it" is a stable guarantee.

I don't think it makes sense to silently allow running the dataclass decorator twice on the same class, where the second run is a no-op. As @ericvsmith already has hinted, this leads to pretty confusing behavior if the params used are not identical for the two calls. You could have a situation where you clearly call the dataclass decorator on a class and say frozen=True, but the resulting class is not frozen, because someone else already called the decorator on it previously. So I think double-running on the same class should be an error.

As far as I can tell, pydantic doesn't really need the ability to re-run the decorator on an existing dataclass -- in that case pydantic is just counting on it doing no harm (but currently it does do harm in any case where field() was used to define a field originally, because the second run won't see the full Field, just the default value left as an attribute on the class.) It seems like pydantic could easily be changed to check first and just skip running the dataclass decorator if the class at hand has already been processed by @dataclass. Am I wrong here?

I guess it's up to pydantic whether in this case it wants to check for a match between the options given to the pydantic decorator and the ones previously used on the already-a-dataclass (this check is possible by introspecting __dataclass_params__) or just allow possible confusion if they don't match.

python / cpython

dataclass copy constructor doesn't recognize default_factory #92052