python / cpython

The Python programming language
https://www.python.org
Other
62.59k stars 30.04k forks source link

copy fails on collections.OrderedDict dataclass with required args #105736

Open ringohoffman opened 1 year ago

ringohoffman commented 1 year ago

Bug report

Related: https://github.com/huggingface/transformers/issues/8978

import collections
import copy
import dataclasses

@dataclasses.dataclass
class ModelOutputDictBase(dict):
    a: int
    b: int = 2

# works
copy.deepcopy(
    ModelOutputDictBase(
        a=1,
        b=2,
    )
)

@dataclasses.dataclass
class ModelOutputAllDefaults(collections.OrderedDict):
    a: int = 1
    b: int = 2

# works
copy.deepcopy(
    ModelOutputAllDefaults(
        a=1,
        b=2,
    )
)

@dataclasses.dataclass
class ModelOutput(collections.OrderedDict):
    a: int
    b: int = 2

# fails
copy.deepcopy(
    ModelOutput(
        a=1,
        b=2,
    )
)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 39
     35     a: int
     36     b: int = 2
---> 39 copy.deepcopy(
     40     ModelOutput(
     41         a=1,
     42         b=2,
     43     )
     44 )

File [...lib/python3.8/copy.py:172), in deepcopy(x, memo, _nil)
    170                 y = x
    171             else:
--> 172                 y = _reconstruct(x, memo, *rv)
    174 # If is its own copy, don't memoize.
    175 if y is not x:

File [.../lib/python3.8/copy.py:264), in _reconstruct(x, memo, func, args, state, listiter, dictiter, deepcopy)
    262 if deep and args:
    263     args = (deepcopy(arg, memo) for arg in args)
--> 264 y = func(*args)
    265 if deep:
    266     memo[id(x)] = y

TypeError: __init__() missing 1 required positional argument: 'a'

Your environment

Linked PRs

eendebakpt commented 1 year ago

@ringohoffman I can confirm the deepcopy (or copy.copy) fails on python 3.11 with the example above. What is the reason for mixing the dataclass decorator and the dict or OrderedDict subclassing?

For the OrderedDict subclassing the __reduce__ on the ModelOutput becomes <method '__reduce__' of 'collections.OrderedDict' objects> which is not handling the copy of the dataclass correctly (in combination with the copy._reconstruct). I do not know whether this combination is supposed to work

ringohoffman commented 1 year ago

transformers uses this pattern as part of their ModelOutput base class so you can access elements both by attribute and by item: https://github.com/huggingface/transformers/blob/0c3fdccf2f271fb7c44f6ea6e9f4ee234795f2c5/src/transformers/utils/generic.py#L315-L332

dict and its subclasses have special handling by torch in terms of how it handles splitting tensors that are nested elements of dict across threads and processes: https://github.com/pytorch/pytorch/blob/d0ff640ec865dd8c2898b59f1dab689992ed4621/torch/nn/parallel/scatter_gather.py#L50-L51

sobolevn commented 1 year ago

CC @rhettinger

rhettinger commented 1 year ago

Here's a smaller reproducer:

from pprint import pp
import collections
import copy

class A(dict):
    def __init__(self, a: int, b: int=2):
        self.a = a
        self.b = b

class B(collections.OrderedDict):
    def __init__(self, a: int, b: int=2):
        self.a = a
        self.b = b

pp(A(1, 3).__reduce__())
pp(B(1, 3).__reduce__())

a = copy.copy(A(1, 3))
b = copy.copy(B(1, 3))
rhettinger commented 1 year ago

For copying to work in this case, the __reduce__ method must return a callable that creates an instance but does not run the class's __init__ method since we can't know its signature in advance.

Also, even before copying, the example class isn't functional for the pure python OrderedDict implementation because it overrides __init__ which is needed to set up the underlying data structures. That setup work needs to be moved from __init__ to __new__.

rhettinger commented 1 year ago

@ericsnowcurrently Eric, do you have any thoughts on how we can (and/or whether we should) modify OrderedDict.__reduce__ (in both the Python and C versions) to survive a subclass that overrides __init__, doesn't call the parent __init__, and that has some required arguments?

ericsnowcurrently commented 1 year ago

I don't have much stake in this, but that situation certainly gives me pause. We're already dealing with a dict subclass, which makes things dicey. Furthermore, subclassing is necessarily cooperative, particularly when __init__() is involved. I'd hesitate to accommodate subclasses that don't cooperate. (FWIW, I would have expected dataclasses to do the right thing in the original post. Maybe it can't and the use must supply the proper __init__()?)

hugovk commented 10 months ago

Is there anything left to do here, or can we close this issue?