python / mypy

Optional static typing for Python
https://www.mypy-lang.org/
Other
18.33k stars 2.81k forks source link

Data Class Transforms add `__dataclass_fields__` #15974

Open ikonst opened 1 year ago

ikonst commented 1 year ago

Per PEP 681, data class transforms are defined in terms of dataclasses, but it doesn't mean they make dataclasses.

In particular, transformed classes don't necessarily have __dataclass_fields__ or have is_dataclass return True, but per mypy - they do.

erictraut commented 1 year ago

Where is __dataclass_fields__ documented? I don't see any reference to it in PEP 557 (which introduced dataclass) or in the official Python documentation for dataclasses. I presume that it's an internal implementation detail of the stdlib dataclass. If so, then I'd argue that it shouldn't appear in typeshed stubs, and nothing in the typing specs (including PEP 681) should depend on it.

The only reference to __dataclass_fields__ that I can find is in this typeshed issue and this referenced mypy issue. I don't think I agree with the conclusions that were reached in those threads.

ikonst commented 1 year ago

Yeah, you've got it. Nominally your approach is sound: if it's not in the spec, typeshed shouldn't build a protocol on it. Rather, #14263 should've been the way to go. The typeshed solution leverages a "plugin" behavior anyway (the existence of this undocumented attr) so might as well go "plugin" on this one.

~In practice, libs like pydantic have been synthesizing this attr to make their models act as dataclasses, so there's interest in having is_dataclass serve as an effective (and correct!) type guard for using the other dataclasses utility functions.~ (correction: pydantic actually produces a false positive and would improve if this issue is fixed)

ikonst commented 1 year ago

FWIW, pyright also synthesizes this symbol, and it could be similarly argued that since it's an implementation detail of dataclasses, then pyright should not synthesize it, and then typeshed wouldn't "take advantage" of it (as they did in https://github.com/python/typeshed/issues/9345).

Or we can accept that we've grown to rely on it. 😔

This issue is about whether we should synthesize it indiscriminately for Data Class Transforms. Currently we do, and the following is a false positive:

import typing

_T = typing.TypeVar("_T")

@typing.dataclass_transform()
def create_model(cls: type[_T]) -> type[_T]:
    cls.__init__ = lambda *args, **kwargs: None  # type: ignore
    return cls

@create_model
class C:
    a: int

c = C(a=42)
typing.reveal_type(c.__dataclass_fields__)

Runtime:

    typing.reveal_type(c.__dataclass_fields__)
                       ^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'C' object has no attribute '__dataclass_fields__'

mypy:

file.py:18: note: Revealed type is "builtins.dict[builtins.str, Any]"
Success: no issues found in 1 source file
JelleZijlstra commented 1 year ago

I don't think we should. PEP 681 is a specification for abstractions that work like dataclasses in their public interface; it doesn't require that abstractions that use dataclass_transform mirror private implementation details of dataclasses.

ikonst commented 1 year ago

@JelleZijlstra I agree, and this issue about removing this synthesized attribute in that case, basically

-self._add_dataclass_fields_magic_attribute()
+if self._spec is _TRANSFORM_SPEC_FOR_DATACLASSES:
+    self._add_dataclass_fields_magic_attribute()

At the same time, I did some sleuthing for a related issue, and it would appear that doing that would break many users that have grown to rely on @dataclass_transform to create alternative decorators that ultimately do call into dataclasses.dataclass and thus produce true dataclasses (example). It's not what PEP-681 was intended for, but I found copious examples. So now I don't know (insert "two buttons" meme).

erictraut commented 1 year ago

FWIW, pyright also synthesizes this symbol

I added a synthesized __dataclass_fields__ in pyright because mypy added it and typeshed added a dependency on it. I should have done more research and asked questions before doing this.

Here are some questions that may help inform the best way forward...

  1. Why was the __dataclass_fields__ hack added to typeshed? I presume it was to provide better type checking for replace, asdict, astuple, fields, and is_dataclass.
  2. Was there evidence that developers are frequently passing non-dataclass objects to these functions? In other words, is this a common source of bugs?
  3. Which of these functions (if any) apply to dataclass-like classes? I presume the answer is none.
  4. How frequently are these functions used in real-world code? Is there a way for us to measure this?

Thinking about potential solutions...

  1. We could delete DataclassInstance and all of its uses from typeshed, replacing all instances with object. Both mypy and pyright could then also delete the synthesized __dataclass_fields__. This would decrease type safety for the functions I mentioned above. If they're infrequently used and/or not a common source of bugs, this probably doesn't matter.
  2. We could augment dataclass_transform to support a new parameter that indicates whether a __dataclass_fields__ attribute is synthesized. In other words, it could be specified by the user explicitly.
NeilGirdhar commented 1 year ago

As mentioned here, pyserde, strawberry, ibis would use the parameter mentioned in option 2. In fact, so would flax I believe.

ikonst commented 1 year ago

We could augment dataclass_transform to support a new parameter that indicates whether a __dataclass_fields__ attribute is synthesized. In other words, it could be specified by the user explicitly.

PEP-681 says: "In the future, we may add additional parameters to dataclass_transform as needed to support common behaviors in user code. These additions will be made after reaching consensus on typing-sig rather than via additional PEPs."

Should we start a topic in typing-sig on adding a makes_dataclass parameter?

erictraut commented 1 year ago

Typing-sig is pretty much defunct at this point. I recommend using python/typing instead. But yes, that sounds like a good next step. I'm still not convinced that this is a problem that needs solving. It would be useful for you to present the motivation for a change in the discussion thread.

ikonst commented 1 year ago

👉 https://github.com/python/typing/discussions/1456