PEP 749: Add section on metaclasses

JelleZijlstra commented 2 months ago

Deriving from https://discuss.python.org/t/pep-749-implementing-pep-649/54974/28 and offline discussion with Alex.

📚 Documentation preview 📚: https://pep-previews--3847.org.readthedocs.build/

carljm commented 2 months ago

After sleeping on this and looking at the proof of concept in pure Python, I'm a bit concerned about the performance implications. Classes are already large, but adding a new object instance with several fields to every class that exists could still have serious memory consequences for large Python codebases. I think we need to consider whether the bugs that are being fixed here are serious enough in practice to justify this overhead, or whether there are simpler solutions that don't fix everything perfectly but address the biggest problems, and don't have this overhead.

JelleZijlstra commented 2 months ago

@carljm it would be a small proportion of the overall size of a class object. In the prototype I'm currently working on, the size of the descriptor object is just 48 bytes, only 3% of the size of the class object:

>>> class X: a: int
... 
>>> ann = X.__dict__["__annotations__"]
>>> ann.get("a")
<class 'int'>
>>> ann.get("b")
>>> import sys
>>> sys.getsizeof(X)
1712
>>> sys.getsizeof(ann)
48

JelleZijlstra commented 2 months ago

python/cpython#120719 has a very partial prototype now (I haven't gotten around to testing it yet or integrating descriptor behavior).

carljm commented 2 months ago

What if we never set __annotations__ in the dictionary of any subclass of type? This would mean that introspecting annotations on a metaclass would always have to re-run __annotate__, but I think it would solve the problems described here?

Or alternatively, if we used the "cache annotations under a different name than __annotations__" strategy, but only for subclasses of type.

ncoghlan commented 2 months ago

The concern with the "cache under a different name strategy" is that it isn't fully backwards compatible: retrieving the annotations directly from the class dict wouldn't work anymore.

JelleZijlstra commented 2 months ago

The concern with the "cache under a different name strategy" is that it isn't fully backwards compatible: retrieving the annotations directly from the class dict wouldn't work anymore.

That's also true for the original text of PEP 649. It says:

Class and module objects must cache the annotations dict in their dict, using the key annotations. This is required for backwards compatibility reasons.

But that's not really true under PEP 649 as written: __dict__["__annotations__"] only exists if you previously warmed the cache by accessing the .__annotations__ descriptor. Arguably that's worse than removing __annotations__ completely from the dict: now whether or not __dict__["__annotations__"] works is path-dependent.

This makes me think we should either use a strategy where __dict__["__annotations__"] continues to work reliably (Alex's proposal, now in this PR), or we should go for a clean break and never set __annotations__ in the dict. If so, we should probably do the same for modules.

It also occurred to me that we may want to avoid using the class dict as storage for __annotate__. In the future, we may want to make optimizations so that __annotate__ is not stored as a function, but as just a code object that gets lazily materialized into a function. If we guarantee that cls.__dict__["__annotate__"] is a function, that optimization becomes harder to achieve.

Or alternatively, if we used the "cache annotations under a different name than __annotations__" strategy, but only for subclasses of type.

We could do something like this, but I'd rather have all classes work similarly, instead of special-casing some categories of classes.

AlexWaygood commented 2 months ago

In the proof-of-concept gist currently, @carljm is correct that every class -- even if it has no annotations -- gets a unique descriptor instance in its class dictionary:

class Object:
    """Pretend this is how builtins.object would work"""

    __annotate__ = None
    annotations = {}

    def __init_subclass__(cls):
        cls.annotations = _AnnotationsDescriptor(cls)
        if "__annotate__" not in cls.__dict__:
            cls.__annotate__ = None

But this does indeed seem unnecessary for classes that have no annotations. Instead, we could have a single descriptor instance in the class dictionary for builtins.object that is reused for all classes with no annotations:

class Object:
    """Pretend this is how builtins.object would work"""

    __annotate__ = None

    def __init_subclass__(cls):
        if "__annotate__" in cls.__dict__:
            cls.annotations = _AnnotationsDescriptor(cls)
        else:
            cls.__annotate__ = None
            cls.annotations = Object.__dict__["annotations"]

Object.annotations = _AnnotationsDescriptor(Object)

This should result in dramatically fewer descriptor instances being produced at class creation time. We would still create a fresh descriptor instance for every class that does have annotations and insert that instance into the class dictionary; but this is not so different to the way that an __annotations__ dictionary is inserted into the class dictionary for these classes in Python 3.13 anyway.

JelleZijlstra commented 2 months ago

@AlexWaygood that becomes problematic if the __annotations__ dict is modified. I can see a few options:

There is a single shared dict used by all unannotated classes. But now modifying that dict on one class affects all other unannotated classes.
Every time the descriptor is accessed, it returns a fresh dict. But now if you modify a class's __annotations__, the changes won't be preserved if you access __annotations__ again.
There is a lazily-created annotations dict for every class. But that means we need a per-class place to put it, and we're back where we started.
Instead of a mutable dict we return an immutable mappingproxy. Probably the best option, but still breaks use cases that rely on modifying the annotations dict.

carljm commented 2 months ago

It also occurred to me that we may want to avoid using the class dict as storage for __annotate__. In the future, we may want to make optimizations so that __annotate__ is not stored as a function, but as just a code object that gets lazily materialized into a function. If we guarantee that cls.__dict__["__annotate__"] is a function, that optimization becomes harder to achieve.

This is a really good point. That optimization was already implemented in the prototype implementation of PEP 649 that was available at the time of its acceptance, and although the optimization wasn't specified in the text of PEP 649, it was discussed publicly in advocating for the PEP prior to its acceptance, and I think it may turn out to be a valuable optimization. So it would be great if we can avoid closing the door to it.

JelleZijlstra commented 1 month ago

Coming back to this after a month, I'm now leaning towards the approach that the current version of this PR rejects:

Bypass normal attribute lookup when accessing these attributes, instead invoking the base class descriptor (from e.g., type.__dict__["__annotations__"]) directly.

This approach has the disadvantage that odd things may still happen if you access .__annotations__ directly, but on the other hand it can be implemented without adding any additional complexity or behavior changes to normal classes.

The details would be that we'd use this trick in annotationlib.get_annotations, and document clearly that accessing .__annotations__ directly on a class is unsafe.

This also implies we should add an annotationlib.get_annotate to safely return the __annotate__ function in the presence of metaclasses.

python / peps

PEP 749: Add section on metaclasses #3847