Document which operations ignore overridden methods

JukkaL commented 4 years ago

Currently we don't honor overrides for certain operations on primitive types. If a subclass of list overrides __getitem__, for example, we still use list.__getitem__ for instances of the subclass if the static type is list. The motivation is improving performance of some of the most common operations.
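A minimal sketch of the situation, for illustration only: `DoublingList` and `first` are hypothetical names, and the snippet runs under the plain interpreter. It shows what the override returns next to what a direct `list.__getitem__` call returns, which is the result compiled code would produce if it ignores the override as described above.

```python
from typing import List


class DoublingList(list):
    """Hypothetical list subclass that overrides __getitem__."""

    def __getitem__(self, index):
        # Return twice the stored value instead of the value itself.
        return 2 * list.__getitem__(self, index)


def first(items: List[int]) -> int:
    # The static type is list. In the interpreter this dispatches to the
    # override; per the description above, compiled code may instead use
    # list.__getitem__ directly.
    return items[0]


values = DoublingList([1, 2, 3])

# Interpreter semantics: the override is honored.
interpreted_result = first(values)

# What bypassing the override looks like: call the base method directly.
direct_result = list.__getitem__(values, 0)

print(interpreted_result, direct_result)
```

The two results differ (2 versus 1), which is the deviation from Python semantics that this issue is about.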

We don't have a detailed policy about this, and it's often unclear when implementing new primitives what to do about subclasses. I propose that we restrict ignoring overrides to only a few specific operations. It would also be good to document this clearly and have a term that we can use to refer to these operations.

We don't support overrides for any value types, including int and fixed-width tuples. This is something we can't change easily, since we actually convert values at runtime and lose the original type. I take this as a given.

I think that additionally at least these list operations can ignore overrides:

List __getitem__ with int index
List __setitem__ with int index
len(listobj)
for x in listobj (and related, such as with enumerate)
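As a rough illustration of what ignoring overrides would mean for `len()` and iteration, here is a sketch with a hypothetical subclass `EvenOnly`. It runs in the interpreter and compares the overridden behavior with direct calls to the base `list` methods, which stand in for what a primitive that ignores overrides would compute.

```python
class EvenOnly(list):
    """Hypothetical subclass that hides odd elements from len() and iteration."""

    def __len__(self):
        # Count only the even elements, reading storage via the base iterator.
        return sum(1 for x in list.__iter__(self) if x % 2 == 0)

    def __iter__(self):
        # Yield only the even elements.
        return (x for x in list.__iter__(self) if x % 2 == 0)


data = EvenOnly([1, 2, 3, 4])

# Python semantics: len() and the for loop go through the overrides.
with_overrides = (len(data), [x for x in data])

# Base-class behavior, as used when overrides are ignored.
without_overrides = (list.__len__(data), [x for x in list.__iter__(data)])

print(with_overrides)
print(without_overrides)
```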

list.append is another candidate, though for many uses of append we could use static analysis to infer that the concrete type is always list (unlike the operations above).

In particular, I think that dict operations and variable-length tuple operations should always support overrides.

Discussion:

This is a deviation from Python semantics and can be confusing. It makes sense to minimize the differences. This should be documented like other deviations.
Fundamental list operations potentially benefit from this the most, as there are various loop optimizations that could be applied. If list objects can override basic operations, optimizing loops would require compiling the loop twice, once for list objects and once for subclasses. This would slow down compilation, increase code size, and complicate the compiler.
Most other operations, including dict operations, aren't performance-sensitive enough to see significant performance improvements from ignoring overrides.
It's somewhat common to subclass dictionaries and override basic behavior (e.g., defaultdict).
Probably the most common dict operation (dict.__getitem__) already has to deal with overrides anyway (due to stdlib dict subclasses), so there's not much justification for ignoring overrides in less common operations, in my opinion.
Subclassing list and overriding basic behavior is not as common. I found two examples in the stdlib Python code where this happens (outside tests), and one of them was only used internally in a class.
Other list operations beyond those mentioned above either are not an issue (e.g. [x] * n usually can be optimized anyway), are less common, or have a smaller relative performance impact.
To work around the difference in semantics, code can use non-list types such as Any, Sequence, MutableSequence, or the subclass type. These will match Python semantics.
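The workaround in the last point could look roughly like this. The names `Shifted`, `get_as_list` and `get_as_sequence` are hypothetical, and the comments about compiled behavior restate the proposal rather than anything the snippet itself demonstrates, since it runs under the interpreter.

```python
from typing import List, Sequence


class Shifted(list):
    """Hypothetical list subclass that overrides __getitem__."""

    def __getitem__(self, index):
        return list.__getitem__(self, index) + 100


def get_as_list(items: List[int], i: int) -> int:
    # Declared as list: under the proposal, compiled code may use
    # list.__getitem__ here and skip the override.
    return items[i]


def get_as_sequence(items: Sequence[int], i: int) -> int:
    # Declared as Sequence: a non-list static type, so the lookup is
    # expected to follow normal Python semantics and honor the override.
    return items[i]


shifted = Shifted([1, 2, 3])

# In the interpreter both calls honor the override.
print(get_as_list(shifted, 0), get_as_sequence(shifted, 0))
```

Annotating the parameter with the subclass type itself (`Shifted` here) would be another way to keep the override in effect.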

Thoughts?

ilevkivskyi commented 4 years ago

I am not sure we need any general rules here; we can rather decide on a case-by-case basis (e.g. for dictionary iterators we decided to handle subclasses safely). However, ideally all deviations from CPython behavior should be clearly documented.

JukkaL commented 4 years ago

I prefer having the general rule for several reasons:

Less friction for contributors and code reviewers: I remember having these same discussions many times when new primitives were added. If we have a general rule, contributors can know beforehand what is expected. Reviewers have an easier time, as they don't need to perform corpus analysis, manual testing or other open-ended investigations to decide whether a primitive should allow overrides.
Learnability: If the general rule says that there are only, say, 4 primitives that are special, it's much easier to learn than if there were, say, 20 primitives like this. Also, mistakes are less likely.
Compatibility: Again, the fewer deviations we have, the easier it will be to migrate code to mypyc. If we have a general rule of not adding deviations like these (unless there is a very compelling argument), the number of deviations is unlikely to grow much. And even if a reviewer can't find a reason why a particular primitive should allow overrides, it's possible that there's some common use case that we aren't aware of where it would cause friction.
Easier migration between mypyc releases: If the set of deviations is mostly fixed, it will be easier to migrate to a newer mypyc version. Otherwise users will need to monitor release notes for additional primitives that have changed behavior and look for instances of those in their codebases. Each new primitive that disallows overrides is a compatibility break with earlier mypyc versions.

The same arguments arguably apply to special casing builtin functions, to a certain extent. However, monkey patching builtin functions seems very rare and is a questionable practice, whereas subclassing builtin classes happens a lot, and even the implementation of mypy does this.

JukkaL commented 4 years ago

However, monkey patching builtin functions seems very rare...

Besides, monkey patching is already heavily restricted, so users need to learn about monkey patching in any case when starting to use mypyc.

JukkaL commented 4 years ago

@msullivan Do you have any thoughts about this proposal?

mypyc / mypyc

Document which operations ignore overridden methods #722