python / mypy

Optional static typing for Python
https://www.mypy-lang.org/

Lazily process installed packages #17924

Open JukkaL opened 1 month ago

JukkaL commented 1 month ago

Overview

Currently, if you import anything from a third-party package that has inline types, mypy will process all the transitive dependencies of the imported module as well. This can be pretty slow, since some packages have hundreds or thousands of transitive module dependencies (e.g. torch, but there are probably many others).

We could speed this up by only processing those dependencies that are needed to type check the code that uses the third-party package. This is possible because we generally don't report errors from installed packages. It wouldn't be possible for first-party code, since we could have false negatives in code that we don't process. We'd process (some) imported definitions lazily.

Example

Assume we have a third-party package acme that has thousands of transitive module dependencies. Now assume user code that only uses one function from the top-level module:

from acme import do_stuff
do_stuff()

We might only need to process acme/__init__.py to type check this code. Most of those dependencies can be ignored, and they don't even need to be parsed. However, if do_stuff or other functions in acme/__init__.py use a type in an annotation that is defined in a submodule of acme, we might need to process the modules that define those types as well, along with any dependencies they might have. (This assumes module-level granularity of laziness. It's easy to also imagine definition-level laziness, where only the do_stuff function would have to be processed.)
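To make the example concrete, here is a minimal toy sketch (not mypy code; all names and the dependency data are hypothetical) of the module-level laziness described above: only dependencies that appear in annotations we actually look at get pulled in, while other imports are skipped entirely.

```python
# Hypothetical sketch: process only modules reachable through annotation
# references, instead of eagerly processing every transitive import.

from typing import Dict, List, Set

# Toy data: modules referenced by a module's *annotations* (which we must
# process) vs. everything it imports. These names are made up.
ANNOTATION_DEPS: Dict[str, List[str]] = {
    "acme": ["acme.types"],          # do_stuff's signature uses acme.types
    "acme.types": [],
    "acme.internal": ["acme.util"],  # imported by acme, but never needed
    "acme.util": [],
}

ALL_IMPORTS: Dict[str, List[str]] = {
    "acme": ["acme.types", "acme.internal"],
    "acme.types": [],
    "acme.internal": ["acme.util"],
    "acme.util": [],
}

def modules_to_process(entry: str) -> Set[str]:
    """Start from the entry module and lazily pull in only the modules
    whose types appear in annotations we actually need."""
    needed: Set[str] = set()
    stack = [entry]
    while stack:
        mod = stack.pop()
        if mod in needed:
            continue
        needed.add(mod)
        # Follow annotation-level dependencies only, not all imports.
        stack.extend(ANNOTATION_DEPS[mod])
    return needed

lazy = modules_to_process("acme")
print(sorted(lazy))                    # ['acme', 'acme.types']
print(len(set(ALL_IMPORTS)) - len(lazy))  # 2 modules never even parsed
```

In the real design the "annotation dependencies" would of course be discovered on demand while type checking, not precomputed, but the reachable set ends up the same.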

Implementation sketch

Here's a sketch of a potential implementation:

Discussion


Modules used as protocol-typed values could be an issue, since this can require arbitrary attributes (including nested imported modules) to be available, and we'd only find out during type checking. So we might need to defer during type checking, and within various type operations such as subtype checks. This pattern is probably pretty rare but is currently supported; since it's expected to be rare, the handling doesn't need to be super efficient.
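For readers unfamiliar with the pattern, this is an example of a module used as a protocol-typed value (the protocol and function here are illustrative, not from mypy):

```python
# A module object passed where a Protocol is expected: mypy checks the
# module's attributes structurally against the protocol members.

import math
from typing import Protocol

class HasPi(Protocol):
    pi: float

def circle_area(m: HasPi, r: float) -> float:
    return m.pi * r * r

# The math module structurally satisfies HasPi. To verify this, the
# checker needs to know which attributes (potentially including nested
# imported submodules) the module exposes -- and under lazy processing
# that demand only surfaces here, mid-type-check, forcing a deferral.
print(circle_area(math, 1.0))
```

This is why the laziness can't be confined to import resolution: a subtype check deep inside type checking can suddenly require a module's full symbol table.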

JukkaL commented 1 month ago

The approach outlined above would be quite hard to combine with parallel type checking. The only workable idea that comes to mind is to have every parallel worker process all needed third-party dependencies (including stubs) independently, without any (or much) sharing of work. This might still be a win if the speedup from lazy processing is bigger than what proper parallel processing of third-party dependencies could achieve.

It may be better to start by trying to speed up the processing of third-party dependencies overall, i.e. make it faster to process thousands of modules where we don't report errors.

JukkaL commented 1 month ago

Because of the issues outlined above, we should probably start with lazily deserializing cache files. This would be easier to implement, and it would have some other nice properties:
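As a rough illustration of the lazy-deserialization idea, here is a hedged sketch: parse a module's cache file once, but only materialize each top-level definition when it is first looked up. The class, the file layout, and all names are hypothetical, not mypy's actual cache format.

```python
# Hypothetical sketch of lazily deserializing a per-module cache file.

import json
from typing import Any, Dict

class LazyModuleCache:
    def __init__(self, raw_json: str) -> None:
        # Parse only the outer JSON structure up front.
        self._raw: Dict[str, Any] = json.loads(raw_json)
        self._materialized: Dict[str, Any] = {}

    def lookup(self, name: str) -> Dict[str, Any]:
        # Deserialize a definition on first access, then memoize it.
        if name not in self._materialized:
            self._materialized[name] = self._deserialize(self._raw["names"][name])
        return self._materialized[name]

    def _deserialize(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # Stand-in for building a full symbol-table node.
        return {"kind": data["kind"], "fullname": data["fullname"]}

cache = LazyModuleCache(json.dumps({
    "names": {
        "do_stuff": {"kind": "func", "fullname": "acme.do_stuff"},
        "Helper": {"kind": "class", "fullname": "acme.Helper"},
    }
}))
print(cache.lookup("do_stuff")["fullname"])  # acme.do_stuff
print(len(cache._materialized))              # 1: Helper was never built
```

The attraction is that the expensive step (constructing symbol-table nodes) is skipped for every definition the current run never touches, while cache files themselves stay eagerly readable.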

JukkaL commented 1 month ago

Lazily processing entire SCCs (sets of cyclically dependent modules) may still be practical, even when they are not cached. We'd always process a third-party SCC imported from first-party code in full (when it's not cached), but if a third-party SCC depends on some other SCCs, those could be processed lazily.

For example, if we have a big dependency bar with a 1000-module SCC that also imports another big library foo somewhere within those 1000 modules, and foo is also slow to process, we could at least avoid processing foo each time bar is imported unless it's actually needed. I think torch depends on numpy and perhaps a bunch of other libraries, so it could help there.
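The SCC structure in that example can be made concrete with a small sketch (illustrative only, not mypy internals): Tarjan's algorithm on a toy import graph separates the cyclic bar modules from foo, showing that foo lands in its own SCC that could be deferred.

```python
# Illustrative sketch: compute SCCs of a toy import graph with Tarjan's
# algorithm. Module names are made up to mirror the bar/foo example.

from typing import Dict, List

GRAPH: Dict[str, List[str]] = {
    "bar.a": ["bar.b"],                # bar.a <-> bar.b: one cyclic SCC
    "bar.b": ["bar.a", "foo.core"],
    "foo.core": [],                    # separate SCC imported from bar
}

def tarjan_sccs(graph: Dict[str, List[str]]) -> List[List[str]]:
    index: Dict[str, int] = {}
    low: Dict[str, int] = {}
    on_stack: set = set()
    stack: List[str] = []
    sccs: List[List[str]] = []
    counter = 0

    def visit(v: str) -> None:
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph[v]:
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # v is the root of an SCC: pop its members off the stack.
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in graph:
        if v not in index:
            visit(v)
    return sccs

# foo.core forms its own SCC, so it could be processed lazily: only if
# something in bar's SCC actually needs a type defined in foo.
print([sorted(s) for s in tarjan_sccs(GRAPH)])
```

Tarjan conveniently emits SCCs in reverse topological order (dependencies first), which is also the order an eager build would process them; laziness amounts to skipping a dependency SCC until a lookup forces it.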