Lazily process installed packages

JukkaL commented 1 month ago

Overview

Currently if you import anything from a third-party package that has inline types, mypy will process all the transitive dependencies of the imported module as well. This can be pretty slow, since some packages have hundreds or thousands of dependencies (e.g. torch, but there are probably many others).

We could speed this up by only processing those dependencies that are needed to type check code that uses the third-party package. This is possible, since we won't generally report errors from installed packages. This wouldn't be possible for normal code, since we could have false negatives in code that we don't process. We'd process (some) imported definitions lazily.

Example

Assume we have a third-party package acme that has thousands of recursive module dependencies. Now we have user code that only uses one function from the top-level module:

from acme import do_stuff
do_stuff()

We might only need to process acme/__init__.py to type check this code. Most of the 1000 dependencies can be ignored, and they don't even need to be parsed. However, if do_stuff or other functions in acme/__init__.py use a type in an annotation that is defined in a submodule of acme, we might need to process modules that define those types as well, and any dependencies they might have. (This assumes module-level granularity of laziness. It's easy to also imagine definition-level laziness, so that only the do_stuff function would have to be processed.)
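To make the difference concrete, here is a rough sketch that compares the set of modules reached by following every import with the set reached by following only the imports that signatures depend on. The module names under acme and both graphs are invented for illustration; nothing here reflects how mypy represents its build graph.

```python
from collections import deque


def reachable(start: str, edges: dict[str, set[str]]) -> set[str]:
    """Return every module reachable from start by following edges."""
    seen = {start}
    queue = deque([start])
    while queue:
        mod = queue.popleft()
        for dep in edges.get(mod, set()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


# Hypothetical import graph: one edge per import statement.
all_imports = {
    "acme": {"acme.core", "acme.io", "acme.plugins"},
    "acme.core": {"acme.types"},
    "acme.io": {"acme.net"},
    "acme.plugins": {"acme.io"},
}

# Hypothetical subset of edges that annotations in acme/__init__.py rely on.
signature_deps = {
    "acme": {"acme.types"},
}

eager = reachable("acme", all_imports)    # what gets processed today
lazy = reachable("acme", signature_deps)  # what module-level laziness would need

print(len(eager), sorted(lazy))
```

In this toy graph the eager closure contains six modules, while the lazy one contains only acme and acme.types.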

Implementation sketch

Here's a sketch of a potential implementation:

- Add a flag for packages where we don't report errors in any recursive dependencies. This should be enabled for installed packages and stubs.
- Make sure installed packages and stubs can't import code where we do report errors. They should only import stubs and other installed packages. Otherwise there would be false negatives.
- When processing an import targeting a "recursive no-error" package, initially don't consider any module dependencies. Add import placeholders to the symbol table for imported symbols (unless they are already available).
- If any code uses an import placeholder, defer the current node and keep track of the placeholder's target name. If any deferrals are due to import placeholders, process the import placeholders first before reprocessing the deferred nodes.
- A name that is available only as a placeholder would be "unresolved".
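The placeholder and deferral steps could look roughly like the toy model below. All names here (ImportPlaceholder, Deferred, ToyAnalyzer) are hypothetical and are not mypy internals; "nodes" are reduced to lists of the names they use, and a definition is just a string.

```python
from dataclasses import dataclass, field


@dataclass
class ImportPlaceholder:
    target: str  # fully qualified name the import points at


class Deferred(Exception):
    """Raised when a node touches a name that is still a placeholder."""

    def __init__(self, target: str) -> None:
        super().__init__(target)
        self.target = target


@dataclass
class ToyAnalyzer:
    definitions: dict[str, str]  # stands in for the installed package
    symbols: dict[str, object] = field(default_factory=dict)
    resolved_log: list[str] = field(default_factory=list)

    def add_import(self, local_name: str, target: str) -> None:
        # Only add a placeholder if the symbol is not already available.
        if local_name not in self.symbols:
            self.symbols[local_name] = ImportPlaceholder(target)

    def lookup(self, name: str) -> object:
        sym = self.symbols[name]
        if isinstance(sym, ImportPlaceholder):
            raise Deferred(sym.target)
        return sym

    def resolve(self, target: str) -> None:
        value = self.definitions[target]
        for name, sym in list(self.symbols.items()):
            if isinstance(sym, ImportPlaceholder) and sym.target == target:
                self.symbols[name] = value
        self.resolved_log.append(target)

    def analyze(self, nodes: list[list[str]]) -> list[list[object]]:
        results: dict[int, list[object]] = {}
        pending = list(enumerate(nodes))
        while pending:
            deferred = []
            needed: list[str] = []
            for idx, names in pending:
                try:
                    results[idx] = [self.lookup(n) for n in names]
                except Deferred as exc:
                    deferred.append((idx, names))
                    if exc.target not in needed:
                        needed.append(exc.target)
            # Resolve placeholders first, then reprocess the deferred nodes.
            for target in needed:
                self.resolve(target)
            pending = deferred
        return [results[i] for i in range(len(nodes))]


analyzer = ToyAnalyzer(
    definitions={
        "acme.do_stuff": "def () -> None",
        "acme.helpers.unused": "def () -> int",
    }
)
analyzer.add_import("do_stuff", "acme.do_stuff")
analyzer.add_import("unused", "acme.helpers.unused")
result = analyzer.analyze([["do_stuff"]])
```

After this run only acme.do_stuff has been resolved; the unused import is still a placeholder, which is the saving the sketch is after.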

Discussion

- Imports that are only used within function bodies in installed packages don't need to be resolved. This could be a big win.
- Imports of other modules/functions/classes within an installed package that are not used don't need to be resolved. This could help with packages that have massive public APIs.
- It would be easier if we can resolve all placeholders during semantic analysis. I'm not sure if this is possible, at least due to modules being able to implement protocols.
- If we analyze a class, we should maybe resolve all import placeholders in the class and any base classes to avoid having to resolve them during type checking.
- Also if we analyze a type annotation, we should maybe resolve all placeholders related to the type so that we can perform type checking without dealing with unresolved references.
- As an optimization, we could do a quick AST pass to determine any imported names that are definitely needed based on a shallow syntactic analysis (e.g. look for from acme import submodule). When processing a module, we'd resolve these first to avoid numerous deferrals. I'm not sure if this would be a big perf win or not.
- Since big SCCs are common, we need to be able to do this within SCCs, not just between SCCs.
- Circular dependencies need some care. We already support them, and probably the existing approach could be generalized.
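For the quick AST pass, something along these lines might be enough as a first approximation. It uses the standard library ast module (not mypy's own parser), and the rule for "definitely needed" is an assumption made for this sketch: an imported name counts if it appears in a top-level function signature or in a class's base list. String annotations and nested scopes are ignored.

```python
import ast


def definitely_needed_imports(source: str) -> set[str]:
    """Return import targets referenced by top-level signatures or bases."""
    imported: dict[str, str] = {}  # local name -> fully qualified target
    used: set[str] = set()         # names seen in signatures and base classes
    for stmt in ast.parse(source).body:  # shallow: top-level statements only
        exprs = []
        if isinstance(stmt, ast.ImportFrom) and stmt.module and stmt.level == 0:
            for alias in stmt.names:
                imported[alias.asname or alias.name] = f"{stmt.module}.{alias.name}"
        elif isinstance(stmt, ast.Import):
            for alias in stmt.names:
                imported[alias.asname or alias.name.split(".")[0]] = alias.name
        elif isinstance(stmt, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = stmt.args
            params = args.posonlyargs + args.args + args.kwonlyargs
            exprs = [p.annotation for p in params] + [stmt.returns]
        elif isinstance(stmt, ast.ClassDef):
            exprs = list(stmt.bases)
        for expr in exprs:
            if expr is not None:
                used.update(n.id for n in ast.walk(expr) if isinstance(n, ast.Name))
    return {target for name, target in imported.items() if name in used}


SOURCE = """
from acme.types import Config
from acme.net import fetch
import acme.plugins

def do_stuff(cfg: Config) -> None:
    fetch(cfg)
"""

needed = definitely_needed_imports(SOURCE)
```

Here only acme.types.Config is reported, since fetch is used only inside a function body and acme.plugins is never referenced in a signature.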

Modules used as protocol-typed values could be an issue, since this could require arbitrary attributes (including nested imported modules) to be available, and we'd only know about this during type checking. So we might need to defer during type checking, and within various type operations such as subtype checks. This is probably pretty rare but still currently supported. Since this is expected to be rare, this doesn't need to be super efficient.
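As a reminder of what this case looks like in user code, here is a small example of a module passed where a protocol type is expected. The Serializer protocol and roundtrip function are made up for illustration. To check the call, a type checker has to look up dumps and loads on the json module, so under a lazy scheme those attributes could not be left as unresolved placeholders at that point.

```python
import json
from typing import Any, Protocol


class Serializer(Protocol):
    def dumps(self, obj: Any) -> str: ...
    def loads(self, s: str) -> Any: ...


def roundtrip(serializer: Serializer, value: Any) -> Any:
    # The argument may be any object with compatible dumps/loads,
    # including a module object.
    return serializer.loads(serializer.dumps(value))


result = roundtrip(json, {"a": 1})
```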

JukkaL commented 1 month ago

The approach outlined above would be quite hard to combine with parallel type checking. The only approach that comes to mind that might work is to have every parallel worker process all needed third-party dependencies (including stubs) independently, without any (or much) sharing of work. This might still be a win, if the speedup from lazy processing is bigger than could be achieved from proper parallel processing of third-party dependencies.

It may be better to start by trying to speed up the processing of third-party dependencies overall, i.e. make it faster to process thousands of modules where we don't report errors.

JukkaL commented 1 month ago

Because of the issues outlined above, we should probably start with lazily deserializing cache files. This would be easier to implement, and it would have some other nice properties:

Incremental runs are likely more common than non-incremental runs (outside CI).
This would help with all modules, not just third-party dependencies.
This should play nicely with parallel type checking.
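The general idea of lazy deserialization can be sketched as follows. The LazyModuleCache class and the one-JSON-file-per-module layout are assumptions made for this example and do not reflect mypy's actual cache format or loading code.

```python
import json
import tempfile
from pathlib import Path


class LazyModuleCache:
    """Deserialize a module's cached data only when it is first requested."""

    def __init__(self, cache_dir: Path) -> None:
        self.cache_dir = cache_dir
        self._loaded: dict[str, dict] = {}

    def get(self, module: str) -> dict:
        if module not in self._loaded:
            path = self.cache_dir / f"{module}.json"
            self._loaded[module] = json.loads(path.read_text())
        return self._loaded[module]

    @property
    def loaded_modules(self) -> set[str]:
        return set(self._loaded)


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    # Write hypothetical cache files for three modules.
    for name in ("acme", "acme.core", "acme.io"):
        (cache_dir / f"{name}.json").write_text(json.dumps({"names": [name]}))

    cache = LazyModuleCache(cache_dir)
    data = cache.get("acme")       # only this file is read and parsed
    loaded = cache.loaded_modules
```

Three cache files exist on disk, but only the one for acme is deserialized.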

JukkaL commented 1 month ago

Lazily processing entire SCCs (sets of cyclically dependent modules) may still be practical, even when they are not cached. We'd always process a third-party SCC imported from first-party code in full (when it's not cached), but if a third-party SCC depends on some other SCCs, those could be processed lazily.

For example, if we have a big dependency bar with a 1000-module SCC that also imports another big library foo somewhere within those 1000 modules, and foo is also slow to process, we could at least avoid processing foo each time bar is imported unless it's actually needed. I think torch depends on numpy and perhaps a bunch of other libraries, so it could help there.
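A minimal model of this, with hypothetical names throughout (LazySccProcessor, the SCC labels, and the module names are all invented): the SCC imported from first-party code is processed in full right away, and SCCs it depends on are processed only once something from them is actually needed.

```python
class LazySccProcessor:
    def __init__(self, scc_of: dict[str, str]) -> None:
        self.scc_of = scc_of            # module name -> SCC label
        self.processed: list[str] = []  # SCCs processed so far, in order

    def _process(self, scc: str) -> None:
        if scc not in self.processed:
            self.processed.append(scc)

    def import_from_first_party(self, module: str) -> None:
        # Process the whole SCC containing the module, but none of the
        # SCCs it depends on.
        self._process(self.scc_of[module])

    def need_symbol_from(self, module: str) -> None:
        # Called when type checking actually needs a definition from
        # a module in a dependency SCC.
        self._process(self.scc_of[module])


def reachable_sccs(root: str, scc_deps: dict[str, set[str]]) -> set[str]:
    """SCCs an eager strategy would process starting from root."""
    seen = {root}
    stack = [root]
    while stack:
        for dep in scc_deps.get(stack.pop(), set()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


scc_of = {"bar.a": "bar", "bar.b": "bar", "foo.x": "foo", "baz": "baz"}
scc_deps = {"bar": {"foo"}, "foo": {"baz"}, "baz": set()}

proc = LazySccProcessor(scc_of)
proc.import_from_first_party("bar.a")
after_import = list(proc.processed)
proc.need_symbol_from("foo.x")
after_use = list(proc.processed)
eager = reachable_sccs("bar", scc_deps)
```

Importing bar.a processes only the bar SCC; foo is processed when one of its symbols is needed, and baz is never touched, whereas the eager closure covers all three.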
