theoremlp / rules_mypy

a Bazel mypy aspect
Apache License 2.0

Create cache for pypi modules #23

Open keith opened 1 month ago

keith commented 1 month ago

Currently, if multiple libraries in your dependency tree depend on a large pypi dependency such as pytorch, each time mypy runs on one of them, its own cache gets populated with info about pytorch. Then each time they are rerun, that info has to be recalculated (which takes on the order of minutes). Ideally it would be possible to share the mypy cache from one of these nested targets so this work only had to be done once, and so it would be more resilient to invalidations.

Is this feasible?

keith commented 1 month ago

Maybe there's a better way to solve this problem in general? We have multiple huge third-party deps that absolutely destroy mypy runtimes.

keith commented 1 month ago

The nuclear option would be to make the mypy runner a persistent worker that shares a hidden cache via Bazel, so the cache would be longer-lived. I have no idea if the mypy cache is sound enough for that.

mark-thm commented 1 month ago

If you have a setup like:

pytorch <- A
pytorch <- B
A <- B

It should already be the case that B receives a mypy cache that’s fully populated with the analysis of pytorch, at least to the extent A and B depend on the same modules in pytorch.

It’s more problematic if the last edge doesn’t exist, and thus each of A and B would independently analyze pytorch.
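
To make the shapes concrete, here's a minimal BUILD sketch (the target names and the @pypi//torch label are invented for illustration):

load("@rules_python//python:defs.bzl", "py_library")

# Connected shape: B depends on A, so B's mypy run receives a cache
# already populated with A's analysis of pytorch.
py_library(name = "a", srcs = ["a.py"], deps = ["@pypi//torch"])
py_library(name = "b", srcs = ["b.py"], deps = [":a"])

# Disconnected shape: no edge between A2 and B2, so each of them
# independently re-analyzes pytorch.
py_library(name = "a2", srcs = ["a2.py"], deps = ["@pypi//torch"])
py_library(name = "b2", srcs = ["b2.py"], deps = ["@pypi//torch"])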

We could run the aspect on external deps (currently we explicitly do not do this), at the cost of running type checking on pytorch itself — but that’s undesirable for a few reasons, not the least of which is that pytorch may not actually pass your configuration of mypy’s typecheck. I don’t think there’s an analyze-only mode, but if there is we could certainly run that instead.
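
One approximation that does exist is mypy's per-module ignore_errors option, which still analyzes the module (and should still populate the cache) but suppresses the diagnostics, e.g.:

[mypy-torch.*]
ignore_errors = True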

That said, I’ve found that even with large third-party deps, proper configuration of mypy doesn’t lead to checking the full third-party deps; rather, it extracts the type information from just the API surface you end up using.

Are you using the settings mentioned in https://github.com/theoremlp/rules_mypy?tab=readme-ov-file#configuring-mypy-with-mypyini ?
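
For reference, the relevant recommended setting there is roughly the following (a sketch, not the README verbatim; see the link for the full configuration):

[mypy]
follow_imports = silent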

mark-thm commented 1 month ago

I’d also mention that at some point cache performance may get in the way: in our use of this ruleset we disable the cache, because the network time is worse than rerunning mypy on several of our deep-in-the-graph targets. That’s probably down to poor cache tuning on our part, but I wanted to note it as something you might try to see if it helps your overall performance.

mark-thm commented 1 month ago

https://github.com/theoremlp/rules_mypy/blob/2e1867c021a701fb4717649368e9f3d13e87a5d1/mypy/private/mypy.bzl#L145

keith commented 1 month ago

It’s more problematic if the last edge doesn’t exist, and thus each of A and B would independently analyze pytorch.

This is our case: we have ~50 test targets that depend on torch, with literally zero connections between them.

I am currently testing with the recommended config. It seems like changing follow_imports = silent to follow_imports = skip fixes the perf issue, and if I add --verbose to the mypy command I can see that it's traversing thousands of torch files. Instead of making the change globally, scoping it to torch also seems to work around it:

[mypy-torch.*]
follow_imports = skip

(of course at the cost of losing some fidelity)
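
For comparison, the global version of the same change would just be:

[mypy]
follow_imports = skip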

mark-thm commented 1 month ago

Gotcha. I can try to give you an option to let the aspect run on a dep's py_library, which should produce a cache?

keith commented 1 month ago

Yeah, so you're thinking that if it looked at all direct deps, but not all transitive pypi deps, it would be enough? I think that could work.
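
A rough Starlark sketch of what I mean (illustrative only, not the actual rules_mypy implementation): an aspect can see a target's direct deps by reading ctx.rule.attr in its implementation, without listing "deps" in attr_aspects and walking the whole transitive graph:

def _mypy_cache_aspect_impl(target, ctx):
    # Direct deps are visible here even without transitive propagation.
    direct_deps = getattr(ctx.rule.attr, "deps", [])

    # ... run mypy over `target` plus `direct_deps`, emitting a cache
    # that downstream mypy invocations can seed from ...
    return []

mypy_cache_aspect = aspect(
    implementation = _mypy_cache_aspect_impl,
    # Deliberately no attr_aspects: the aspect does not propagate to the
    # full transitive pypi graph, only the targets it is applied to.
)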

keith commented 1 month ago

It's possible that in our case we'd just disable it for torch either way, I'm not sure. I guess it depends on how effective that was versus how useful having torch type validation would be.

mark-thm commented 1 month ago

I mean, I could enable you to selectively remove https://github.com/theoremlp/rules_mypy/blob/2e1867c021a701fb4717649368e9f3d13e87a5d1/mypy/private/mypy.bzl#L23-L24

...which would build a cache by type-checking pytorch and result in that cache being available to targets that depend on pytorch. In theory that cache could be built with error signaling disabled, which should be roughly an analysis pass.
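
A rough sketch of the shape that option might take (the helper and attribute names here are invented for illustration):

def _should_skip(target, ctx):
    # rules_mypy currently skips targets from external repos outright.
    if target.label.workspace_root.startswith("external"):
        # An opt-in allowlist could punch holes in that guard so a mypy
        # cache gets built for selected deps (e.g. pytorch), ideally with
        # error signaling disabled so it acts as an analysis-only pass.
        return str(target.label) not in ctx.attr._cache_allowlist
    return False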