Faster, parallel type checking

JukkaL commented 9 years ago

If we have a sufficiently large program to type check we should be able to speed up type checking significantly by using multiple type checker processes that work in parallel.

One potential way to implement this:

Have a coordinator process that maintains a module dependency graph and keeps track of a work list of modules that haven't been type checked. Initially this will contain all modules in the program.
Have multiple worker processes that repeatedly get (sets of cyclically dependent) modules from the coordinator that have their dependencies processed, type check them and mark them as processed on the coordinator.

For this to really work we are going to need a way of efficiently communicating type check results for a module between processes (to avoid type checking shared dependencies multiple times). Having a JSON serialization format (see #932) would be sufficient.

Additionally we need a quick way of figuring out the dependency graph of all modules (or at least an approximation of it). We'll probably have to cache that between runs and update it incrementally, similar to the approach outlined in #932.

So how much would this help? Obviously this depends on the shape of the dependency graph. Under reasonable assumptions we should be able to hit an effective level of parallelism of at least 4 for larger programs, but I wouldn't be surprised if we could get even better than that. Cyclic module dependencies can add a limit to how far we can parallelize. We can probably estimate the achievable level of parallelism for a particular program by analyzing the module dependency graph.

This is probably only worth implementing after we have incremental type checking (#932), and we should preserve incrementalism -- i.e., we'd only type check modules modified since the last run and modules that depend on them.

jhance commented 9 years ago

To give an idea of the scope of this, I think it would be useful to know precisely what data needs to be shared.

Is it enough to give the slave processes the semantically analyzed tree of the modules it depends on?

Idea: In this case we can envision changing the current build process from, instead of directly processing the next ready state, to finding states that are ready and adding them to a queue, which get assigned to slave processes. Slave processes will be sent the semantically analyzed trees of modules they depend on before being asked to run a pass on a particular tree. They will then send back the resulting tree (or whatever data).

JukkaL commented 9 years ago

I explained a possible serialization format in #932. We should be able to use the same format to share data about modules in parallel builds.

A slave process could work like this, I think:

Get a (cyclically dependent) set of files to type check from the coordinator process. For files with no cyclic dependencies the set will have just a single file. Exit if no more files are remaining in the work list at the coordinator. The get operation may block, in case the coordinator needs to wait for some other workers to finish their current jobs first.
If none of the files are more recent than cached serialized data about the files, just do nothing and skip to the last step. Alternatively, the coordinator can do this check and only send files that need to be processed.
Type check the files exactly like we'd do it in a single process case. In particular, process dependencies normally. We are guaranteed that the dependencies all have up-to-date serialized data available so dependencies will be quick to process. Send any type errors to the coordinator so that it can serialize output.
Store serialized data about all the files that we just processed on the disk so that other workers can reuse our results.
Notify coordinator that we're finished with these files. Loop back to beginning. Maintain state so that we only have to process each dependency at most once per worker.

I'm less clear about how the coordinator would work. Some ideas:

First incrementally update module dependency information for all modules, and cache that in separate JSON files, for example. For each file we'd maintain a list of directly imported modules. This data needs to be invalidated for all files whenever the module search path changes. Maybe we can use the standard library Python parser module to find imports in each module, as the mypy parser is pretty slow and we don't want this to become a bottleneck.
Load and analyze the module dependency graph to at least find sets of cyclically dependent files. Again, this needs to be quick. From this point on, every unit of type checking is a set of cyclically dependent files.
Divide units into four mutually exclusive states: 1) unprocessed units that have unprocessed dependencies, 2) unprocessed units that are ready to be type checked, 3) units being processed by a worker, 4) processed units.
Fire up N workers.
Send units in state 2 to workers and change them to state 3. If there are no units in state 2 but there are units in state 1, we may have to wait until a unit gets added to state 1. If there are no units in states 1 or 2, we can exit any workers that have nothing left to do.
Once a worker finished with a unit, switch it to state 4. Potentially move some units from state 1 to 2 if they now have all their dependencies processed.
We are finished when every unit is in state 4.

fzyzcjy commented 1 year ago

Hi, is there any updates? thanks

manmartgarc commented 1 year ago

This would be really cool. Is there any way I can help?

JukkaL commented 1 year ago

@manmartgarc Any help would be very welcome! I'm happy to chat about the details over Zoom if you are interested in working on this (you can find my email in the mypy commit history).

jgarvin commented 9 months ago

Currently I invoke mypy on all my files like mypy $(find . -name "*.py"). It's starting to get a bit slow so I found this ticket. Until real support is added, is it expected to work if I run two separate mypy instances myself at the same time on different files? Then I could still parallelize at the build system level with make or similar. I realize this may lead to some redundant work but it might still come out ahead. I'm mostly concerned that the instances might conflict both trying to write to mypy's cache.

aiven-anton commented 9 months ago

@jgarvin In my experience, running mypy only on a part of a code-base sometimes gives different results from a full mypy run. However, I believe the intention is for this to work, and any deviation is really a bug. I assume you would need to use separate cache directories for the processes to not corrupt each other (not sure this is true though).

Also just noting, when doing a full mypy run, there should usually not be any need to filter files with find, you can just configure mypy with your source directory and it will traverse it.

jgarvin commented 9 months ago

Hmm mypy cache on a relatively small project (<20k lines) is ~40MB. So for a 128 core machine I'd be spending 4GB of disk on mypy caches which in the grand scheme of things is not huge disk space consumption but I have to imagine touching that much disk must slow down the checking. I looked at the caches to see what was responsible for the size and it seems it's a ton of JSON files which is not a very compact format. Maybe pickle would work better, or sqlite.

dosisod commented 9 months ago

@jgarvin #15731 and #15981 include some methods on reducing the size of the cached JSON files. I mentioned it in the comments somewhere, but pickling might take up more space since we only store certain fields in the JSON files, meaning pickling could potentially include data we don't need/want (have yet to look into this).

Mypy does have a SQLite cache option, though it basically just stores the JSON data and filename in a table, see https://github.com/python/mypy/issues/3456#issuecomment-630501607 .

python / mypy

Faster, parallel type checking #933