Open JukkaL opened 9 years ago
To give an idea of the scope of this, I think it would be useful to know precisely what data needs to be shared.
Is it enough to give each slave process the semantically analyzed trees of the modules it depends on?
Idea: In this case we can envision changing the current build process so that, instead of directly processing the next ready state, it finds all states that are ready and adds them to a queue, from which they are assigned to slave processes. Each slave process will be sent the semantically analyzed trees of the modules it depends on before being asked to run a pass on a particular tree. It will then send back the resulting tree (or whatever data).
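The queue idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not how mypy's build process works: `process_module` and `run_parallel` are hypothetical names, the dependency graph is assumed to be acyclic, and the "result" is a placeholder dict rather than a real tree. The coordinator takes any `concurrent.futures.Executor`, so a `ProcessPoolExecutor` could be passed to get real worker processes.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def process_module(name, dep_results):
    """Hypothetical stand-in for running one pass on a module.

    dep_results holds the results of the modules it depends on.
    """
    return {"module": name, "deps_seen": sorted(dep_results)}


def run_parallel(deps, executor):
    """Schedule modules on `executor` as soon as their dependencies are done.

    deps maps module name -> set of module names it depends on
    (assumed acyclic). Returns (results, completion_order).
    """
    remaining = {m: set(d) for m, d in deps.items()}
    results = {}
    order = []
    in_flight = {}  # future -> module name
    while remaining or in_flight:
        # Find every module whose dependencies have all been processed.
        ready = [m for m, d in remaining.items() if d.issubset(results)]
        for m in ready:
            del remaining[m]
            dep_results = {d: results[d] for d in deps[m]}
            in_flight[executor.submit(process_module, m, dep_results)] = m
        if not in_flight:
            raise ValueError("dependency cycle or unknown module")
        # Block until at least one worker finishes, then collect results.
        done, _ = wait(in_flight, return_when=FIRST_COMPLETED)
        for fut in done:
            m = in_flight.pop(fut)
            results[m] = fut.result()
            order.append(m)
    return results, order


def demo():
    deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
    # Threads keep the demo self-contained; swap in ProcessPoolExecutor
    # to use separate processes.
    with ThreadPoolExecutor(max_workers=2) as pool:
        return run_parallel(deps, pool)
```

Here `b` and `c` can run concurrently once `a` is done, while `d` has to wait for both.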
I explained a possible serialization format in #932. We should be able to use the same format to share data about modules in parallel builds.
A slave process could work like this, I think:
I'm less clear about how the coordinator would work. Some ideas:
Hi, are there any updates? Thanks
This would be really cool. Is there any way I can help?
@manmartgarc Any help would be very welcome! I'm happy to chat about the details over Zoom if you are interested in working on this (you can find my email in the mypy commit history).
Currently I invoke mypy on all my files like mypy $(find . -name "*.py"). It's starting to get a bit slow so I found this ticket. Until real support is added, is it expected to work if I run two separate mypy instances myself at the same time on different files? Then I could still parallelize at the build system level with make or similar. I realize this may lead to some redundant work but it might still come out ahead. I'm mostly concerned that the instances might conflict both trying to write to mypy's cache.
@jgarvin In my experience, running mypy only on a part of a code-base sometimes gives different results from a full mypy run. However, I believe the intention is for this to work, and any deviation is really a bug. I assume you would need to use separate cache directories for the processes to not corrupt each other (not sure this is true though).
Also just noting, when doing a full mypy run, there should usually not be any need to filter files with find; you can just configure mypy with your source directory and it will traverse it.
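For the build-system-level workaround discussed above, a rough sketch might look like this. It splits the files into shards and gives each mypy instance its own directory via mypy's `--cache-dir` option. Whether separate cache directories are actually required is not confirmed in this thread, and `partition`, `build_commands`, `run_all`, and the `.mypy_cache_shards` directory name are all made up for illustration.

```python
import subprocess
import sys
from pathlib import Path


def partition(files, n):
    """Split files into at most n round-robin shards, dropping empty ones."""
    files = sorted(files)
    shards = [files[i::n] for i in range(n)]
    return [s for s in shards if s]


def build_commands(files, n, cache_root=".mypy_cache_shards"):
    """Build one mypy command line per shard, each with its own cache dir."""
    commands = []
    for i, shard in enumerate(partition(files, n)):
        cache_dir = str(Path(cache_root) / f"shard{i}")
        commands.append(
            [sys.executable, "-m", "mypy", "--cache-dir", cache_dir, *shard]
        )
    return commands


def run_all(commands):
    """Start all commands concurrently and return their exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]
```

As noted above, each shard will re-check shared dependencies on its own, so this trades redundant work and disk space for wall-clock time, and results may differ from a single full run.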
Hmm, the mypy cache on a relatively small project (<20k lines) is ~40MB. So for a 128 core machine I'd be spending about 5GB of disk (128 × 40MB) on mypy caches, which in the grand scheme of things is not huge disk space consumption, but I have to imagine touching that much disk must slow down the checking. I looked at the caches to see what was responsible for the size and it seems it's a ton of JSON files, which is not a very compact format. Maybe pickle would work better, or sqlite.
@jgarvin #15731 and #15981 include some methods for reducing the size of the cached JSON files. I mentioned it in the comments somewhere, but pickling might take up more space since we only store certain fields in the JSON files, meaning pickling could potentially include data we don't need/want (have yet to look into this).
Mypy does have a SQLite cache option, though it basically just stores the JSON data and filename in a table, see https://github.com/python/mypy/issues/3456#issuecomment-630501607 .
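To make the "JSON data and filename in a table" idea concrete, here is a minimal sketch using the standard `sqlite3` module. The table name, column names, and helper functions are hypothetical and are not claimed to match mypy's actual schema.

```python
import json
import sqlite3


def open_cache(path=":memory:"):
    """Open (or create) a cache database with one row per cache file."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache ("
        "filename TEXT PRIMARY KEY, data TEXT NOT NULL)"
    )
    return conn


def write_entry(conn, filename, payload):
    """Store a JSON-serializable payload under the given filename."""
    conn.execute(
        "INSERT OR REPLACE INTO cache (filename, data) VALUES (?, ?)",
        (filename, json.dumps(payload)),
    )
    conn.commit()


def read_entry(conn, filename):
    """Return the decoded payload, or None if there is no such entry."""
    row = conn.execute(
        "SELECT data FROM cache WHERE filename = ?", (filename,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

This replaces many small files with a single database file, but since the payload is still JSON text it would not by itself make the data more compact.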
If we have a sufficiently large program to type check we should be able to speed up type checking significantly by using multiple type checker processes that work in parallel.
One potential way to implement this:
For this to really work we are going to need a way of efficiently communicating type check results for a module between processes (to avoid type checking shared dependencies multiple times). Having a JSON serialization format (see #932) would be sufficient.
Additionally we need a quick way of figuring out the dependency graph of all modules (or at least an approximation of it). We'll probably have to cache that between runs and update it incrementally, similar to the approach outlined in #932.
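One cheap way to approximate the dependency graph is to parse each file and collect its import statements without doing any further analysis. The sketch below uses the standard `ast` module; `imported_modules` is a hypothetical helper, and it deliberately ignores relative imports, conditional imports, and dynamic imports, so it is only an approximation.

```python
import ast


def imported_modules(source):
    """Return the absolute module names imported anywhere in `source`."""
    tree = ast.parse(source)
    found = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # "import a.b as c" contributes "a.b".
            found.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports ("from . import x"); resolving them
            # needs the package path, which this sketch does not track.
            if node.level == 0 and node.module:
                found.add(node.module)
    return found
```

Running this over every source file and caching the result per file would give a graph that only needs to be recomputed for files that changed.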
So how much would this help? Obviously this depends on the shape of the dependency graph. Under reasonable assumptions we should be able to hit an effective level of parallelism of at least 4 for larger programs, but I wouldn't be surprised if we could get even better than that. Cyclic module dependencies can add a limit to how far we can parallelize. We can probably estimate the achievable level of parallelism for a particular program by analyzing the module dependency graph.
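A rough way to estimate the achievable parallelism from the dependency graph: collapse each import cycle into one unit, then divide the total work by the length of the longest dependency chain. The sketch below assumes every module costs one unit of work and that a cycle must be checked serially as a whole. Both are simplifications, and the function names are hypothetical.

```python
def strongly_connected_components(deps):
    """Tarjan's algorithm. Components are emitted dependencies-first."""
    index, low = {}, {}
    stack, on_stack = [], set()
    sccs = []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = low[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in deps.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            # v is the root of a component: pop it off the stack.
            comp = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.add(w)
                if w == v:
                    break
            sccs.append(frozenset(comp))

    for v in deps:
        if v not in index:
            visit(v)
    return sccs


def estimate_parallelism(deps):
    """Return total work divided by the critical path length."""
    sccs = strongly_connected_components(deps)
    comp_of = {m: c for c in sccs for m in c}
    finish = {}  # component -> earliest finish time with unlimited workers
    for comp in sccs:
        dep_comps = {
            comp_of[d]
            for m in comp
            for d in deps.get(m, ())
            if comp_of[d] != comp
        }
        start = max((finish[c] for c in dep_comps), default=0)
        finish[comp] = start + len(comp)
    total = sum(len(c) for c in sccs)
    critical = max(finish.values(), default=0)
    return total / critical if critical else 0.0
```

For example, three modules that all import one shared module give 4 / 2 = 2.0, while a straight chain of three modules gives 1.0. The recursive traversal is fine for a sketch but could hit Python's recursion limit on very deep graphs.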
This is probably only worth implementing after we have incremental type checking (#932), and we should preserve incrementalism -- i.e., we'd only type check modules modified since the last run and modules that depend on them.
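The set of modules to re-check can be sketched as the modified modules plus everything that transitively depends on them. This is a conservative, module-level approximation with hypothetical names; it says nothing about how #932 actually tracks changes.

```python
from collections import deque


def modules_to_recheck(deps, modified):
    """Return modified modules plus all their transitive dependents.

    deps maps module name -> iterable of module names it depends on.
    """
    # Invert the graph: dependency -> modules that import it.
    dependents = {}
    for mod, imports in deps.items():
        for dep in imports:
            dependents.setdefault(dep, set()).add(mod)

    stale = set(modified)
    queue = deque(modified)
    while queue:
        current = queue.popleft()
        for user in dependents.get(current, ()):
            if user not in stale:
                stale.add(user)
                queue.append(user)
    return stale
```

Only the modules in the returned set would then need to be scheduled across worker processes; everything else could be loaded from the cache.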