hauntsaninja / mypy_primer

Run mypy and pyright over millions of lines of code
MIT License

Better balanced shards #58

Closed. A5rocks closed this issue 1 year ago.

A5rocks commented 1 year ago

I would love it if mypy_primer balanced its shards better. On a recent PR to mypy, I noticed that the shard runtimes were quite uneven:

(Note that mypy-primer could take ~10 minutes less if the shards were optimally balanced.)

I know that it would be infeasible to construct lists for every single combination, so I propose:

What if every project had a "difficulty" number that roughly estimated how long mypy takes to type check it? The idea is that you could balance these numbers across buckets with a greedy approach: go from the largest difficulty to the smallest, always putting each project in the bucket with the lowest total so far (see the sketch below).
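Here is a minimal sketch of that greedy idea (essentially longest-processing-time-first scheduling); the `balance_shards` helper, the project names, and the difficulty scores are all made up for illustration:

```python
import heapq

def balance_shards(difficulties: dict[str, float], num_shards: int) -> list[list[str]]:
    """Greedily assign projects to shards: largest difficulty first,
    always into the shard with the smallest running total."""
    shards: list[list[str]] = [[] for _ in range(num_shards)]
    # Min-heap of (total difficulty so far, shard index).
    heap = [(0.0, i) for i in range(num_shards)]
    heapq.heapify(heap)
    for project, score in sorted(difficulties.items(), key=lambda kv: -kv[1]):
        total, idx = heapq.heappop(heap)
        shards[idx].append(project)
        heapq.heappush(heap, (total + score, idx))
    return shards

# Made-up scores, roughly "seconds of mypy runtime":
print(balance_shards({"sympy": 120, "pandas": 90, "graphql": 60, "aiohttp": 20, "attrs": 10}, 2))
# [['sympy', 'aiohttp', 'attrs'], ['pandas', 'graphql']]
```

This doesn't guarantee an optimal split, but for a head-heavy distribution it gets close.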

I'm not sure how we could keep those numbers up to date, though. Is there a metric that is cheap to compute but correlates with mypy runtime? Number of files? Number of dependencies? Lines of code? Occurrences of `import typing`?
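One way to test candidates like those would be to compute cheap proxies per checked-out project and eyeball the correlation against measured runtimes; the `cheap_metrics` helper below and the metrics it counts are assumptions for illustration:

```python
from pathlib import Path

def cheap_metrics(project_dir: Path) -> dict[str, int]:
    """Count a few cheap proxies that might correlate with mypy runtime."""
    texts = [f.read_text(errors="ignore") for f in project_dir.rglob("*.py")]
    return {
        "files": len(texts),
        "loc": sum(len(t.splitlines()) for t in texts),
        "typing_imports": sum(t.count("import typing") for t in texts),
    }
```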

hauntsaninja commented 1 year ago

Yeah, agreed that this would be nice.

The distribution is quite head-heavy (cough sympy, pandas, graphql cough), so I think you could get most of the benefit by just adding a manual score to the longest-running projects. `mypy_primer --measure-project-runtimes --concurrency 1` should show project runtimes.
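For instance, the manual-score idea could be as small as an override table for the heavy hitters, with everything else defaulting to 1; the weights below are illustrative, not measured, and the resulting costs could feed a greedy balancer like the one sketched above:

```python
# Hypothetical hand-tuned costs for the known slow projects;
# the numbers are illustrative, not measured.
MANUAL_COSTS = {"sympy": 50, "pandas": 30, "graphql-core": 20}

def project_cost(name: str) -> int:
    # Unlisted projects are assumed to be cheap.
    return MANUAL_COSTS.get(name, 1)
```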

An amusing fact: at one point I noticed there was a particularly bad sharding, so my quick fix was: https://github.com/hauntsaninja/mypy_primer/blob/236dab370d45dccd2ac17e67180cd7d3e99248af/mypy_primer.py#L60

A random musing: something I've been curious about but haven't looked into yet is how the speed difference between mypyc-compiled mypy and pure-Python mypy varies across projects.
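If anyone does dig into that, a rough sketch of the measurement (the venv paths and project path are assumptions) would be timing the same mypy invocation under a compiled and an interpreted install:

```python
import subprocess
import time

def time_mypy(mypy_bin: str, project_dir: str) -> float:
    """Wall-clock time for one mypy run; ignores the exit status."""
    start = time.perf_counter()
    subprocess.run([mypy_bin, project_dir], capture_output=True)
    return time.perf_counter() - start

# Clear .mypy_cache between runs, since mypy's incremental cache
# makes repeat runs much faster.
# compiled = time_mypy(".venv-compiled/bin/mypy", "repos/sympy")
# interpreted = time_mypy(".venv-pure/bin/mypy", "repos/sympy")
# print(f"mypyc speedup: {interpreted / compiled:.2f}x")
```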