Bazel build integration

sixolet commented 7 years ago

I've been playing with getting Mypy to integrate with bazel for typechecking. I've got a basic version of it working. I intend to describe it here for discussion of the general approach, and then post a series of PRs for individual things it needs as I clean them up, if the general approach seems right.

The general approach: Make a mypy_library and mypy_binary macro that puts a typechecking aspect on the python files it contains. The aspect calculates the mypy cache files; give mypy a mode where it stops considering itself responsible for calculating freshness of dependencies, and instead let bazel handle that.

Specific PRs I expect to make:

A --cache-hermetically option to stop writing mtimes
A --cache-mapping option to specify a json file containing .data.json and .meta.json file locations by module id, both for cache files to read and new cache files to write (bazel likes being super specific about things like this; it seems cleaner to specify to mypy than to rely on mypy's cache directory layout). This option would also short-circuit the process of looking for source and checking it for freshness for the covered modules; just believe the cache files bazel is handing you.
Possibly a --module-without-init option (or some name, I don't know what it should be called) to allow mypy to believe things are modules without an __init__.py file. Bazel for some reason doesn't care whether you have them in your python source when you build. I can possibly hack around this by touching that file myself in the build directory in the bazel rule, instead.
A --ignore-module-prefix option to understand that certain parts of the module path are already accounted for, and all modules within the prefix to ignore should be treated as if they have empty __init__.py.
An actual Skylark implementation of the python typechecking macros, and a BUILD file for mypy.

That would be enough to get it working (and I have a prototype of that, which I need to split into separate PRs and clean up a lot if it seems an ok approach). To make it awesomer:

Speed up mypy's startup time. There might be low-hanging fruit?
Precompute the cache files for builtins and typeshed, so we never need to go looking for source there.
Implement some dependency pruning, so you don't even need to load all the cache files all the time.

I'm submitting this description before I'm ready to submit PRs deliberately, so I am not super attached to any one approach. Suggestions?

ethanhs commented 7 years ago

--module-without-init

I don't think this is a good idea. The issue sounds more to be with Bazel than Mypy, and I think it makes more sense to modify the build script until Bazel fixes this.

Precompute the cache files for builtins and typeshed, so we never need to go looking for source there.

I really, really like this idea. The typeshed information should be immutable therefore this makes a lot of sense. If someone is using a custom typeshed directory, it might get a bit more complicated, but in the general case and for releases, this will be really nice. Mypy currently spends 1.5 seconds on an empty file, so I do hope there are ways to improve startup time.

JukkaL commented 7 years ago

Some thoughts below. Note that this our internal team priority is to speed up mypy though a daemon mode instead of a build tool like bazel. Our review bandwidth may be lower than normal until the end of year, so there's a risk that bigger PRs in particular will be slow to review. It's better if we can first reach agreement on the design of new features at an abstract level before looking at PRs, as design reviews are much less work (and can save also implementation work in case the design gets iterated on a lot).

Speed up mypy's startup time. There might be low-hanging fruit?

I have a few ideas that may be helpful:

Make sure you don't compile the .pyc/.pyo files during each mypy run.
Remove type annotations in the released mypy version. Initializing annotations takes some time. After the recent optimizations the potential savings might be marginal, though.
Profile the time needed to evaluate the top level of each module in mypy, and see if there are things that are slower than expected. Named tuples are known to slow down program startup, for example.
See if a lot of code gets imported that isn't actually needed for most runs, and see if they can be lazily imported. Running code coverage might be an easy way to do this.

The non-bazel answer is to run mypy as a daemon and to use a small client that starts up very quickly. However, this probably is not a good match for bazel (though I've heard that it's possible). A related idea is to somehow keep a fully initialized mypy process running and just fork it whenever a fresh run is needed. This would provide isolation between mypy runs.

Precompute the cache files for builtins and typeshed, so we never need to go looking for source there.

This can't happen in mypy itself, since the cache files depend on mypy options such as platform and target Python version. However, it may make sense to have a build artifact in your bazel setup for typeshed with bundled cache files, and to use it instead of the standard typeshed though --custom-typeshed-dir.

A --cache-hermetically option to stop writing mtimes

Before adding a large number of command-line options, I'd like to see the output of mypy --help cleaned up so that it only shows the most commonly used options. We could point to the docs for information about other options, or perhaps we could have a separate help option for displaying verbose help with all options.

Implement some dependency pruning, so you don't even need to load all the cache files all the time.

Something like this could be pretty useful. Can you create a separate issue for this? This would be useful outside bazel as well.

gvanrossum commented 7 years ago

Sorry for the slow response!

Regarding --module-without-init: this is separately being proposed as --namespace-packages in a different issue: https://github.com/python/mypy/issues/2773. I'm thinking you'd like it to apply to Python 2 too? If so, please add a note there (the other org asking for this, Instagram, are on Python 3).

I'm not so worried about the proliferating number of command line flags -- at least, I don't think you need to pay for it before you can add new flags.

I think a --cache-hermetically flag would be pretty simple and self-contained, so go ahead and send us a PR for that. Do understand that it would only suppress the "mtime", not the option values contained in the meta files. (Actually it might also suppress the "path" field? Because the hash is the only things that matters.)

I'm not sure what the benefits of your propose --cache-mapping would be. IIUC this would be a dict giving e.g.

{"mypy.build": [".mypy_cache/3.6/mypy/build.meta.json", ".mypy_cache/3.6/mypy/build.data.json"],
 ...}

The contents of such a file (IIUC) should be entirely predictable given the values of --cache-dir and --python-version. But perhaps you care more about the "just believe the cache" part of the flag? Or perhaps that could be folded into --cache-hermetically? Hm, now I'm beginning to doubt my understanding of your proposed workflow.

I'm not sure what --ignore-module-prefix is supposed to do. What exactly do you mean by "module path"? The search path used to find module files, or the pathname of the module file?

JukkaL commented 7 years ago

I'm not so worried about the proliferating number of command line flags -- at least, I don't think you need to pay for it before you can add new flags.

Yeah, this should be up to the core team, since we are the ones who've added (or accepted) all the existing command-line options :-)

I have some vague ideas about improving mypy's startup time and cache deserialization speed, but I'm not yet sure if these are practical and when these will happen. Lazily loading only necessary dependent cache files of indirectly imported modules might be some relatively low-hanging fruit?

JukkaL commented 7 years ago

I said earlier:

Our review bandwidth may be lower than normal until the end of year

This may not actually need to be the case. I'm cautiously optimistic.

dgoldstein0 commented 7 years ago

any plan to actually do any of this work?

btw we've also seen the bazel-python problem around init.py and discussed it a bunch internally. My quick skim of the --namespace-packages proposal makes it sound like it'll solve our problem if we get py2 support for it.

Anyhow I hadn't looked into the internals of mypy at all and this proposal seems to cover that pretty well. From a bazel POV though, what we need is a way to invoke mypy in two (new?) modes:

give it a set of python and types files (perhaps .pyi) and generate a new type file output - so we can implement a build step that summarizes the types of a library, for it's dependencies to use
give it a set of python and types files of dependencies and typecheck the python - so we can typecheck a library, based on it's dependency libraries types

in addition to all the above stuff about making mypy hermetic and care less about existence of init files.

dgoldstein0 commented 7 years ago

and to clarify - if mypy has those two modes, I'd be willing to take a stab at writing the skylark to invoke mypy within bazel

gvanrossum commented 7 years ago

Would you be willing to also take a stab at implementing those two modes? We're pretty booked up with other perf work already.

dgoldstein0 commented 7 years ago

unfortunately I have no familiarity with the mypy codebase so I'm not comfortable committing to anything like that. I'm also pretty busy with my actual day job, but if you convince me it's not too hard then maybe I'd look into it. But jumping into a new codebase I have no confidence this would be quick.

On Wed, Nov 1, 2017 at 7:37 PM Guido van Rossum notifications@github.com wrote:

Would you be willing to also take a stab at implementing those two modes? We're pretty booked up with other perf work already.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/python/mypy/issues/3912#issuecomment-341300908, or mute the thread https://github.com/notifications/unsubscribe-auth/ABBnd8Ya4Adc3BJYCzug710MxIBdr0arks5sySrngaJpZM4PLGh4 .

JukkaL commented 7 years ago

give it a set of python and types files (perhaps .pyi) and generate a new type file output - so we can implement a build step that summarizes the types of a library, for it's dependencies to use

Using the existing JSON serialized cache files for this would be both faster and less work to implement, at the cost of not having an easily human-readable representation.

The mypy startup performance was again brought up by @sixolet last week (offline). Here pruning dependencies (either by not loading some dependencies at all or by loading them lazily) would likely be the low-hanging fruit. Other possible optimizations are harder:

Speed up importing the mypy implementation and dependencies. Pruning internal mypy dependencies might help a bit, and Python 3.7 might also improve this a little. Larger improvements will likely require somewhat drastic measures -- but these are still within the realm of possibility.
Make deserialization of JSON cache files faster. Again, this is non-trivial but conceptually it should be possible to make this much faster.

sixolet commented 7 years ago

I made a proof of concept a couple months ago. I still plan to turn it into real PRs as soon as I dig myself out of all my free time going to baby care :P

On Thu, Nov 2, 2017 at 4:58 AM, Jukka Lehtosalo notifications@github.com wrote:

give it a set of python and types files (perhaps .pyi) and generate a new type file output - so we can implement a build step that summarizes the types of a library, for it's dependencies to use

Using the existing JSON serialized cache files for this would be both faster and less work to implement, at the cost of not having an easily human-readable representation.

The mypy startup performance was again brought up by @sixolet https://github.com/sixolet last week (offline). Here pruning dependencies (either by not loading some dependencies at all or by loading them lazily) would likely be the low-hanging fruit. Other possible optimizations are harder:

Speed up importing the mypy implementation and dependencies. Pruning internal mypy dependencies might help a bit, and Python 3.7 might also improve this a little. Larger improvements will likely require somewhat drastic measures -- but these are still within the realm of possibility.

Make deserialization of JSON cache files faster. Again, this is non-trivial but conceptually it should be possible to make this much faster.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/python/mypy/issues/3912#issuecomment-341399717, or mute the thread https://github.com/notifications/unsubscribe-auth/ABjs4Xaygd4iq0Dh_FUt22dVRFbErhx1ks5sya58gaJpZM4PLGh4 .

gvanrossum commented 6 years ago

I am trying to wrap my head around this. I'm still reading up on bazel but I think this is going to be fun!

kamahen commented 6 years ago

A few thoughts from someone who has done something similar ...

The mypy cache could be specified as one of the outputs in the mypy aspect's action (and input(s) to other rules). Then Bazel would take care of it for you and re-run only what's necessary.

Bazel allows only one rule for creating an output. But your BUILD files probably have some rules that specify the same source files as other rules. There are a few solutions to this, but probably the easiest is to add the rule name to the output file (to make it unique) and have the Skylark create a mapping file that is input to mypy to map between the Bazel label and the actual file. You can see an example of this in pytype/imports_map_loader.py. Another way would be similar to how Java rules work with .class files.

@gvanrossum - trust me, this quickly becomes not-fun. However, the documentation is much better now than when I did this, and Skylark has far fewer bugs now (but be warned -- it is not Python).

gvanrossum commented 6 years ago

Here's a brief progress report. Things are not working yet.

I have made a mypy branch that supports a --bazel flag that does various things for Bazel. For now I've simply vendored this branch in the Dropbox repo where I'm experimenting with Bazel. I'm still hacking on the code because I haven't solved all problems yet.

I've also created a dropbox-specific Bazel aspect that runs mypy. I don't want to publish this yet since it's gross and doesn't work yet, but I've learned a lot.

Perhaps the key learning (besides confirming Peter's observation :-) is that I can't seem to create the mypy cache files alongside the source files, because Bazel doesn't just require that outputs live under the current package. There's a cross-compilation feature that puts all outputs in a different directory, and this takes the form <root>/<prefix>/<package>/<target>/<file> where <root> is some horrendously long Bazel directory (fortunately it's also the current directory :-), <prefix> is something indicative of the target platform (in our case it's bazel-out/k8-fastbuild/bin -- possibly this is a Dropbox-specific thing), <package> is the package name (the directory containing the BUILD file relative to the repo/workspace root), <target> is the name attribute of the target in the BUILD file, and <file> is whatever you want. Source files on the other hand live in <root>/<package>/<file>. I presume this architecture maximizes parallelism.

So suppose we have two deps <target1> and <target2> whose mypy cache output files are inputs to <target3>, the cache files are distributed across three directories, and there's no --cache-dir flag we can pass to mypy that changes this. I guess Naomi's original idea of having a cache file mapping is probably the simplest way to handle this. ~I haven't started writing code for this solution yet (the branch above was a hopeful attempt that just helped me debug the situation).~

dgoldstein0 commented 6 years ago

I don't follow your discussion entirely, but usually I just make sure that for my bazel rules, all paths that my tools have to deal with are passed either in a data file (json or newline delimited or whatever) or as explicit command line params. E.g. if I was building mypy support, I would try to pass the paths of all the current files to typecheck to mypy that way. And definitely I would try to avoid giving the mypy bazel action a folder instead of an explicit list of files.

At least that's my 2cents.

On Fri, Mar 16, 2018 at 6:32 PM Guido van Rossum notifications@github.com wrote:

Here's a brief progress report. Things are not working yet.

I have made a mypy branch https://github.com/gvanrossum/mypy/tree/bazel-flag that supports a --bazel flag that does various things for Bazel. For now I've simply vendored this branch in the Dropbox repo where I'm experimenting with Bazel. I'm still hacking on the code because I haven't solved all problems yet.

I've also created a dropbox-specific Bazel aspect that runs mypy. I don't want to publish this yet since it's gross and doesn't work yet, but I've learned a lot.

Perhaps the key learning (besides confirming Peter's observation :-) is that I can't seem to create the mypy cache files alongside the source files, because Bazel doesn't just require that outputs live under the current package. There's a cross-compilation feature that puts all outputs in a different directory, and this takes the form
//// where is some horrendously long Bazel directory (fortunately it's also the current directory :-), is something indicative of the target platform (in our case it's bazel-out/k8-fastbuild/bin -- possibly this is a Dropbox-specific thing), is the package name (the directory containing the BUILD file relative to the repo/workspace root), is the name attribute of the target in the BUILD file, and is whatever you want. Source files on the other hand live in //. I presume this architecture maximizes parallelism. So suppose we have two deps and whose mypy cache output files are inputs to , the cache files are distributed across three directories, and there's no --cache-dir flag we can pass to mypy that changes this. I guess Naomi's original idea of having a cache file mapping is probably the simplest way to handle this. I haven't started writing code for this solution yet (the branch above was a hopeful attempt that just helped me debug the situation). — You are receiving this because you commented. Reply to this email directly, view it on GitHub , or mute the thread .

kamahen commented 6 years ago

Bazel has no concept of "folder". You should always use File.path to translate between the Bazel label and the actual file that your program consumes.

gvanrossum commented 6 years ago

Well, I solved the cache problem by implementing a --cache-map feature. It seemed simplest to just pass the mapping on the command line rather than writing it to a file and reading it from a file -- Bazel has a "spill file" feature that can do that as needed (and mypy can read it).

Doing this I managed to actually check some targets and their transitive dependencies. (I even found some questionable code.)

I made less headway with the __init__.py file problem -- hacking mypy so that (with the --bazel flag) it doesn't insist on an __init__.py file in every package is not so easy, since this is relied on in multiple places. To make progress in the meantime, I wrote a terrible hack where a wrapper script just creates __init__.py files as needed before it runs mypy.

gvanrossum commented 6 years ago

@sixolet please have a look at my prototype for the mypy changes at #4759.

I've now solved the __init__.py problem in Skylark by using a macro that calls native.glob(['**/__init__.py']), plus some clever use of depset() in the aspect implementation. (I would show my code but it's too messy.)

gvanrossum commented 6 years ago

A similar solution for .pyi files also seems to work, even if it's not clear that .pyi files should be implicitly substituted in the srcs attribute of relevant rules, which is what I currently do -- similar as for __init__.py where it appears to be the Bazel convention.

kamahen commented 6 years ago

@gvanrossum If you want a review of the Bazel code, I can take a look. (Messiness seems to be a hallmark of Skylark code, so don't feel ashamed of it ... even copious documentation and design notes fail to help in my experience (I had my one of my integration designs reviewed by the Bazel and Skylark team and yet everyone missed a horrible flaw in it.)

What do you mean by "similar as for __init__.py where it appears to be a Bazel convention"? Are you referring to Bazel auto-creating a __init__.py file if it doesn't exist in a source directory? If so, that apparently was a very old thing at Google, and far from universally liked. I could probably point you to someone who knows more about the history and pluses/minuses of that "feature".

gvanrossum commented 6 years ago

Nah, I'm good, I have internal reviewers.

gvanrossum commented 6 years ago

Alas, it seems my pure Skylark solution for implicit __init__.py files does not work. Consider a package p with a subpackage q (i.e. the latter is p/q in the filesystem, or p.q in the module namespace). Inside p/q Bazel's glob() doesn't find p/__init__.py even if it exists. Hence with my hack the file p/q/foo.py is seen by mypy as representing q.foo rather than p.q.foo and then everything else goes wrong. Thus I think I will have to make the --bazel flag (or another flag) treat p/q/foo.py as representing p.q.foo even if p/__init__.py does not exist (and even if q/__init__.py does exist). I had previously tried to make this work with a quick hack and failed. I will now have to try again. (And risk a merge conflict when PEP 561 lands.)

kamahen commented 6 years ago

I don't see why the glob() doesn't find p/__init__.py. If you use the pattern **/__init__.py, won't you get a list of all the __init__.py files? (which you can then filter as needed)

gvanrossum commented 6 years ago

That was exactly the pattern I was using. But glob uses the current package (i.e. p/q) as its starting point. I think it's specified in item 7 here: https://docs.bazel.build/versions/master/be/functions.html#glob

kamahen commented 6 years ago

I think that ../**/__init__.py isn't allowed in a glob.

I recall having a lot of problems with the details of Python's import model, even when assuming that missing __init__.pys could be hallucinated into existence. (And assuming that there aren't any special import hooks.) Good luck. ;)

kamahen commented 6 years ago

You might try shifting the test to the action ... you can get a list of all the files that Bazel thinks are needed, and then perhaps apply some heuristics (because some of the files might be from the source tree and others could be generated in one of the bin or genfiles directories). You might be able to find the output directories from predefined make-variables

gvanrossum commented 6 years ago

Whew! #4759 has finally landed. I am hoping to open-source the Skylark code I wrote too, though I cannot support it (it's got some Dropboxisms).

matthen commented 6 years ago

Any updates on this project? Do you have any example working build rules?

gvanrossum commented 6 years ago

There are actually plans to release a bunch of Bazel work done at Dropbox (e.g. Go and Python rules and a tool for auto-generating dependencies), and this would be part of that. We don't have a specific timeline for the release, but we're aiming to do it before the end of 2018. Hope that helps! Rest assured that it is in production use at Dropbox.

thundergolfer commented 5 years ago

but we're aiming to do it before the end of 2018.

I'm looking to get Python + Bazel + MyPy going in my team, so jumping back in to check if there's been updates and generally what the state of this integration is.

gvanrossum commented 5 years ago

Sorry, we haven't been able to prioritize the public release of the Skylark files that define our Python targets, so alas, there are no public updates. Internally we still use the --bazel flag and related things, so it's not dead, but it's not easy to separate it from internals.

thundergolfer commented 5 years ago

Understandable. Thanks for the update.

thundergolfer commented 4 years ago

Jumping back in here to give a heads up that I've started an open-source implementation of this here → https://github.com/thundergolfer/bazel-mypy-integration

It's still very much in the pre-Alpha stage, but I've got the very basics happening. I have a lot to do in terms of fixing bugs, smoothing sharp edges, and giving it a proper exercising in a real-world Bazel monorepo, but it's a start!

Feedback and advice appreciated 🙏

JukkaL commented 4 years ago

Because of shifts in priorities, it seems unlikely that we'll be able to release our Bazel integration any time soon. Keeping this open as a low-priority issue since there is a lot of potentially useful discussion in case somebody wants to work on this.

kamahen commented 4 years ago

FWIW, I've come up with a way of running source analysis in parallel and caching the results, that gets a good speed-up and doesn't require a tool like Bazel. If there's any interest in this, I can try to write it up -- the code is straightforward but not well documented (I have some documentation if anyone wants to read it; but it needs editing by a good tech writer).

JukkaL commented 4 years ago

If the main motivation is performance, remote caching together with mypy-daemon will often give much faster mypy runs than Bazel integration (https://mypy.readthedocs.io/en/stable/additional_features.html#using-a-remote-cache-to-speed-up-mypy-runs).

JukkaL commented 4 years ago

(The level of performance obviously depends on various factors, and in some cases Bazel can be a performance win.)

thundergolfer commented 4 years ago

Keeping this open as a low-priority issue since there is a lot of potentially useful discussion in case somebody wants to work on this.

I have been working on it, and we've been using it at work for a few weeks now with no issues. https://github.com/thundergolfer/bazel-mypy-integration

If you see any problems with my implementation please do let me know 🙂.

python / mypy

Bazel build integration #3912