tweag / rules_haskell

Haskell rules for Bazel.
https://haskell.build
Apache License 2.0

Hermetic GHC bindist on Unix #1393

Open aherrmann opened 4 years ago

aherrmann commented 4 years ago

Is your feature request related to a problem? Please describe.
The GHC bindist on Unix requires a ./configure && make && make install step before the toolchain can be used. Currently, haskell_register_ghc_bindists executes this installation in a repository rule. This means that this step is not sandboxed and cannot be cached by Bazel (the repository cache is limited to downloads). Additionally, repository rules don't have access to Bazel's toolchain resolution, so we cannot point GHC to the correct CC toolchain during ./configure; instead it will find whichever toolchain happens to be available in the environment.
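
For context, today the installation happens inside the repository rule, along these lines (a simplified sketch, not the actual implementation in haskell/ghc_bindist.bzl):

```python
# Simplified sketch of the status quo: ./configure && make install run
# inside a repository rule, so they are neither sandboxed nor cached by
# Bazel's action cache, and toolchain resolution is not available.
def _ghc_bindist_impl(repository_ctx):
    repository_ctx.download_and_extract(
        url = repository_ctx.attr.url,
        sha256 = repository_ctx.attr.sha256,
        stripPrefix = "ghc-" + repository_ctx.attr.version,
    )
    # Runs on the host, unsandboxed. ./configure picks up whichever CC
    # toolchain it finds in the environment, not the one Bazel resolved.
    repository_ctx.execute(["./configure", "--prefix", str(repository_ctx.path("."))])
    repository_ctx.execute(["make", "install"])

ghc_bindist = repository_rule(
    implementation = _ghc_bindist_impl,
    attrs = {
        "url": attr.string(),
        "sha256": attr.string(),
        "version": attr.string(),
    },
)
```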

Describe the solution you'd like
I would like the ./configure && make && make install step to happen in a regular sandboxed build action. This would make the installation step cacheable and allow us to control which CC toolchain GHC discovers at this step.
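
Sketched out, the installation would become a regular rule with a sandboxed action, something like the following (hypothetical; the rule name, its attributes, and the shell recipe are made up for illustration):

```python
load("@bazel_tools//tools/cpp:toolchain_utils.bzl", "find_cpp_toolchain")

def _ghc_configure_impl(ctx):
    # The CC toolchain now comes from Bazel's toolchain resolution rather
    # than from whatever happens to be installed on the host.
    cc_toolchain = find_cpp_toolchain(ctx)
    out = ctx.actions.declare_directory("ghc-install")
    ctx.actions.run_shell(
        inputs = [ctx.file.bindist],
        tools = cc_toolchain.all_files,
        outputs = [out],
        command = """
            set -euo pipefail
            ROOT="$PWD"
            tar xJf {bindist}
            cd ghc-*
            # NOTE: the compiler path is execroot-relative here; a real
            # implementation would need to absolutize or wrap it properly.
            ./configure --prefix="$ROOT/{out}" CC="$ROOT/{cc}"
            make install
        """.format(
            bindist = ctx.file.bindist.path,
            cc = cc_toolchain.compiler_executable,
            out = out.path,
        ),
    )
    return [DefaultInfo(files = depset([out]))]

ghc_configure = rule(
    implementation = _ghc_configure_impl,
    attrs = {
        "bindist": attr.label(allow_single_file = True),
        "_cc_toolchain": attr.label(default = "@bazel_tools//tools/cpp:current_cc_toolchain"),
    },
    fragments = ["cpp"],
    toolchains = ["@bazel_tools//tools/cpp:toolchain_type"],
)
```

Declaring the whole installation as one tree artifact would sidestep the output-prediction problem discussed below, at the cost of per-file granularity.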

Describe alternatives you've considered
Users can avoid these issues today by using a nixpkgs-provided GHC instead.

Another approach would be to pull in a system-installed GHC from outside Bazel, as described in https://github.com/tweag/rules_haskell/issues/1320. I think it would be good if rules_haskell supported both a hermetic haskell_register_ghc_bindists and a haskell_register_ghc_host.

Additional context
One difficulty is that we will need to predict exactly which files this installation step produces, due to the lack of dynamic dependencies in Bazel. This will depend on the platform and GHC version.

One approach would be to let the installation step run in a mode where it produces only metadata: the list of files, package dependencies, etc. (essentially what pkgdb_to_bzl.py determines). This metadata could then be checked in (similar to a lock file) and be used to predict the outputs of the installation step going forward. We could ship the lock files for the Linux and macOS GHC bindists that are supported by rules_haskell as part of rules_haskell. That way, this approach would not change the current API.
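
For illustration, the checked-in metadata for one (GHC version, platform) pair could look something like this (a hypothetical format, not an actual proposal for the schema):

```python
# Hypothetical lock-file content for one (GHC version, platform) pair,
# mirroring what pkgdb_to_bzl.py currently derives at installation time.
GHC_8_8_2_LINUX_AMD64 = {
    # Every file the ./configure && make install step produces, so the
    # outputs of the build action can be declared up front.
    "files": [
        "bin/ghc",
        "bin/ghc-pkg",
        "lib/ghc-8.8.2/settings",
        # ... several thousand further entries
    ],
    # Package metadata needed to generate BUILD files without querying
    # the package db at fetch time.
    "packages": {
        "base-4.13.0.0": {"depends": ["ghc-prim-0.5.3", "rts"]},
        # ...
    },
}
```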

Note that the GHC bindist on Windows does not require any ./configure or make steps and is usable immediately after unpacking.

mboes commented 4 years ago

We could solve that by maintaining maps from GHC (version, platform) -> partial file list. Even host that on a server somewhere if not in rules_haskell itself. After all, we're already doing something similar by hardcoding the SHAs of each bindist we know about. Or we could maintain pre-configured and relocatable bindists, and have the bindist rules use those instead.

aherrmann commented 4 years ago

We could solve that by maintaining maps from GHC (version, platform) -> partial file list. Even host that on a server somewhere if not in rules_haskell itself. After all, we're already doing something similar by hardcoding the SHAs of each bindist we know about.

Yes, hosting these in rules_haskell itself is what I had in mind with

One approach would be to let the installation step run in a mode where it produces only metadata: the list of files, package dependencies, etc. (essentially what pkgdb_to_bzl.py determines). This metadata could then be checked in (similar to a lock file) and be used to predict the outputs of the installation step going forward. We could ship the lock files for the Linux and macOS GHC bindists that are supported by rules_haskell as part of rules_haskell.

Hosting them separately is an interesting idea. That would avoid bloating the repo too much. We could amend the bindist mapping with URLs and hashes of these hosted metadata files.
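
For illustration, the amended mapping could look something like this (hypothetical field names; the metadata host is made up and the hashes are elided):

```python
# Hypothetical extension of the existing (version, platform) -> bindist
# mapping: next to each bindist URL and SHA256, record where to fetch the
# pre-computed installation metadata and the hash to verify it against.
GHC_BINDIST = {
    "8.8.2": {
        "linux_amd64": {
            "url": "https://downloads.haskell.org/~ghc/8.8.2/ghc-8.8.2-x86_64-deb9-linux.tar.xz",
            "sha256": "...",  # elided
            "metadata_url": "https://example.org/ghc-metadata/8.8.2-linux_amd64.json",  # hypothetical host
            "metadata_sha256": "...",  # elided
        },
    },
}
```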

mboes commented 4 years ago

Hosting them separately is an interesting idea. That would avoid bloating the repo too much. We could amend the bindist mapping with URLs and hashes of these hosted metadata files.

The only downside is that it introduces an extra point of failure. But the bindists are downloaded from the network anyway, so requiring another network download for the metadata should be OK.

Profpatsch commented 4 years ago

How much bloat are we talking about here? A few MB should be totally feasible, since rule users only fetch the tarball, not the complete git history.


aherrmann commented 4 years ago

How much bloat are we talking about here? A few MB should be totally feasible, since rule users only fetch the tarball, not the complete git history.

Listing the files of the installed GHC 8.8.2 bindist requires about 500k, adding metadata like package dependencies shouldn't take much more. rules_haskell currently lists bindists for 15 GHC versions on two Unix OSs (we don't need this for Windows). That adds up to ~15M. The rules_haskell repository currently takes up 3M uncompressed, so that would be a pretty large increase in size.

If we gzip the file list we get down to ~40k; let's say 50k to be conservative. Then this adds up to 1.5M. That's still a pretty large increase compared to the current size of rules_haskell, but much less bad. We could also consider pruning the list of GHC bindists to further reduce the size; currently it goes back all the way to GHC 7.10.3.

mboes commented 4 years ago

Interesting numbers. Ideally we wouldn't need to list all files. Just know where they are given the version number and the platform. Perhaps this could be achieved with ctx.actions.declare_directory(). The problem with the latter is that it can't overlap with ctx.actions.declare_file() targets, which we need for each of the individual ghc commands we want to call. But maybe those could be symlinks to somewhere inside the declared directory (and as noted in @aherrmann's previous comment, Windows isn't necessary anyway).
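
A sketch of that layout (hypothetical: in particular, whether declared file outputs may be symlinks into a sibling tree artifact, e.g. via ctx.actions.symlink with target_path, would need to be verified, and unresolved symlinks may need to be enabled explicitly):

```python
def _ghc_install_impl(ctx):
    # The bulk of the installation is one tree artifact: no need to know
    # the full per-version, per-platform file list in advance.
    install = ctx.actions.declare_directory("ghc-install")
    ctx.actions.run_shell(
        inputs = [ctx.file.bindist],
        outputs = [install],
        command = "tar xJf {} -C {} --strip-components=1".format(
            ctx.file.bindist.path,
            install.path,
        ),
    )
    # Individually declared files for the ghc commands we need to invoke,
    # created as symlinks pointing into the tree artifact. Note the
    # separate "bin/" prefix: declared files must not overlap with the
    # declared directory.
    ghc = ctx.actions.declare_file("bin/ghc")
    ctx.actions.symlink(output = ghc, target_path = "../ghc-install/bin/ghc")
    return [DefaultInfo(files = depset([install, ghc]))]

ghc_install = rule(
    implementation = _ghc_install_impl,
    attrs = {"bindist": attr.label(allow_single_file = True)},
)
```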

Another thought that came to mind: if Windows doesn't need ./configure, why do Linux/macOS? We could pre-configure relocatable bindists and host them on GitHub or elsewhere, i.e. bypass the official bindists entirely.

aherrmann commented 4 years ago

Ideally we wouldn't need to list all files. Just know where they are given the version number and the platform. Perhaps this could be achieved with ctx.actions.declare_directory(). The problem with the latter is that it can't overlap with ctx.actions.declare_file() targets, which we need for each of the individual ghc commands we want to call. But maybe those could be symlinks to somewhere inside the declared directory

We also need libraries for CcInfo and .haddock files for HaddockInfo. These can share directories with .hi or .html files, respectively. I'm not sure it's worth the additional complexity.

Another thought that came to mind: if Windows doesn't need ./configure, why do Linux/macOS? We could pre-configure relocatable bindists and host them on GitHub or elsewhere, i.e. bypass the official bindists entirely.

That's a very good question. If it's possible to pre-configure relocatable bindists for Unix, then that sounds like a great solution. If so, it may also be worth asking why the upstream GHC bindists aren't configured that way in the first place.

mboes commented 4 years ago

I asked the question on ghc-devs@: https://mail.haskell.org/pipermail/ghc-devs/2020-August/019126.html.

The thread seems to indicate that having upstream ship relocatable bindists with a much simpler configure step that only writes out the settings file is doable. I think we ought to work with upstream to solve the problem at that level. (See the golden rule of software quality.)

Profpatsch commented 4 years ago

I think we ought to work with upstream to solve the problem at that level. (See the golden rule of software quality.)

I love how this blog post is already being linked in discussions like these; I had the same thought when I read Andreas’ comment above.

Profpatsch commented 4 years ago

Though https://mail.haskell.org/pipermail/ghc-devs/2020-August/019128.html makes me suspect that, until Hadrian hits, we can gain a lot by shipping our own pre-extracted binary releases for now. Just push to an S3 bucket from CI?

Profpatsch commented 4 years ago

Especially since upstream is worrying about super-edge cases like AIX, and afaik none of the users of rules_haskell uses AIX (I didn’t even know that Unix variant still existed, tbh).

symbiont-ji commented 2 years ago

Has this issue been abandoned? It's almost two years since the last update.

It takes our CI (CircleCI + runners on GCP, Ubuntu) over a minute to fetch the compiler. Since every CI job runs on a fresh instance, we pay that penalty several times an hour. Can you suggest a workaround, even if it is only for Linux?

aherrmann commented 2 years ago

@symbiont-ji It's something that we would still like to implement, and I think it would be an important feature. However, it's also a non-trivial issue.

Can you suggest a workaround, even if it is only for linux?

Many rules_haskell users on Linux use Nix to provision GHC. In that case this is not an issue. I'm not aware of a workaround as such, short of resolving this issue or https://github.com/tweag/rules_haskell/issues/1320.
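
For completeness, that setup looks roughly like this in the WORKSPACE (the version and nixpkgs attribute path are just examples):

```python
load("@rules_haskell//haskell:nixpkgs.bzl", "haskell_register_ghc_nixpkgs")

# GHC arrives pre-built from the nixpkgs binary cache; no ./configure &&
# make && make install happens inside a repository rule.
haskell_register_ghc_nixpkgs(
    version = "8.10.7",
    attribute_path = "haskell.compiler.ghc8107",
    repository = "@nixpkgs",
)
```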

aherrmann commented 2 years ago

The GHC bindists generated by the new Hadrian build system are relocatable. That is, they are almost ready to use after unpacking the tarball, without the need to run ./configure && make && make install. This suggests a much simpler path forward than trying to fit the ./configure && make && make install steps into a regular Bazel build action.

However, there are a few limitations to this at the moment:

  1. As of now, Hadrian bindists are not yet published on the download page.
  2. The bindist still requires a settings file to function, which is not included in the bindist but generated by the configure script and Makefile. For a functioning Bazel integration, we must generate this file ourselves (see the sketch after this list).
  3. The bindist still requires patching invalid haddock paths. For a functioning Bazel integration, we must be sure to execute those steps.
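
For point 2, one option is to render the settings file ourselves, substituting the toolchain that Bazel resolved for the values that ./configure would otherwise detect. A rough sketch (the file is a list of key/value pairs in Haskell read syntax; only a few illustrative keys are shown, and the exact key set varies between GHC versions):

```python
# Rough sketch: generate GHC's lib/settings ourselves instead of relying
# on ./configure. The values to substitute come from Bazel's resolved CC
# toolchain; the real file contains many more keys than shown here.
def render_settings(cc_path, ld_path):
    entries = [
        ("C compiler command", cc_path),
        ("C compiler flags", ""),
        ("ld command", ld_path),
        ("ld flags", ""),
    ]
    return "[ " + "\n, ".join([
        '("{}", "{}")'.format(key, value)
        for key, value in entries
    ]) + "\n]\n"

# For example, from a repository rule:
#   repository_ctx.file("lib/settings", render_settings("gcc", "ld.gold"))
```
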
GuillaumeGen commented 1 year ago

I am currently trying to replace the bindist with one built using Hadrian (and supposed to be relocatable). I built the binary and created a tarball with exactly the same structure as the Windows tarball distributed on the GHC download page. My first attempt was to store this tarball on Google Drive, but I did not manage to get a download link for it, so I am currently working with a tarball on my computer. I currently have an issue when using it: the haddock docs are accessed via a .. path, and up-level references are not allowed in Bazel target names. The precise error message is:

ERROR: /home/guillaume/.cache/bazel/_bazel_guillaume/0b46565eb2a80f3140a716a5798b5112/external/rules_haskell_ghc_linux_amd64/BUILD:18:15: @rules_haskell_ghc_linux_amd64//:text: invalid label '../docs/html/libraries/text': target names may not contain up-level references '..'

As far as I have understood, the function responsible for generating this path is this call: https://github.com/tweag/rules_haskell/blob/f3ff3b2bb73a44f752d124420ef30484905cf28b/haskell/private/pkgdb_to_bzl.py#L108. However, even changing the pkgroot does not change its result, so I have to investigate further: https://github.com/tweag/rules_haskell/blob/f3ff3b2bb73a44f752d124420ef30484905cf28b/haskell/private/pkgdb_to_bzl.py#L27

This function seems to replace the provided pkgroot with the topdir, which is supposed to be the lib/ folder according to the comment at the top of the file. That is consistent with the fact that one has to go out of the lib/ folder and into docs/ when looking for haddock files. But I do not understand why this issue does not occur with the binary distributed for Windows, which has exactly the same structure as the one I am trying to use now.
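
To illustrate the suspected mechanism with made-up values: if the bindist records haddock-html relative to ${pkgroot} and the script substitutes the topdir (the lib/ folder) for it, normalizing the result escapes the repository root:

```python
import os

# Hypothetical reproduction of the suspected path computation; the
# concrete haddock-html value is made up for illustration.
topdir = "lib"  # pkgdb_to_bzl.py substitutes the topdir for ${pkgroot}
haddock_html = "${pkgroot}/../../docs/html/libraries/text"

resolved = os.path.normpath(haddock_html.replace("${pkgroot}", topdir))
# Prints "../docs/html/libraries/text": an up-level reference, which is
# exactly what Bazel rejects as a target name.
print(resolved)
```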