aherrmann opened this issue 4 years ago
We could solve that by maintaining maps from GHC (version, platform) -> partial file list. Even host that on a server somewhere if not in rules_haskell itself. After all, we're already doing something similar by hardcoding the SHAs of each bindist we know about. Or we could maintain pre-configured and relocatable bindists, and have the bindist rules use those instead.
> We could solve that by maintaining maps from GHC (version, platform) -> partial file list. Even host that on a server somewhere if not in rules_haskell itself. After all, we're already doing something similar by hardcoding the SHAs of each bindist we know about.
Yes, hosting these in rules_haskell itself is what I had in mind with

> One approach would be to let the installation step run in a mode where it produces only metadata: the list of files, package dependencies, etc. (essentially what `pkgdb_to_bzl.py` determines). This metadata could then be checked in (similar to a lock file) and be used to predict the outputs of the installation step going forward. We could ship the lock files for the Linux and macOS GHC bindists that are supported by rules_haskell as part of rules_haskell.
Hosting them separately is an interesting idea. That would avoid bloating the repo too much. We could amend the bindist mapping with URLs and hashes of these hosted metadata files.
> Hosting them separately is an interesting idea. That would avoid bloating the repo too much. We could amend the bindist mapping with URLs and hashes of these hosted metadata files.
The only downside is that it introduces an extra point of failure. But the bindists are downloaded from the network anyways, so requiring another network download for metadata should be ok.
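Concretely, amending the mapping might look something like this (a hypothetical Starlark sketch; the entry structure, field names, and hosting URL are made up, not the actual rules_haskell mapping):

```python
# Hypothetical extension of the bindist mapping: next to the bindist URL
# and sha256 that are already hardcoded today, each entry gains the URL
# and hash of a hosted metadata file describing the installed contents.
GHC_BINDIST = {
    "8.8.2": {
        "linux_amd64": struct(
            bindist_url = "https://downloads.haskell.org/~ghc/8.8.2/ghc-8.8.2-x86_64-deb9-linux.tar.xz",
            bindist_sha256 = "...",  # elided; already known today
            metadata_url = "https://example.com/metadata/ghc-8.8.2-linux_amd64.json",
            metadata_sha256 = "...",  # elided; pins the extra download
        ),
    },
}
```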
How much bloat are we talking here? A few Mbs should be totally feasible, since the rule users only fetch the tarball, not the complete git history.
> How much bloat are we talking here? A few Mbs should be totally feasible, since the rule users only fetch the tarball, not the complete git history.
Listing the files of the installed GHC 8.8.2 bindist requires about 500k; adding metadata like package dependencies shouldn't take much more. rules_haskell currently lists bindists for 15 GHC versions on two Unix OSs (we don't need this for Windows). That adds up to ~15M (30 file lists × ~500k). The rules_haskell repository currently takes up 3M uncompressed, so that would be a pretty large increase in size.
If we gzip the file list we get down to ~40k, let's say 50k to be conservative. Then this adds up to 1.5M (30 × ~50k). That's still a pretty large increase compared to the current size of rules_haskell, but much less bad. We could also consider pruning the list of GHC bindists to further reduce the size; currently it goes back all the way to 7.10.3.
Interesting numbers. Ideally we wouldn't need to list all files, just know where they are given the version number and the platform. Perhaps this could be achieved with `ctx.actions.declare_directory()`. The problem with the latter is that it can't overlap with `ctx.actions.declare_file()` targets, which we need for each of the individual ghc commands we want to call. But maybe those could be symlinks to somewhere inside the declared directory (as noted in @aherrmann's previous comment, Windows isn't necessary anyways).
Another thought that came to mind - if Windows doesn't need `./configure`, why does Linux/macOS? We could pre-configure relocatable bindists and host them on GitHub or elsewhere, i.e. bypass the official bindists entirely.
> Ideally we wouldn't need to list all files, just know where they are given the version number and the platform. Perhaps this could be achieved with `ctx.actions.declare_directory()`. The problem with the latter is that it can't overlap with `ctx.actions.declare_file()` targets, which we need for each of the individual ghc commands we want to call. But maybe those could be symlinks to somewhere inside the declared directory
We also need libraries for `CcInfo` and `.haddock` files for `HaddockInfo`. These can share directories with `.hi` or `.html` files respectively. I'm not sure if it's worth the additional complexity.
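For illustration, here is a minimal Starlark sketch of the tree-artifact-plus-symlinks idea — a hypothetical `ghc_install` rule, not rules_haskell's actual implementation:

```python
# Hypothetical sketch: the unpacked bindist lives in a single tree
# artifact, and the individual tools that later rules need as plain
# File objects are emitted as symlinks into that tree.
def _ghc_install_impl(ctx):
    # One tree artifact covering the whole installed bindist, so we do
    # not have to predict the full file list up front.
    install_dir = ctx.actions.declare_directory("ghc-install")
    ctx.actions.run_shell(
        inputs = [ctx.file.bindist_tarball],
        outputs = [install_dir],
        command = "tar xf {tar} --strip-components=1 -C {out}".format(
            tar = ctx.file.bindist_tarball.path,
            out = install_dir.path,
        ),
    )

    # Each ghc command becomes a declared file that symlinks into the
    # tree artifact. Note: absolute symlinks like this are exactly the
    # kind of thing that may break under sandboxing/remote execution.
    tools = []
    for tool in ["ghc", "ghc-pkg", "haddock", "hsc2hs"]:
        out = ctx.actions.declare_file("bin/" + tool)
        ctx.actions.run_shell(
            inputs = [install_dir],
            outputs = [out],
            command = 'ln -s "$(realpath {dir})/bin/{tool}" {out}'.format(
                dir = install_dir.path,
                tool = tool,
                out = out.path,
            ),
        )
        tools.append(out)

    return [DefaultInfo(files = depset([install_dir] + tools))]

ghc_install = rule(
    implementation = _ghc_install_impl,
    attrs = {"bindist_tarball": attr.label(allow_single_file = True)},
)
```

Whether such symlinks behave well across sandboxes and remote executors is the open question this approach would need to answer.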
> Another thought that came to mind - if Windows doesn't need `./configure`, why does Linux/macOS? We could pre-configure relocatable bindists and host them on GitHub or elsewhere, i.e. bypass the official bindists entirely.
That's a very good question. If it's possible to pre-configure relocatable bindists for Unix, then that sounds like a great solution. It may also be worth asking why the upstream GHC bindists aren't configured that way in the first place.
I asked the question on ghc-dev@: https://mail.haskell.org/pipermail/ghc-devs/2020-August/019126.html.
The thread seems to indicate that having upstream ship relocatable bindists with a much simpler configure step that only writes out the `settings` file is doable. I think we ought to work with upstream to solve the problem at that level. (See the golden rule of software quality.)
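If upstream did ship such bindists, the remaining configure step could plausibly become an ordinary build action. A minimal Starlark sketch of that idea, assuming a checked-in `settings` template — the template label and substitution keys below are made up, and this is not rules_haskell code:

```python
load("@bazel_tools//tools/cpp:toolchain_utils.bzl", "find_cpp_toolchain")

def _ghc_settings_impl(ctx):
    # Regular rules (unlike repository rules) see Bazel's toolchain
    # resolution, so the resolved CC toolchain can be baked into the
    # settings file instead of whatever ./configure finds in the env.
    cc = find_cpp_toolchain(ctx)
    settings = ctx.actions.declare_file("lib/settings")
    ctx.actions.expand_template(
        template = ctx.file._template,  # hypothetical checked-in template
        output = settings,
        substitutions = {
            "%{cc}": cc.compiler_executable,
            "%{ld}": cc.ld_executable,
        },
    )
    return [DefaultInfo(files = depset([settings]))]

ghc_settings = rule(
    implementation = _ghc_settings_impl,
    attrs = {
        "_template": attr.label(
            default = "//haskell:settings.tmpl",  # hypothetical
            allow_single_file = True,
        ),
        "_cc_toolchain": attr.label(
            default = "@bazel_tools//tools/cpp:current_cc_toolchain",
        ),
    },
    toolchains = ["@bazel_tools//tools/cpp:toolchain_type"],
)
```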
> I think we ought to work with upstream to solve the problem at that level. (See the golden rule of software quality.)
I love how this blog post is already being linked in discussions like these; I had the same thought when I read Andreas’ comment above.
Though https://mail.haskell.org/pipermail/ghc-devs/2020-August/019128.html makes me suspect that until Hadrian hits we can gain a lot by shipping our own pre-extracted binary releases for now, just push to an s3 bucket from CI?
Especially since upstream is worrying about super-edge cases like AIX, and afaik none of the users of rules_haskell uses AIX (I didn’t even know this unix variant still existed tbh).
Has this issue been abandoned? It's almost two years since the last update.
It takes our CI (circleci+runners on GCP, ubuntu) over a minute to fetch the compiler. Since every CI job is run on a new instance, we pay that penalty several times an hour. Can you suggest a workaround, even if it is only for linux?
@symbiont-ji It's something that we would still like to implement, and I think it would be an important feature. However, it's also a non-trivial issue.
> Can you suggest a workaround, even if it is only for linux?
Many rules_haskell users on Linux use Nix to provision GHC. In that case this is not an issue. I'm not aware of a workaround as such, short of resolving this issue or https://github.com/tweag/rules_haskell/issues/1320.
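For context, the Nix route looks roughly like this in a `WORKSPACE` file (the version and nixpkgs attribute are illustrative; check the rules_haskell docs for the exact API of your release):

```python
load(
    "@rules_haskell//haskell:nixpkgs.bzl",
    "haskell_register_ghc_nixpkgs",
)

# GHC is provisioned from a pinned nixpkgs snapshot instead of a
# bindist, so no ./configure step runs inside a repository rule.
haskell_register_ghc_nixpkgs(
    version = "8.8.2",
    attribute = "haskell.compiler.ghc882",
    repository = "@nixpkgs",
)
```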
The GHC bindists generated by the new Hadrian build system are relocatable. Meaning, they are almost ready to use after unpacking the tarball, without the need to run `./configure && make && make install`. This suggests a much simpler path forward than trying to fit the `./configure && make && make install` steps into a regular Bazel build action.
However, there are a few limitations to this at the moment:

- GHC requires a `settings` file to function, which is not included in the bindist, but is generated by the configure script and Makefile. For a functioning Bazel integration, we must generate this file.

I am currently trying to replace the bindist by one built using Hadrian (which is supposed to be relocatable).
I built the binary and created a tarball with exactly the same structure as the Windows tarball distributed on the GHC download page.
My first attempt was to host this tarball on Google Drive, but I did not manage to get a direct download link for it, so I am currently testing with a local tarball on my computer.
I currently have an issue when using it: the haddock docs are accessed with a `..` path, and going up to a parent directory is not allowed in a Bazel target. The precise error message is:
```
ERROR: /home/guillaume/.cache/bazel/_bazel_guillaume/0b46565eb2a80f3140a716a5798b5112/external/rules_haskell_ghc_linux_amd64/BUILD:18:15: @rules_haskell_ghc_linux_amd64//:text: invalid label '../docs/html/libraries/text': target names may not contain up-level references '..'
```
As far as I have understood, the function responsible for generating this address is this call: https://github.com/tweag/rules_haskell/blob/f3ff3b2bb73a44f752d124420ef30484905cf28b/haskell/private/pkgdb_to_bzl.py#L108. But even changing the `pkgroot` does not change its result, so I have to investigate further: https://github.com/tweag/rules_haskell/blob/f3ff3b2bb73a44f752d124420ef30484905cf28b/haskell/private/pkgdb_to_bzl.py#L27
This function seems to replace the provided `pkgroot` with the `topdir`, which is supposed to be the `lib/` folder according to the comment at the top of the file. This is consistent with the fact that one has to go out of the `lib/` folder to reach the `docs/` folder when looking for haddock files. But I do not understand why this issue does not occur with the binary distributed for Windows, which has exactly the same structure as the one I am trying to use now.
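To make the failure mode concrete, here is a tiny standalone reproduction of the path computation as I understand it (illustrative paths, not the actual `pkgdb_to_bzl.py` code):

```python
import os.path

# haddock-html lives under docs/, but paths are resolved relative to
# topdir, i.e. the lib/ folder, so the relative path must first climb
# out of lib/.
topdir = "ghc-8.8.2/lib"
haddock_html = "ghc-8.8.2/docs/html/libraries/text"

print(os.path.relpath(haddock_html, start=topdir))
# prints: ../docs/html/libraries/text
# which Bazel rejects: target names may not contain up-level
# references '..'
```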
**Is your feature request related to a problem? Please describe.**
The GHC bindist on Unix requires a `./configure && make && make install` step before the toolchain can be used. Currently, `haskell_register_ghc_bindists` executes this installation in a repository rule. This means that this step is not sandboxed and cannot be cached by Bazel (the repository cache is limited to downloads). Additionally, repository rules don't have access to Bazel's toolchain resolution, so we cannot point GHC to the correct CC toolchain during `./configure`, and instead it will find whichever toolchain is available in the environment.

**Describe the solution you'd like**
I would like the `./configure && make && make install` step to happen in a regular sandboxed build action. This would make the installation step cacheable and allow us to control which CC toolchain GHC discovers at this step.

**Describe alternatives you've considered**
Users can avoid these issues today by using a nixpkgs-provided GHC instead. Another approach would be to pull in a system-installed GHC from outside Bazel as described in https://github.com/tweag/rules_haskell/issues/1320. I think it would be good if rules_haskell supported both a hermetic `haskell_register_ghc_bindists` as well as a `haskell_register_ghc_host`.

**Additional context**
One difficulty is that we will need to predict exactly which files this installation step produces, due to the lack of dynamic dependencies in Bazel. This will depend on the platform and GHC version.

One approach would be to let the installation step run in a mode where it produces only metadata: the list of files, package dependencies, etc. (essentially what `pkgdb_to_bzl.py` determines). This metadata could then be checked in (similar to a lock file) and be used to predict the outputs of the installation step going forward. We could ship the lock files for the Linux and macOS GHC bindists that are supported by rules_haskell as part of rules_haskell. This way this approach would not change the current API.

Note, the GHC bindist on Windows does not require any `./configure` or `make` steps and is usable immediately after unpacking.
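As a rough illustration of the lock-file idea in the "Additional context" above, the checked-in metadata for one bindist might look like this (field names, versions, and structure are all hypothetical):

```python
# Hypothetical checked-in metadata for one (GHC version, platform) pair.
# It records just enough to predict the install step's outputs up front:
# the installed file list and the inter-package dependencies that
# pkgdb_to_bzl.py would otherwise discover at repository-rule time.
GHC_BINDIST_METADATA = {
    ("8.8.2", "linux_amd64"): {
        "files": [
            "bin/ghc",
            "bin/ghc-pkg",
            "bin/haddock",
            "lib/package.conf.d/base-4.13.0.0.conf",
            # ... the full installed file list, ~500k uncompressed
        ],
        "packages": {
            "base": {
                "version": "4.13.0.0",
                "depends": ["ghc-prim", "integer-gmp", "rts"],
            },
            "ghc-prim": {"version": "0.5.3", "depends": ["rts"]},
        },
    },
}
```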