haskell / cabal

Official upstream development repository for Cabal and cabal-install
https://haskell.org/cabal

[META] Regarding the release CI runners: GitHub or the Haskell GitLab? #8674

Open Kleidukos opened 1 year ago

Kleidukos commented 1 year ago

Due to a host of issues that have been noted through continuous usage of the Haskell GitLab CI by @hasufell, he suggests that we discuss a migration from the GitLab runners to GitHub's own runners.

The proposal (originally here) is transcribed below:


Migrating tooling to github / providing github action runners to Haskell infrastructure

Current state

  1. GHCup relies on Gitlab CI for testing and releases
  2. GHCup maintainers also use Gitlab CI to build darwin M1 stack binaries
  3. HLS and Cabal rely on Gitlab CI for releases only
  4. projects like bytestring rely on very slow emulation on github actions or other semi-private runners

The problem

  1. GHC Gitlab is focussed on providing working GHC binaries and environment for GHC developers. That makes other projects or less popular systems (like FreeBSD) second class citizens. The current state of priorities was communicated here
  2. All runners on GHC Gitlab are custom and are often maintained by different parties. Getting support for broken runners is not always easy. In fact, I have a more direct line to Github staff (who are old Gentoo dev colleagues) than to GHC gitlab devops or runner maintainers. On Github, we would only maintain a subset of runners ourselves (e.g. darwin M1) instead of all of them.
  3. Maintaining two sets of CI configurations for HLS and Cabal is not only a maintenance burden, but also means that these projects don't regularly test their PRs and changes against ALL architectures, but only do so at release time (which may be too late). When I last asked (and actually proposed this), all those projects communicated that they don't want to migrate fully to Gitlab.
  4. Ideally, all core tooling and core libraries should have access to all relevant architectures that GHC supports... on Github actions.

The solution

  1. migrate GHCup fully to Github
  2. migrate release CI of HLS and Cabal to Github
  3. buy our own runners (for e.g. darwin M1) with Opencollective money from GHCup and HLS and with help from HF funding and maintain them across our teams
    • darwin M1 probably at 150$ per month per box (e.g. Macstadium)
    • AWS instances (FreeBSD and aarch64 Linux) for 730 hours (one month) somewhere around 50-150$ per box, depending on instance
    • initial cost may be 500-1000$ per month for a kickstart
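
As a rough illustration of how the per-box estimates above add up, here is a minimal sketch. The per-box prices are the ones quoted in the list; the fleet composition (how many boxes of each kind) is purely hypothetical and not part of the proposal, and the names `PRICE_PER_BOX` and `monthly_cost` are made up for this example.

```python
# Rough monthly cost range for a set of self-hosted runners.
# Per-box prices are the estimates quoted in the proposal; the box
# counts in `fleet` below are hypothetical, chosen only for illustration.

# (low, high) monthly price per box in USD
PRICE_PER_BOX = {
    "darwin-m1": (150, 150),         # e.g. Macstadium, per the proposal
    "aws-freebsd": (50, 150),        # depends on instance type
    "aws-aarch64-linux": (50, 150),  # depends on instance type
}


def monthly_cost(fleet):
    """Return (low, high) total monthly cost for {runner kind: box count}."""
    low = sum(PRICE_PER_BOX[kind][0] * count for kind, count in fleet.items())
    high = sum(PRICE_PER_BOX[kind][1] * count for kind, count in fleet.items())
    return low, high


# Hypothetical fleet: 2 M1 boxes, 1 FreeBSD box, 2 aarch64 Linux boxes.
fleet = {"darwin-m1": 2, "aws-freebsd": 1, "aws-aarch64-linux": 2}
low, high = monthly_cost(fleet)
print(f"{low}-{high}$ per month")  # prints "450-750$ per month"
```

With this made-up fleet the total lands close to the 500-1000$ kickstart range quoted above; a different mix of boxes would shift it accordingly.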

HF has indicated (but not promised) that they may help with resources or funding of this solution.

Side effects

Remarks

Matthew's concerns on environment and reproducibility

Matthew was concerned about not using gitlab environments (probably the ci docker images) to match the environment where GHC is built.

Note from Julian: However, we don't do that in HLS CI anyway, at the moment, only partly.

Zubin remarked we can switch CI to gitlab without switching source control

HLS and Cabal could theoretically use gitlab CI exclusively for both testing and releases and then have hooks and integration with github.

Note from Julian: that would only fix problem 3.

Mikolaj commented 1 year ago

There were scattered bits of discussion of this proposal on #hackage a couple of days ago involving @hasufell, @chreekat, @gbaz and me, but copy-paste from matrix is unreadable, so perhaps that should be screen-shot or, better, somebody on IRC could copy-paste?

Edit: never mind, the participants seem happy to continue without the copied backlog.

Mikolaj commented 1 year ago

Related: https://github.com/haskell/cabal/issues/8044

Edit: was wrong link; corrected.

gbaz commented 1 year ago

Thanks for pasting this! While a good start, it doesn't really cover the current status of the new infrastructure which ghcup now uses, nor have an arch-to-arch comparison of which arches are supported on both setups at the moment.

hasufell commented 1 year ago

Thanks for pasting this! While a good start, it doesn't really cover the current status of the new infrastructure which ghcup now uses, nor have an arch-to-arch comparison of which arches are supported on both setups at the moment.

I think comparing runners doesn't make any sense, because it does not reflect the end-user experience, which is the main issue.

Regardless, I tried to make a comparison. On gitlab, I simply filtered by runner tags. The tags are messy and confusing, so it's possible it doesn't show the correct picture.

Platform        Gitlab  Github  Remarks
x86_64-linux    13      +1000
x86_64-darwin   1       +1000
x86_64-windows  2       +1000
aarch64-linux   14      2
aarch64-darwin  5       2       these can also execute x86_64
x86_64-freebsd  ?       +1000   via Cirrus CI, which gitlab could use as well, theoretically

Two main things to note:

Mikolaj commented 1 year ago

Transcript of today's chat with @chreekat and @hasufell:

28/01/2023, 4:03:19 pm - mikolaj: chreekat: do you think you could find a moment at some point and advise re our gitlab config? Most of the builds fail and perhaps you would suggest a trivial fix or even add a line or two and solve all our problems? 
28/01/2023, 4:03:27 pm - mikolaj: it's this pipeline: https://gitlab.haskell.org/haskell/cabal/-/pipelines
28/01/2023, 4:03:42 pm - mikolaj: with this config: https://github.com/haskell/cabal/tree/master/.gitlab
28/01/2023, 4:03:58 pm - mikolaj: it was used for the cabal prerelease just prereleased
28/01/2023, 4:04:05 pm - mikolaj: with mixed effects
28/01/2023, 4:04:18 pm - mikolaj: meaning, it seems what builds, works, but not much builds
28/01/2023, 4:04:34 pm - mikolaj: thanks!
28/01/2023, 4:34:15 pm - chreekat: mikolaj: surveying the failed jobs, my guess is the following:

The two ARM Linux jobs (aarch64 and armv7) fail because of a missing glibc version. An oversight of the Docker image in use?

The i386 Alpine job failed because GHCUp can't find a GHC 9.2.3 that runs on that platform. Is it supported by GHC?

The FreeBSD job failed because the name of a package in Ports changed. The job sets up a fresh FreeBSD in Vagrant every time so it's sensitive to changes in the operating system like that. There may well be other problems there
28/01/2023, 4:34:45 pm - chreekat: That message originally had paragraphs, but I guess IRC happened to it 
28/01/2023, 4:40:22 pm - mikolaj: chreekat: thank you; is there a cheap way to fix those? E.g., can I copy-paste their respective GHC configs and be done or better link to them so that their updates propagate?
28/01/2023, 4:42:57 pm - maerwald (@maerwald:libera.chat): use 9.2.2
28/01/2023, 4:44:07 pm - maerwald (@maerwald:libera.chat): and then complain to GHC HQ for bad platform coverage
28/01/2023, 4:47:58 pm - mikolaj: maerwald: thank you
28/01/2023, 4:48:09 pm - mikolaj: chreekat: do you mind if I copy this to https://github.com/haskell/cabal/issues/8674 ?
28/01/2023, 4:48:51 pm - mikolaj: we have to decide whether to completely move away from gitlab or not, but it's really hard to decide without any person permanently responsible for our CI
28/01/2023, 4:49:00 pm - chreekat: mikolaj: sure, ping me as well so I get an email
28/01/2023, 4:49:14 pm - chreekat: I will have to look into this more on Monday :)
28/01/2023, 4:49:28 pm - mikolaj: thank you so much
28/01/2023, 4:50:13 pm - mikolaj: (re decision: because we'd be deciding for that volunteer that we hope to find, without that volunteer having any say, so not much point)
28/01/2023, 4:50:17 pm - maerwald (@maerwald:libera.chat): Man of Letters: my offer doesn't stand indefinitely btw. Once I'm done with HLS CI, I probably won't have capacity or motivation for this anymore.
28/01/2023, 4:50:30 pm - mikolaj: understood
28/01/2023, 4:54:27 pm - mikolaj: BTW, this was all spurred by https://discourse.haskell.org/t/cabal-3-10-last-call-for-contributions/5322/6?u=mikolaj  --- we need all platforms even for prereleases [edit: and nightlies once we have them]
28/01/2023, 4:54:31 pm - maerwald (@maerwald:libera.chat): Mac runners are a *major* headache btw. So one of the primary goals is to have as few platforms as possible self-hosted. There's no reason to do that other than "github doesn't provide it"

Bodigrim commented 1 year ago

I don't see any good reason to stay on GitLab. GHC runners can barely cope with the GHC workload itself, and despite all efforts I have yet to see an MR where I do not have to babysit and rerun flaky jobs. Until recently the major benefit of GitLab was access to ARM machines, but now thanks to @hasufell and @angerman github.org/haskell has them as well.

hasufell commented 1 year ago

Is this moving forward in any direction? Does cabal have any governing structure to make decisions?

All information is available, please make a decision.

chreekat commented 1 year ago

To share what I feel about the matter:

  1. I was asked to see what it takes to run Cabal release pipelines on GitLab. I was able to double the number of platforms built for Cabal 3.8.1. My aim was to build for every platform that GHC 9.6.1 is built for, and Centos7 is the only remaining holdout.

  2. While GHC CI is still unreliable, that's mostly because of GHC. I don't think GitLab is in that bad of a state.

  3. Recall that those ARM runners that are now available on github.org/haskell are the same runners used on GitLab, and thus have the same level of reliability.

  4. My professional intuition is that building Cabal releases on the same platforms as GHC releases is valuable.

  5. While Microsoft is a major sponsor of the Haskell Foundation, they're not paying us to use GitHub, and I have strategic concerns about relying heavily on systems outside our control when alternatives exist.

  6. Nonetheless, it looks like I'm the only person in the room with the ability and interest in supporting Cabal's CI on GitLab. Due to my bigger priorities, I can only promise ~1 day of work per release right now. Compared to the amount and quality of effort being put into GitHub CI, I am not sure it's the right decision to stick with GitLab.

I, also, am curious to know what decision is made and who has the power to make it.

hasufell commented 1 year ago

@chreekat GitHub provides virtually unlimited runners for most of the platforms. This allowed me in HLS to scale up the bindist builds for 5+ distros and 3+ versions of those and that PER GHC. In parallel.

One such build on gitlab would lock up all the existing Linux runners (around 50 I believe).

The point is to reduce the amount of self hosted runners as much as possible.

Mikolaj commented 1 year ago

@hasufell: this needs to be discussed (thank you for taking part). We are busy with 3.10 release. The governing structure is cabal devs (gathering fortnightly, e.g., today, details on #hackage) and the github repo maintainers, and the release manager makes the final decision for the particular release.

hasufell commented 1 year ago

The governing structure is cabal devs (gathering fortnightly, e.g., today, details on #hackage)

What's the list of cabal devs?

and the github repo maintainers

Who is that? I happen to be org admin, does that make me repo maintainer?

and the release manager

Who is that and how are they appointed? Just volunteers raising their hand?

Kleidukos commented 1 year ago

@hasufell The following teams are related to cabal:

Mikolaj commented 1 year ago

@hasufell: We are making it up as we go. I'm the release manager for 3.10 and the default release manager for the time being, but I hope somebody else picks up 3.12 or 3.10.2.0.

hasufell commented 1 year ago

@hasufell: We are making it up as we go. I'm the release manager for 3.10 and the default release manager for the time being, but I hope somebody else picks up 3.12 or 3.10.2.0.

So how many of your current and past release managers know gitlab well enough to work on its CI?

Mikolaj commented 1 year ago

None. (BTW, I'm the only one left.)

Mikolaj commented 1 year ago

FYI: We haven't arrived at a decision yet and we are still embroiled in the 3.10 release. However, in the meeting on Thursday, we expressed the desire to keep both CIs running for the foreseeable future. After the skilful intervention by @chreekat the gitlab CI is again viable for releases and has the advantage of matching the environment in which GHC releases are built, so we'd definitely like to keep it. Even if we switch to github CI for releases, for all platforms or only those that gitlab does not support, we'd still want to keep the gitlab CI as a safety net --- we all know that CI tends to have outages of various kinds and being able to compare CI logs or binaries from both is valuable, e.g., when a binary fails for a user.

@hasufell: thank you for your patience.

hasufell commented 1 year ago

and has the advantage of matching the environment in which GHC releases are built

I've said it before, this isn't really true, because you can have the same environment on any other CI:

And besides... there's no evidence this is even relevant for cabal. Even for HLS it isn't, imo, and we're not doing it anymore.

If GHC CI relies on mutable runner state to function correctly, then it's broken (some darwin caveats apply).

Mikolaj commented 1 year ago

Yes, I understand the theory. But in practice, often neither will match the GHC setup accurately and I expect gitlab to keep up more easily.

hasufell commented 1 year ago

But in practice, often neither will match the GHC setup accurately

Do you have examples of this claim?

Also, can you provide a proper technical reason why this would be relevant? Or is this just a "gut feeling"/mysticism/etc.?

Mikolaj commented 1 year ago

This is gut feeling and guesswork all the way.

hasufell commented 1 year ago

This is gut feeling and guesswork all the way.

Well, I'm confident it's irrelevant to cabal, because we've built and distributed GHCup, cabal, stack and HLS outside of GHC CI environment without any issues.

GHC CI has no magical powers, except for breaking very frequently.

chreekat commented 1 year ago

I've staunched the bleeding of the GitLab CI in #8818 . I hope it helps!

P.S. @hasufell I'm sorry you disagree with the decision the other Cabal maintainers made, and I generally agree with your sentiments, but reasonable people can disagree on which CI system to use for Cabal. Unless you think you can move GHC itself to GitHub, there will always be very good reasons to stay on GitLab.

hasufell commented 1 year ago

I'm sorry you disagree with the decision the other Cabal maintainers made

You misread my comments.

I have no strong opinion for cabal as a project, I was simply offering my expertise for migration.

I was rather commenting on a specific argument that seems to have emerged from an unknown location and fosters the belief that there's something mystical about GHC gitlab and only binaries forged in its cradle, close to mother GHC, will be worthy to be shipped to users... or so it sounds.

Please, let's stick to... well, technical arguments.

There is no reason to build cabal on the exact same environment as GHC was built.

If there was such a reason... that would be a humongous disaster! And it would mean we'll all burn and our binaries are non-portable.

Unless you think you can move GHC itself to GitHub, there will always be very good reasons to stay on GitLab.

I personally think there is no technical reason for any non-GHC project that has no release CI yet to pick GitLab over GitHub (I worded this carefully).

Migrating GHC to GitHub seems like a great idea to improve collaboration and developer experience, but I'm afraid I'm not gonna offer my help with that, unless I get paid very well. I also think it's totally orthogonal to this discussion.

chreekat commented 1 year ago

There is no reason to build cabal on the exact same environment as GHC was built.

Well, I'm not the one saying this is so, and it's not why I support staying on GitLab. My "technical" argument is that GHC is on GitLab. That's, like, the entire thing.

But this isn't my fight so I will abstain from further comment.

Bodigrim commented 1 year ago

I usually build Cabal executables myself and AFAICT there is no special environment or system dependencies needed. Cabal is a pure Haskell application. So GHC build environment is irrelevant at best (and hiding portability issues at worst).

Two semi-functional CIs are not equal to one properly working CI, and from the discussion above it seems that the Cabal team has expertise in neither of them. E.g., even a complete migration to GitLab would be more sustainable than testing PRs on one platform, but building releases on another.

That said, this is not my fight either; JFTR, I share @hasufell's bewilderment.

gbaz commented 1 year ago

"I usually build Cabal executables myself and AFAICT there is no special environment or system dependencies needed"

The issue is simply that different linuxes may have different glibcs or the like, and building cabal on the exact same version avoids this. More generally, ghc targets its builds for a specific set of OSes and architectures. Staying as close as possible to ghc makes it logistically simpler to ensure that cabal targets that exact same set.

Nobody claims that this is impossible with a different CI setup -- it's simply administratively easier when it's all in one place, and everything is changed at once.

As far as I can tell, the concerns that might drive a switch basically stemmed from a period in which ghc team support seemed unresponsive or hard to deal with. Experience since has shown that the current maintainers of ghc ci infrastructure are extremely helpful and responsive, and do have time to devote.

Technical issues aside, I don't see any reason to take a headlong leap from a system where we are getting support (including partial time given by people employed to keep CI running for GHC) to another system where such support is entirely going to be on a volunteer basis. It seems less sustainable in the long term.

All of us would welcome work on the github CI, including migrating it to use more of the runners that already are part of haskell github initiatives. I don't think it makes sense to make this work conditional on necessarily planning to abandon the existing gitlab CI for releases. People that want to work on building out and improving the github CI are free to do so right now, and their efforts would be praised and endorsed and warmly welcomed.

hasufell commented 1 year ago

The issue is simply that different linuxes may have different glibcs or the like, and building cabal on the exact same version avoids this.

No. Cabal is built on alpine linux for x86_64 and fully statically linked against musl.

So this is inaccurate.

For the platforms where it matters (e.g. aarch64 linux), you simply need a deb10 docker container, not GHC CI environment.

gbaz commented 1 year ago

For the platforms where it matters (e.g. aarch64 linux), you simply need a deb10 docker container, not GHC CI environment.

At this moment, you need the deb10 container. At a prior moment, there was no deb10 and you needed a different container. At a future moment, we will need a different container again. I'm simply saying that keeping stuff all in one place for releases means that for releases these choices will be easier to keep synchronized over time.

hasufell commented 1 year ago

At this moment, you need the deb10 container. At a prior moment, there was no deb10 and you needed a different container. At a future moment, we will need a different

Again: I maintained the gitlab cabal CI configuration.

First of all: you have this potential problem regardless of the CI system. I'm puzzled. We're not running cabal CI through GHC's CI scripts. There's no automatic synchronization. What would that even look like? I can't imagine.

You need to understand the requirements of building and shipping cabal. If you don't, GitLab will not save your day.

Secondly, for the given example of linux aarch64 you'll be able to stick to deb10 as long as it's not EOL (and longer with older GHCs). That's when the next GHC release will stop producing bindists for deb10. You'll notice because your build simply fails, since GHC can't be invoked with such a new GLIBC. There won't be any miscompilation or problematic linking.
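
To make the failure mode described above concrete, here is a minimal sketch of the usual glibc rule of thumb: a dynamically linked binary built against glibc X can generally be loaded on hosts with glibc X or newer, but not on older ones. The function names are hypothetical and the comparison is a simplification (real loaders check individual versioned symbols). Debian 10 ships glibc 2.28; the 2.31 figure is only an illustrative "newer" version.

```python
# Sketch of the glibc rule of thumb: a dynamically linked binary built
# against glibc X generally runs on hosts whose glibc is >= X, and fails
# to start on hosts with an older glibc. This is a simplification.


def parse_version(text):
    """Turn '2.28' into (2, 28) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def can_run(built_against, host_glibc):
    """True if a binary linked against `built_against` should load on the host."""
    return parse_version(host_glibc) >= parse_version(built_against)


DEB10_GLIBC = "2.28"   # glibc shipped by Debian 10
NEWER_GLIBC = "2.31"   # illustrative newer version

# A binary built in a deb10 container, run on a newer host: loads.
print(can_run(DEB10_GLIBC, NEWER_GLIBC))  # True

# A GHC bindist built against a newer glibc, invoked inside deb10: fails
# to start, so the CI job breaks loudly rather than miscompiling.
print(can_run(NEWER_GLIBC, DEB10_GLIBC))  # False
```

The second case is the one the paragraph above refers to: once GHC bindists require a newer glibc than the container provides, the job fails at the first GHC invocation.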

So I'd really urge the cabal or release maintainers to invest time in understanding these things and stop hoping that sticking to GitLab will absolve them from doing so.

Mikolaj commented 1 year ago

We are still handling the fallout of the 3.10 release, but let me just hint at the new ideas from a chat by @Kleidukos, @chreekat, @hasufell and me that could help us move forward with gradually upgrading cabal CI to ghcup standards, but at the same time preserve cabal maintainers' ability to gracefully degrade to lower standards in case of infrastructure failures and churn in cabal or other teams (we've had a fair share of both kinds of these disasters in recent years, hence the insistent unpleasant feeling in our guts at the thought of one-way big technical leaps).

The aforementioned high ghcup technical standards include, AFAICT, better tools for ad-hoc binary builds, at least 3 more platforms, testing the binaries that get released, supplying users with tests they can perform on their own machine (WIP) and certainly more.

The ideas for a gradual reform, AFAICT, are to improve the situation from 3 angles, before doing the big jump to releasing from github CI (if we decide to do so, while cheaply maintaining gitlab as the backup option, as long as it's cheaply possible). The 3 angles are the existing gitlab cabal CI (binary builds), existing github cabal CI (tests), and the GHC gitlab CI (upstream support for more platforms). The goal of the improvements to the two cabal CIs would be, in particular, to make them less dependent on whether they run on gitlab or github and easier to use ad hoc. Eventually, the "big jump" to github would no longer be so big and would be reversible and component-wise degradable (to ad-hoc manual builds at worst) in case of any natural disasters.

Please correct any inaccuracies and please discuss.

Edit: Related to the GHC angle: https://gitlab.haskell.org/groups/ghc/-/epics/5