haskell-actions / setup

Setting up GHC, cabal, stack on for Haskell-related CIs
MIT License
35 stars 12 forks source link

XDG: Broken cabal installation in macos 20240105 GHA runners #62

Open jasagredo opened 5 months ago

jasagredo commented 5 months ago

These macos runners come with existing XDG directories for cabal. Since the start of the runner, the following files are present:

In this haskell-actions/setup there is a deliberate choice (decided in haskell/actions#210) to create ~/.cabal/bin (here) in order to disable XDG, however, cabal will still think it is using XDG because of the existence of ~/.config/cabal. See this logic.

I do not know which one should be the path forward:

  1. This action relocates/deletes those files in the MacOS runners if they exist.
  2. This action deletes those files in all runners if they exist to disable XDG always.
  3. Users of this action should take care of this when using MacOS runners.
  4. The MacOS runners should not create those files in the first place.
  5. The logic in Cabal should be made more accurate to even deal with cases where ~/.cabal and ~/.config/cabal/config exist but not ~/.cabal/config (as is happening here).

The 20240105 image is still very new, only 9.79% of rollout progress as I'm writing this, but this still has resulted in CI pipelines failing for example in Cabal itself. 1 2 and more if you look at the Validate workflow.

hasufell commented 5 months ago

I told you it's not a good idea to enable/disable XDG based on directory existence :)

Mikolaj commented 5 months ago

There is one more alternative, due to

If the CABAL_DIR environment variable is set, the indicated directory will be used to store all Cabal-related files, as in previous versions.

(e.g., see https://github.com/haskell/actions/issues/210). So, a user of this action could set this variable and be safe.

I vote for all of above. I don't see a reason the choice should be made by the [edit: images or the] runner and when it is, let's revert it early.

andreasabel commented 5 months ago

There is one more alternative, due to

If the CABAL_DIR environment variable is set, the indicated directory will be used to store all Cabal-related files, as in previous versions.

So maybe simply the action should set CABAL_DIR at the start to point to ~/.cabal and then XDG would be totally disabled?

Mikolaj commented 5 months ago

So maybe simply the action should set CABAL_DIR at the start to point to ~/.cabal and then XDG would be totally disabled?

I think @jasagredo proposed that and it sounds fine. Then, if people want to switch their Haskell CI to XDG, creating the XDG directory won't work, they'd need to set the CABAL_DIR accordingly. A bit more of a hassle, but less likely to be done by accident, either by the CI users or by the container image creators.

hasufell commented 5 months ago

Doesn't seem like a good idea to set this variable. It'll cause more confusion.

Mikolaj commented 5 months ago

Doesn't seem like a good idea to set this variable. It'll cause more confusion.

What kind of confusion? By people that already switched their CI to XDG and it would switch their setup back? Are there any such brave souls? Or in the future, if the cabal manual does not mention the variable prominently enough (does it?)?

hasufell commented 5 months ago

What kind of confusion?

This changes the behavior of the action.

What is the interplay with users setting the env var somewhere in the action? Does the action pick it up or overwrite it? Will it now detect the store-dir properly?

https://github.com/haskell-actions/setup/blob/b1dc5f1786a382e9a2532d039e8c53cf56f21392/src/setup-haskell.ts#L77C1-L77C1

What is the interplay with users not setting the env var in the action?

How do users switch to XDG or away from XDG?


Avoid pre-setting more user-exposed env vars. Have sane defaults. Don't make decisions based on directory existence.

andreasabel commented 5 months ago

One problem we are facing with the XDG migration is that users might have hardwired the path ~/.cabal/store into their caching step. Failure of caching does not trigger failure of the action, so users might for a long time not notice that their caching of dependencies is non-functional when we switch this action to XDG based on a pre-installed XDG-cabal.

The goal is thus to force cabal away from XDG.

I don't see a way to achieving this goal (in our multi-cabal context) without setting CABAL_DIR or removing the XDG cabal directories, but I am happy to learn otherwise.

Of course, forcing cabal to use the old directory structure means that we hardcode it here in the action, which is an ugly thing to do.

@hasufell if you have an idea how to do it otherwise, I'd happy to learn, but please be concrete, as I am failing to understand how you envision this to work atm.

hasufell commented 5 months ago

One problem we are facing with the XDG migration is that users might have hardwired the path ~/.cabal/store into their caching step. Failure of caching does not trigger failure of the action, so users might for a long time not notice that their caching of dependencies is non-functional when we switch this action to XDG based on a pre-installed XDG-cabal.

I don't think it's this actions responsibility to keep up with every breakage that cabal or ghc cause.

That said, I think the current approach here isn't proper either:

https://github.com/haskell-actions/setup/blob/b1dc5f1786a382e9a2532d039e8c53cf56f21392/src/setup-haskell.ts#L73-L78

I would:

  1. stop creating any files or folders on unix and macOS
  2. set CABAL_DIR on windows only (I'm hoping cabal doesn't try to do XDG there)
  3. re-implement the logic to return/detect store-dir (I know cabal doesn't have a proper interface, but it can be done)
  4. tell users to use cabal-store for their caching instead of hardcoding paths?
  5. comment on cabal issue tracker that the current XDG logic was a mistake
Mikolaj commented 5 months ago

I'm absolutely fine to handle step 5 and announce anywhere and everywhere I've made a mistake. However, we still need volunteers to handle the other tasks, which includes a design discussion for store-dir, revisiting the XDG design discussion with the original author and others than took part in it and agreed on the design, the actual implementations, transition plan, etc. I don't think this is going to be ready any time soon.

For now, I'm guessing, haskell-actions still requires a way to work with the current XDG implementation and prevent both the caching problem Andreas mentions and the CI breakage we just experienced. Making sure (adding checks? adding rm -rf?) the XDG directories are not passed to the user is one way. Setting CABAL_DIR is another. I don't have an informed opinion on which is better.

Whatever the decision, I don't recommend migrating haskell-actions to XDG in the near future. The original plan was that the details of XDG are subject to change and that we count on feedback from brave users [edit: or new users that don't care, since XDG is the default for new setups] enabling it for their particular projects, not from tools or systems that'd move many users to XDG at once and at unawares. As expected, unforeseen problems emerge and that may not be the end of them.

athas commented 5 months ago

One way to simplify this is to remove the automatic backwards compatibility code in cabal and always use XDG unless CABAL_DIR is set.

hasufell commented 5 months ago

One way to simplify this is to remove the automatic backwards compatibility code in cabal and always use XDG unless CABAL_DIR is set.

This is going to cause even more problems. Please don't.

athas commented 5 months ago

I suggest setting CABAL_DIR=$HOME/.cabal. This will make cabal behave as it did before the XDG changes (including sensitivity to users further changing CABAL_DIR somewhere in their action). Using XDG can work fine in CI, but carries no advantages.

hasufell commented 5 months ago

Please open an issue at https://github.com/actions/runner-images/issues and investigate why XDG is enabled automatically.

It's possible that it has to do with brew doing shenanigans. The actual macOS ghcup provisioning does not enable XDG: https://github.com/actions/runner-images/blob/main/images/macos/scripts/build/install-haskell.sh

Maybe related: https://github.com/actions/runner-images/pull/8778

athas commented 5 months ago

If something runs cabal update along the way, that'd explain why the config file is created. I don't see that in those scripts, but I don't really understand how the infrastructure works.

hasufell commented 5 months ago

If something runs cabal update along the way

That's likely the installation of yq then, which depends on pandoc and hence ghc/cabal:


edit: well, scratch that, of course ghcup also runs cabal update: https://github.com/haskell/ghcup-hs/blob/55030d83da6d59196df28c0eb115cdf19ff20393/scripts/bootstrap/bootstrap-haskell#L823

athas commented 5 months ago

That's true, but Homebrew resets $HOME and such to a temporary directory when building a formula. It should not have any effect on the user's real home directory and cabal state. It certainly didn't when I (re-)wrote Homebrew's Haskell infrastructure a few years ago, but I no longer have a macOS system to test it. Homebrew's sandboxing during building is quite crude, and there may actually be a risk that the XDG change in cabal broke it.

Edit: oh yeah, that ghcup command is definitely the cause.

hasufell commented 5 months ago

oh yeah, that ghcup command is definitely the cause

What is the proper way of disabling XDG during cabal update (unless it's already enabled)?

athas commented 5 months ago

Same as any other cabal command: set CABAL_DIR or point CABAL_CONFIG to a config file that sets all directories explicitly.

hasufell commented 5 months ago

Same as any other cabal command: set CABAL_DIR or point CABAL_CONFIG to a config file that sets all directories explicitly.

Those are both poor options, imo.

I've decided to attempt this: https://github.com/haskell/ghcup-hs/pull/977

If that works correctly, then the github runners will fix themselves in the next deployment. It might make sense to add a test-case though to check that XDG is indeed disabled: https://github.com/actions/runner-images/blob/main/images/macos/scripts/tests/Haskell.Tests.ps1

hasufell commented 5 months ago

So it was probably the recommended tag bump from 3.6.x.x to 3.10.x.x that lead to this whole thing, because previously the bootstrap script would install 3.6 and then run cabal update with that old version.

athas commented 5 months ago

That seems like it has the exact same behaviour as setting CABAL_DIR=$HOME/.cabal. Am I missing something?

hasufell commented 5 months ago

That seems like it has the exact same behaviour as setting CABAL_DIR=$HOME/.cabal. Am I missing something?

The difference is that I don't set CABAL_DIR.

hasufell commented 5 months ago

https://github.com/haskell/ghcup-hs/pull/977 has been merged and deployed, which means the next time the github runners are deployed, they should pick up the change (which is every 2 weeks or so).

andreasabel commented 5 months ago

I have now also been bitten by this: https://github.com/haskell/alex/actions/runs/7680499933/job/20932609305?pr=257

Image: macos-12 Version: 20240119.1