haskell / cabal

Official upstream development repository for Cabal and cabal-install
https://haskell.org/cabal
Other
1.61k stars 691 forks source link

Interrupting cabal-3.8.0.0.20220526 and master with ctrl+c makes it hang on windows #8208

Open jneira opened 2 years ago

jneira commented 2 years ago

I am experiencing some weird behaviour interrupting cabal rc execution with ctrl+c in windows: Doing a ctrl+c in the Configuring component for my package, with ghc-pkg subprocesses, hangs the program with rc1

I cant reproduce with 3.6.0.0

Mikolaj commented 2 years ago

Possibly introduced by #7921, even though it says "on unix systems".

could you try building cabal with process 1.6.14 or later? building with GHC 9.2 may include that by default I don't remember which GHC gitlab CI uses and so which which GHC the cabal binaries have been built nor which GHC ghcup uses to build RC1 cabal anyway, that's a long shot, process < 1.6.14 had some faults, but possibly not as grave as hanging

jneira commented 2 years ago

cabal in gitlab is built with ghc-8.10.7 iirc

jneira commented 2 years ago

I checked that pr and I did not get any hang, tests could be wrong though https://github.com/haskell/cabal/pull/7921#issuecomment-1064897302

And a confirmation of the bug by another windows user would be great, i will try again with other projects myself

Mikolaj commented 2 years ago

Yes, dear Windows users, please reproduce.

robx commented 2 years ago

The "on unix systems" is just to say that the change isn't expected to necessarily fix things on Windows. It definitely does touch Windows too, and there's a decent chance of something like this slipping through...

One factor that might come into play is that GHC has two different Windows I/O managers, which behave differently with respect to types and sub-processes, see e.g. here https://github.com/haskell/process/issues/235. It might be worth checking if that is involved here. (I don't really know which GHC version uses which I/O manager by default and/or whether they're switchable...)

jneira commented 2 years ago

in my test I used ghc-8.10.7, which did not have the new io manager cabal release itself has been built with ghc-8.10.7 afair that new io manager is still not the default in ghc-9.2

Mikolaj commented 2 years ago

FYI: I'm getting "waitForProcess: does not exist (No child process)" when interrupting long running cabal test, on an ancient Ubuntu. I don't remember getting that before 3.8, but my memory is unreliable. However, this is obviously better than accumulating half-dead processes.

We still have to decide whether to revert for 3.8, e.g., waiting until the new GHC IO fixes the problem or until somebody fixes it differently or until enough Windows users can't reproduce it and so a fluke on a single Windows system is a valid suspicion.

An argument for not reverting is that we only got one report of the problem for the pre-release. However, not many Windows users are early adopters so that may be a sampling bias.

jneira commented 2 years ago

We even are not sure what is the cause (if it is reproducible at all!). Small evidence to perform any action or delay de release imho. I am afraid only the release will give us more feedback 🤷

robx commented 2 years ago

It's a bit tangential to this particular issue, but how about generally building releases with GHC 9.2? That would address those "no child process" errors for the released cabal versions at least.

Regarding Windows issues in general, it'd be great if there were some easy way for non-Windows devs to get access to a Windows dev environment. Is there some organisation that could e.g. provide access to Windows VMs to contributors of core Haskell tooling?

Mikolaj commented 2 years ago

@hasufell: in your experience, is GHC 9.2.3 stable enough to build cabal 3.8.1 with it (that ghcup would later distribute)? I'm in favour, to avoid both of stale processes and "waitForProcess: does not exist (No child process)". I suppose changing GHC_VERSION in .gitlab-ci.yml would be enough to effect the change?

If anybody has a clue about free Windows VMs for cabal devs, please let us know. @bgamari: do you think HF or GHC HQ may have any?

hasufell commented 2 years ago

in your experience, is GHC 9.2.3 stable enough to build cabal 3.8.1

I don't use 9.2.3 actively, so I don't know.

Mikolaj commented 2 years ago

But don't your users complain to you about particular GHCs and demand their money back?

hasufell commented 2 years ago

But don't your users complain to you about particular GHCs and demand their money back?

Constantly.

marcus

Mikolaj commented 2 years ago

I take it as GHC 9.2.3 not being particularly maligned by the users, so let's try #8271.

Mikolaj commented 2 years ago

@jneira: I've compiled branch 3.8 with GHC 9.2.3. Could you try if your problem persists? https://gitlab.haskell.org/haskell/cabal/-/pipelines/54283

@robx: I'm still getting "waitForProcess: does not exist (No child process)" [edit: most of the time] with branch 3.8 cabal compiled on GHC 9.2.3 (https://gitlab.haskell.org/haskell/cabal/-/jobs/1106009) and I'm not ever getting that with cabal 3.6.2. Is that just my ancient Ubuntu acting up? Could you repeat your tests with this version?

Edit: and both 3.6.2 from ghcup (which doesn't show the waitForProcess message for me) and the newly compiled cabal are said by ldd to be static exes, so probably both come from the gitlab-ci job build-x86_64-linux-alpine, so that's probably not the cause of the differences.

robx commented 2 years ago

@robx: I'm still getting "waitForProcess: does not exist (No child process)" [edit: most of the time] with branch 3.8 cabal compiled on GHC 9.2.3 (https://gitlab.haskell.org/haskell/cabal/-/jobs/1106009) and I'm not ever getting that with cabal 3.6.2. Is that just my ancient Ubuntu acting up? Could you repeat your tests with this version?

I was under the mistaken impression that GHC 9.2.3 shipped with the fixed version of the process package. Unfortunately, that turns out not to be the case: It ships with process-1.6.13.2 while the fix is in process-1.6.14.0. Sorry about that :/ (though it's probably still a good change to build with GHC 9.2).

Mikolaj commented 2 years ago

though it's probably still a good change to build with GHC 9.2

Yes, I think so.

hasufell commented 2 years ago

It's trivial to build against a newer core library. You do not need to update GHC for that.

jneira commented 2 years ago

Regarding Windows issues in general, it'd be great if there were some easy way for non-Windows devs to get access to a Windows dev environment. Is there some organisation that could e.g. provide access to Windows VMs to contributors of core Haskell tooling?

well at least we have free windows machines in ci (GitHub and gitlab) But it is hard to reproduce the hang programatically, maybe killing via cli long running cabal processes at a random points?

Mikolaj commented 2 years ago

I've just tried cabal from https://gitlab.haskell.org/haskell/cabal/-/jobs/1108769 and I no longer see the spurious process messages. [edit: thanks to #8279]

@jneira: I wonder if the Windows binary from that pipeline (but different job) hangs still?

Mikolaj commented 2 years ago

@jneira: specifically, this Windows job: https://gitlab.haskell.org/haskell/cabal/-/jobs/1108774 or just get it with ghcup (cargo-culting from @hasufell):

ghcup --no-cache install cabal -u 'https://gitlab.haskell.org/haskell/cabal/-/jobs/1108774/artifacts/raw/out/cabal-install-3.8.0.20220526-x86_64-windows.zip?inline=false' head

hasufell commented 2 years ago

@Mikolaj is this a proper prerelease?

ulysses4ever commented 2 years ago

Maybe we should create a process when cabal maintainers can submit new entries to the ghcup prelease channel.

hasufell commented 2 years ago

Maybe we should create a process when cabal maintainers can submit new entries to the ghcup prelease channel.

Whenever they want.

ulysses4ever commented 2 years ago

Right, but is there a description of how to do that? I see "Adding a new GHC version" on ghc-metadata repo. Maybe it could be extended to mention prerelease channel and cabal.

Mikolaj commented 2 years ago

@hasufell: no, this is a fake pre-release. In particular, the few new changes are not in the release notes and the release name is unchanged vs the previous real rc1.

Mikolaj commented 2 years ago

Re ghcup process, this one only involves me, Julian and a certain volume of beer, so probably not worth codifying yet.

Edit: I misunderstood it's about documenting in Cabal docs. Please disregard.

ulysses4ever commented 2 years ago

I actually meant both: ghcup and cabal docs. But if you think it's fine in the current beer-based form, I will not argue otherwise!

jneira commented 2 years ago

I don't have much time lately, will try to check this version but don't wait for me

Mikolaj commented 2 years ago

@jneira: thank you!

jneira commented 2 years ago

Friendly reminder, you dont need build from source or use ghcup to try a cabal executable from any pr, the github validate workflow make available the cabal executable used to run the test suites as artifacts of the workflow. For example for the pr backporting the process bump up for 3,8 you can download the executables here: https://github.com/haskell/cabal/actions/runs/2637190469#artifacts

jneira commented 2 years ago

I've tried such executable and i am afraid ctrl+c continue not working 😢

Will try to check a version reverting #7921 just in case

bgamari commented 2 years ago

@hasufell: in your experience, is GHC 9.2.3 stable enough to build cabal 3.8.1 with it (that ghcup would later distribute)? I'm in favour, to avoid both of stale processes and "waitForProcess: does not exist (No child process)". I suppose changing GHC_VERSION in .gitlab-ci.yml would be enough to effect the change?

If anybody has a clue about free Windows VMs for cabal devs, please let us know. @bgamari: do you think HF or GHC HQ may have any?

I just use qemu; the installation media is freely available from https://www.microsoft.com/en-us/software-download/windows10. IIRC, you are free to use it for a period of a few weeks. After that you are supposed to activate the installation but a friend told me that if you don't nothing particularly terrible happens. I have a few notes on how I configure the VM here.

Mikolaj commented 2 years ago

@robx: ^^^. Also, @jneira warns about it here: https://github.com/haskell/cabal/issues/8230#issuecomment-1186120713

@bgamari: thank you.

jneira commented 2 years ago

Ok, i've checked branch 3.8 with #7291 reverted (with the executable built by this pr in my repo) ,and cabal continues ignoring ctrl+c so @robx pr seems to be innocent and my tests in that pr were correct 😄

But the problem is present and the fact in my env one version works and the newer one does not worries me. We dont have suspects now though 😟

jneira commented 2 years ago

But the problem is present and the fact in my env one version works and the newer one does not worries me. We dont have suspects now though 😟

Maybe #7995 @robx? It was a refactor afaics but i did not check that one

EDIT: I am reverting #7995 just in case in the mentioned pr https://github.com/jneira/cabal/pull/13, will check locally

jneira commented 2 years ago

Ok, so i've tried reverting #7995 and ctrl+c works again (executable here). I would like to provide a more substantial evidence than a local test but setup a test simulating it in windows will take quite time (at least for me)

@robx i would like to examine the pr and find the cause to not having to revert it but i will not have time to not delay the release excessively, what do you think about?

robx commented 2 years ago

Thanks for your work narrowing this down!

The only vague idea of the cause that I have is that the PR changes things to apply enable_process_jobs a bit more consistently (which I understood to be desirable/harmless, but...).

Two things I'd want to do to get closer to the issue are:

Unfortunately I can't promise a timeline on getting a Windows development environment set up. :/

Mikolaj commented 2 years ago

Guys, you rock, this bug could haunt us for years and you almost nailed it already. Please take your time. The only semi-urgent decision is whether to revert for 3.8 (we'd ideally release in 2 weeks). Let's try to make this decision before the weekend. @robx: I will defer to you on that.

hasufell commented 2 years ago

Unfortunately I can't promise a timeline on getting a Windows development environment set up. :/

I've had success compiling stuff in wine with ghc-8.10.7... the main problem is that it lacks msys2 support, so packages that have configure scripts or need system libs won't build. Haven't looked much deeper into it. But I was able to build 80% of cabals dependencies.

bgamari commented 2 years ago

Ccing @Mistuke, who may also be interested in this.

jneira commented 2 years ago
  • enableProcessJobs

Two things I'd want to do to get closer to the issue are:

* bisect the commits in the PR (that hunch points at [a5986e3](https://github.com/haskell/cabal/commit/a5986e37f45d3fca329fa2a0bf03d659b2f7354d))

* disable process jobs, by removing the `enableProcessJobs` call here:

@robx

Your hypothesis was right, removing enableProcessJobs makes cabal honour ctrl-c again in my windows 10. It would be great to know the root cause and why process jobs does not work with windows > 7 and process > 1 6.9 (1.6.14 in my tests) But in the meanwhile could we simply removing enableProcessJobs?

Mikolaj commented 2 years ago

Oh, yes, let's please decide. I'd love to tag the release ASAP so that we can give it the last round of dog-fooding, while the fine GHC folk already use the tag for their own intricate purposes.

jneira commented 2 years ago

We should open an issue upstream in any case but i opened #8312 just in case we decide to removing the function (for now?)

robx commented 2 years ago

Thank you so much for following up on this!

I've traced the enableProcessJobs code back to commit https://github.com/haskell/cabal/commit/6e88f4abf63490b9347735a36f0110d2b17075d6, maybe @bgamari could weigh in on whether we're likely to be breaking things with #8312?

For context, my understanding while working on #7995 was that enableProcessJobs is something we always want to do on Windows, and if we didn't in some cases, that was an accident. Not so sure anymore, clearly. But regardless, if it can cause hangs in some cases and removing it doesn't obviously break things, it seems like a good choice to remove it...

EDIT: Some more discussion of process jobs in this ticket: https://github.com/haskell/cabal/issues/7309.

Mikolaj commented 2 years ago

@jneira: I might have missed the detail: on which versions of Windows did you manage to reproduce the problem?

Mikolaj commented 2 years ago

Am I reading this right that https://github.com/haskell/cabal/pull/7844#issuecomment-983595388 says that our change in https://github.com/haskell/cabal/pull/8312 may cause "race conditions " and that the code we remove there warns that without it "we may see sporadic build failures without jobs"? And that's for all Windows versions (but for Windows 7 the same risk was there already with original code, so no difference).

If so, the change would be valid if any recent Windows 10 patch fixed the underlying problems, making Process.use_process_jobs unneeded. But I wouldn't count on that. I'd rather expect that a recent Windows 10 patch introduced the Ctrl-C bug, but @jneira's experiments suggest that it was our innocent refactorings, not upstream OS changes. Also, I'm not sure on which Windows versions that bug manifests.

Mikolaj commented 2 years ago

Actually, now that I'm looking at #7995, it seems it adds Process.delegate_ctlc = True where there was none before (it was included only in function rawSystem that was unused and so was removed). I have no clue what's going on, but perhaps delegate_ctlc conflicts with use_process_jobs on Windows? @robx: perhaps we could just tone down the refactoring and bring back some of the ugly lack of uniformity that might have been justified by Windows quirks (also bearing in mind different Windows versions)?

jneira commented 2 years ago

@jneira: I might have missed the detail: on which versions of Windows did you manage to reproduce the problem?

A updated windows 10, i have no access to a windows 7 machine anymore

Mikolaj commented 2 years ago

A updated windows 10

Thank you. So that's quite serious and not possible to explain by the special treatment Windows 7 gets in the code.