Open jneira opened 2 years ago
Possibly introduced by #7921, even though it says "on unix systems".
could you try building cabal with process 1.6.14 or later? building with GHC 9.2 may include that by default I don't remember which GHC gitlab CI uses and so which which GHC the cabal binaries have been built nor which GHC ghcup uses to build RC1 cabal anyway, that's a long shot, process < 1.6.14 had some faults, but possibly not as grave as hanging
cabal in gitlab is built with ghc-8.10.7 iirc
I checked that pr and I did not get any hang, tests could be wrong though https://github.com/haskell/cabal/pull/7921#issuecomment-1064897302
And a confirmation of the bug by another windows user would be great, i will try again with other projects myself
Yes, dear Windows users, please reproduce.
The "on unix systems" is just to say that the change isn't expected to necessarily fix things on Windows. It definitely does touch Windows too, and there's a decent chance of something like this slipping through...
One factor that might come into play is that GHC has two different Windows I/O managers, which behave differently with respect to types and sub-processes, see e.g. here https://github.com/haskell/process/issues/235. It might be worth checking if that is involved here. (I don't really know which GHC version uses which I/O manager by default and/or whether they're switchable...)
in my test I used ghc-8.10.7, which did not have the new io manager cabal release itself has been built with ghc-8.10.7 afair that new io manager is still not the default in ghc-9.2
FYI: I'm getting "waitForProcess: does not exist (No child process)" when interrupting long running cabal test
, on an ancient Ubuntu. I don't remember getting that before 3.8, but my memory is unreliable. However, this is obviously better than accumulating half-dead processes.
We still have to decide whether to revert for 3.8, e.g., waiting until the new GHC IO fixes the problem or until somebody fixes it differently or until enough Windows users can't reproduce it and so a fluke on a single Windows system is a valid suspicion.
An argument for not reverting is that we only got one report of the problem for the pre-release. However, not many Windows users are early adopters so that may be a sampling bias.
We even are not sure what is the cause (if it is reproducible at all!). Small evidence to perform any action or delay de release imho. I am afraid only the release will give us more feedback 🤷
It's a bit tangential to this particular issue, but how about generally building releases with GHC 9.2? That would address those "no child process" errors for the released cabal versions at least.
Regarding Windows issues in general, it'd be great if there were some easy way for non-Windows devs to get access to a Windows dev environment. Is there some organisation that could e.g. provide access to Windows VMs to contributors of core Haskell tooling?
@hasufell: in your experience, is GHC 9.2.3 stable enough to build cabal 3.8.1 with it (that ghcup would later distribute)? I'm in favour, to avoid both of stale processes and "waitForProcess: does not exist (No child process)". I suppose changing GHC_VERSION
in .gitlab-ci.yml
would be enough to effect the change?
If anybody has a clue about free Windows VMs for cabal devs, please let us know. @bgamari: do you think HF or GHC HQ may have any?
in your experience, is GHC 9.2.3 stable enough to build cabal 3.8.1
I don't use 9.2.3 actively, so I don't know.
But don't your users complain to you about particular GHCs and demand their money back?
But don't your users complain to you about particular GHCs and demand their money back?
Constantly.
I take it as GHC 9.2.3 not being particularly maligned by the users, so let's try #8271.
@jneira: I've compiled branch 3.8 with GHC 9.2.3. Could you try if your problem persists? https://gitlab.haskell.org/haskell/cabal/-/pipelines/54283
@robx: I'm still getting "waitForProcess: does not exist (No child process)" [edit: most of the time] with branch 3.8 cabal compiled on GHC 9.2.3 (https://gitlab.haskell.org/haskell/cabal/-/jobs/1106009) and I'm not ever getting that with cabal 3.6.2. Is that just my ancient Ubuntu acting up? Could you repeat your tests with this version?
Edit: and both 3.6.2 from ghcup (which doesn't show the waitForProcess message for me) and the newly compiled cabal are said by ldd to be static exes, so probably both come from the gitlab-ci job build-x86_64-linux-alpine, so that's probably not the cause of the differences.
@robx: I'm still getting "waitForProcess: does not exist (No child process)" [edit: most of the time] with branch 3.8 cabal compiled on GHC 9.2.3 (https://gitlab.haskell.org/haskell/cabal/-/jobs/1106009) and I'm not ever getting that with cabal 3.6.2. Is that just my ancient Ubuntu acting up? Could you repeat your tests with this version?
I was under the mistaken impression that GHC 9.2.3 shipped with the fixed version of the process package. Unfortunately, that turns out not to be the case: It ships with process-1.6.13.2 while the fix is in process-1.6.14.0. Sorry about that :/ (though it's probably still a good change to build with GHC 9.2).
though it's probably still a good change to build with GHC 9.2
Yes, I think so.
It's trivial to build against a newer core library. You do not need to update GHC for that.
Regarding Windows issues in general, it'd be great if there were some easy way for non-Windows devs to get access to a Windows dev environment. Is there some organisation that could e.g. provide access to Windows VMs to contributors of core Haskell tooling?
well at least we have free windows machines in ci (GitHub and gitlab) But it is hard to reproduce the hang programatically, maybe killing via cli long running cabal processes at a random points?
I've just tried cabal from https://gitlab.haskell.org/haskell/cabal/-/jobs/1108769 and I no longer see the spurious process messages. [edit: thanks to #8279]
@jneira: I wonder if the Windows binary from that pipeline (but different job) hangs still?
@jneira: specifically, this Windows job: https://gitlab.haskell.org/haskell/cabal/-/jobs/1108774 or just get it with ghcup (cargo-culting from @hasufell):
ghcup --no-cache install cabal -u 'https://gitlab.haskell.org/haskell/cabal/-/jobs/1108774/artifacts/raw/out/cabal-install-3.8.0.20220526-x86_64-windows.zip?inline=false' head
@Mikolaj is this a proper prerelease?
Maybe we should create a process when cabal maintainers can submit new entries to the ghcup prelease channel.
Maybe we should create a process when cabal maintainers can submit new entries to the ghcup prelease channel.
Whenever they want.
Right, but is there a description of how to do that? I see "Adding a new GHC version" on ghc-metadata repo. Maybe it could be extended to mention prerelease channel and cabal.
@hasufell: no, this is a fake pre-release. In particular, the few new changes are not in the release notes and the release name is unchanged vs the previous real rc1.
Re ghcup process, this one only involves me, Julian and a certain volume of beer, so probably not worth codifying yet.
Edit: I misunderstood it's about documenting in Cabal docs. Please disregard.
I actually meant both: ghcup and cabal docs. But if you think it's fine in the current beer-based form, I will not argue otherwise!
I don't have much time lately, will try to check this version but don't wait for me
@jneira: thank you!
Friendly reminder, you dont need build from source or use ghcup to try a cabal executable from any pr, the github validate workflow make available the cabal executable used to run the test suites as artifacts of the workflow. For example for the pr backporting the process bump up for 3,8 you can download the executables here: https://github.com/haskell/cabal/actions/runs/2637190469#artifacts
I've tried such executable and i am afraid ctrl+c continue not working 😢
Will try to check a version reverting #7921 just in case
@hasufell: in your experience, is GHC 9.2.3 stable enough to build cabal 3.8.1 with it (that ghcup would later distribute)? I'm in favour, to avoid both of stale processes and "waitForProcess: does not exist (No child process)". I suppose changing
GHC_VERSION
in.gitlab-ci.yml
would be enough to effect the change?If anybody has a clue about free Windows VMs for cabal devs, please let us know. @bgamari: do you think HF or GHC HQ may have any?
I just use qemu; the installation media is freely available from https://www.microsoft.com/en-us/software-download/windows10. IIRC, you are free to use it for a period of a few weeks. After that you are supposed to activate the installation but a friend told me that if you don't nothing particularly terrible happens. I have a few notes on how I configure the VM here.
@robx: ^^^. Also, @jneira warns about it here: https://github.com/haskell/cabal/issues/8230#issuecomment-1186120713
@bgamari: thank you.
Ok, i've checked branch 3.8 with #7291 reverted (with the executable built by this pr in my repo) ,and cabal continues ignoring ctrl+c so @robx pr seems to be innocent and my tests in that pr were correct 😄
But the problem is present and the fact in my env one version works and the newer one does not worries me. We dont have suspects now though 😟
But the problem is present and the fact in my env one version works and the newer one does not worries me. We dont have suspects now though 😟
Maybe #7995 @robx? It was a refactor afaics but i did not check that one
EDIT: I am reverting #7995 just in case in the mentioned pr https://github.com/jneira/cabal/pull/13, will check locally
Ok, so i've tried reverting #7995 and ctrl+c works again (executable here). I would like to provide a more substantial evidence than a local test but setup a test simulating it in windows will take quite time (at least for me)
@robx i would like to examine the pr and find the cause to not having to revert it but i will not have time to not delay the release excessively, what do you think about?
Thanks for your work narrowing this down!
The only vague idea of the cause that I have is that the PR changes things to apply enable_process_jobs
a bit more consistently (which I understood to be desirable/harmless, but...).
Two things I'd want to do to get closer to the issue are:
enableProcessJobs
call here: https://github.com/haskell/cabal/blob/207d4ee08b929ae71ae2c746fe6da660bc129f05/Cabal/src/Distribution/Compat/Process.hs#L61Unfortunately I can't promise a timeline on getting a Windows development environment set up. :/
Guys, you rock, this bug could haunt us for years and you almost nailed it already. Please take your time. The only semi-urgent decision is whether to revert for 3.8 (we'd ideally release in 2 weeks). Let's try to make this decision before the weekend. @robx: I will defer to you on that.
Unfortunately I can't promise a timeline on getting a Windows development environment set up. :/
I've had success compiling stuff in wine with ghc-8.10.7... the main problem is that it lacks msys2 support, so packages that have configure scripts or need system libs won't build. Haven't looked much deeper into it. But I was able to build 80% of cabals dependencies.
Ccing @Mistuke, who may also be interested in this.
enableProcessJobs
Two things I'd want to do to get closer to the issue are:
* bisect the commits in the PR (that hunch points at [a5986e3](https://github.com/haskell/cabal/commit/a5986e37f45d3fca329fa2a0bf03d659b2f7354d)) * disable process jobs, by removing the `enableProcessJobs` call here:
@robx
Your hypothesis was right, removing enableProcessJobs
makes cabal honour ctrl-c again in my windows 10.
It would be great to know the root cause and why process jobs does not work with windows > 7 and process > 1
6.9 (1.6.14 in my tests)
But in the meanwhile could we simply removing enableProcessJobs
?
Oh, yes, let's please decide. I'd love to tag the release ASAP so that we can give it the last round of dog-fooding, while the fine GHC folk already use the tag for their own intricate purposes.
We should open an issue upstream in any case but i opened #8312 just in case we decide to removing the function (for now?)
Thank you so much for following up on this!
I've traced the enableProcessJobs
code back to commit https://github.com/haskell/cabal/commit/6e88f4abf63490b9347735a36f0110d2b17075d6, maybe @bgamari could weigh in on whether we're likely to be breaking things with #8312?
For context, my understanding while working on #7995 was that enableProcessJobs
is something we always want to do on Windows, and if we didn't in some cases, that was an accident. Not so sure anymore, clearly. But regardless, if it can cause hangs in some cases and removing it doesn't obviously break things, it seems like a good choice to remove it...
EDIT: Some more discussion of process jobs in this ticket: https://github.com/haskell/cabal/issues/7309.
@jneira: I might have missed the detail: on which versions of Windows did you manage to reproduce the problem?
Am I reading this right that https://github.com/haskell/cabal/pull/7844#issuecomment-983595388 says that our change in https://github.com/haskell/cabal/pull/8312 may cause "race conditions " and that the code we remove there warns that without it "we may see sporadic build failures without jobs"? And that's for all Windows versions (but for Windows 7 the same risk was there already with original code, so no difference).
If so, the change would be valid if any recent Windows 10 patch fixed the underlying problems, making Process.use_process_jobs
unneeded. But I wouldn't count on that. I'd rather expect that a recent Windows 10 patch introduced the Ctrl-C bug, but @jneira's experiments suggest that it was our innocent refactorings, not upstream OS changes. Also, I'm not sure on which Windows versions that bug manifests.
Actually, now that I'm looking at #7995, it seems it adds Process.delegate_ctlc = True
where there was none before (it was included only in function rawSystem
that was unused and so was removed). I have no clue what's going on, but perhaps delegate_ctlc
conflicts with use_process_jobs
on Windows? @robx: perhaps we could just tone down the refactoring and bring back some of the ugly lack of uniformity that might have been justified by Windows quirks (also bearing in mind different Windows versions)?
@jneira: I might have missed the detail: on which versions of Windows did you manage to reproduce the problem?
A updated windows 10, i have no access to a windows 7 machine anymore
A updated windows 10
Thank you. So that's quite serious and not possible to explain by the special treatment Windows 7 gets in the code.
I am experiencing some weird behaviour interrupting cabal rc execution with ctrl+c in windows: Doing a ctrl+c in the
Configuring component for my package
, with ghc-pkg subprocesses, hangs the program with rc1I cant reproduce with 3.6.0.0