Interrupting cabal-3.8.0.0.20220526 and master with ctrl+c makes it hang on windows #8208

jneira commented 2 years ago

I am experiencing some weird behaviour when interrupting cabal rc execution with ctrl+c on Windows: pressing ctrl+c during the Configuring step for my package, while ghc-pkg subprocesses are running, hangs the program with rc1.

I can't reproduce it with 3.6.0.0.

jneira commented 2 years ago

Am I reading this right that https://github.com/haskell/cabal/pull/7844#issuecomment-983595388 says that our change in https://github.com/haskell/cabal/pull/8312 may cause "race conditions " and that the code we remove there warns that without it "we may see sporadic build failures without jobs"?

Good note, I did not remember that one. It seems enableProcessJobs originally worked fine on Windows >= 8 and was even needed, so we can't remove it 🤦 As suggested, #7995 added some change that, combined with enableProcessJobs, causes that behaviour.

robx commented 2 years ago

delegate_ctlc is (documented to be) a no-op on Windows, so it doesn't seem like it should be a factor here.

Mistuke commented 2 years ago

If so, the change would be valid if any recent Windows 10 patch fixed the underlying problems, making Process.use_process_jobs unneeded. But I wouldn't count on that. I'd rather expect that a recent Windows 10 patch introduced the Ctrl-C bug, but @jneira's experiments suggest that it was our innocent refactorings, not upstream OS changes. Also, I'm not sure on which Windows versions that bug manifests.

The point of the process jobs is to ensure that when you press Ctrl-C that not just cabal exits but also any non-native windows programs that it calls get killed. It's unlikely that the Ctrl-C isn't working, what's much likelier is that a child process has refused to exit for whatever reason and the process became a zombie where cabal waits forever.

I'd recommend just looking at the process tree when it doesn't work and figure out which process this is. (I am currently in a train in India so can't really help debug myself).

Since cabal should still be handling the ctrl-c, does it send a sigkill or sigterm?

Mikolaj commented 2 years ago

Since cabal should still be handling the ctrl-c, does it send a sigkill or sigterm?

It seems it wasn't sending either, but it's been fixed in https://github.com/haskell/cabal/pull/7921 (there's an extensive discussion of the Windows situation in that PR) and now it sends sigTERM at https://github.com/haskell/cabal/blob/master/cabal-install/src/Distribution/Client/Signal.hs#L45

Mikolaj commented 2 years ago

delegate_ctlc is (documented to be) a no-op on Windows, so it doesn't seem like it should be a factor here.

Oh dear, it's such a pity we can't test such things easily. I guess the documentation is wrong? But I'm just deducing from #8312.

@robx: what's your recommendation for 3.8? I'd like to tag the release on Monday. We don't have to do the same thing for 3.8 and master branches.

Mistuke commented 2 years ago

delegate_ctlc is (documented to be) a no-op on Windows, so it doesn't seem like it should be a factor here.

Oh dear, it's such a pity we can't test such things easily. I guess the documentation is wrong? But I'm just deducing from #8312.

Right, but the Windows part of that PR is indeed a no-op. All the process jobs change has done is expose that cabal can leak zombie processes. Why not use https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-terminateprocess, which is sigkill, or ExitProcess, which is sigterm? Process should support these already. Process also already supports termination of nested jobs.

Mistuke commented 2 years ago

If so, the change would be valid if any recent Windows 10 patch fixed the underlying problems, making Process.use_process_jobs unneeded. But I wouldn't count on that. I'd rather expect that a recent Windows 10 patch introduced the Ctrl-C bug, but @jneira's experiments suggest that it was our innocent refactorings, not upstream OS changes. Also, I'm not sure on which Windows versions that bug manifests.

The point of the process jobs is to ensure that when you press Ctrl-C that not just cabal exits but also any non-native windows programs that it calls get killed. It's unlikely that the Ctrl-C isn't working, what's much likelier is that a child process has refused to exit for whatever reason and the process became a zombie where cabal waits forever.

To clarify, the reason for this is because the windows process model does not support exec. Exec when used is emulated by creating a new process and killing the current one. Which means the caller no longer has a handle to the new process. So you detach and lose the ability to kill it etc.

The reason things normally work out is because the parent is usually waiting for data through std handles, which the new process inherits.

So on windows the posix call exec needs to be avoided as much as possible. But certain programs like make, gcc, pkg-config are ported posix applications and so we needed a way to correctly wait for them and terminate them.
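For readers unfamiliar with the flag being discussed, here is a minimal sketch of how a caller opts into process jobs through the `process` library. It only uses `createProcess`, `proc`, `waitForProcess` and the `use_process_jobs` field of `CreateProcess` (available in process 1.5 and later; ignored on POSIX). The helper name `runInJob` and the `ghc-pkg --version` command are illustrative assumptions, not what cabal actually does.

```haskell
import System.Exit (ExitCode (..))
import System.Process

-- | Hypothetical helper: run a command inside a process job and wait for it.
-- On Windows, 'use_process_jobs' makes 'waitForProcess' wait for the whole
-- process tree, which matters for ported POSIX tools that emulate exec.
-- On POSIX the flag is ignored.
runInJob :: FilePath -> [String] -> IO ExitCode
runInJob exe args = do
  let cp = (proc exe args) { use_process_jobs = True }
  (_, _, _, ph) <- createProcess cp
  waitForProcess ph

main :: IO ()
main = runInJob "ghc-pkg" ["--version"] >>= print
```

Dropping the `use_process_jobs = True` update gives the plain behaviour, where only the direct child is waited for.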

jneira commented 2 years ago

The point of the process jobs is to ensure that when you press Ctrl-C that not just cabal exits but also any non-native windows programs that it calls get killed. It's unlikely that the Ctrl-C isn't working, what's much likelier is that a child process has refused to exit for whatever reason and the process became a zombie where cabal waits forever.

I'd recommend just looking at the process tree when it doesn't work and figure out which process this is. (I am currently in a train in India so can't really help debug myself).

Since cabal should still be handling the ctrl-c, does it send a sigkill or sigterm?

@Mistuke many thanks for participating in the discussion.

I am gonna try to be more precise describing the behaviour i am observing locally:

Env:
- Windows 10, build 19044
- cabal built from source at https://github.com/haskell/cabal/commit/ddf3ba20c48d9f82fe91ef604defd0c813f296b3 which merged #7995 using ghc-8.10.7 (so process-1.6.13.2)
- runtime ghc also 8.10.7

Tests:

PS D:\dev\ws\haskell\cabal-test> $p = Start-Process "cabal-7995" -ArgumentList build -passthru

In the new window I wait to see the message Configuring library for cabal-test-0.1.0.0.. and I press ctrl+c. That causes the execution to hang indefinitely. Main and child processes in that state:

PS D:\dev\ws\haskell\cabal-test> Get-ChildProcesses $p.id

ProcessId Name        HandleCount WorkingSetSize VirtualSize
--------- ----        ----------- -------------- -----------
14532     conhost.exe 194         12783616       2203462139904
6700      ghc-pkg.exe 87          5447680        48619520
3984      conhost.exe 106         6598656        2203414073344
15828     ghc-pkg.exe 143         148516864      4559450112

PS D:\dev\ws\haskell\cabal-test> ps -id $p.id

Handles  NPM(K)    PM(K)      WS(K)     CPU(s)     Id  SI ProcessName
-------  ------    -----      -----     ------     --  -- -----------
    152      11    22376      43720       3,83   4220   1 cabal-7995

But if I wait until after cabal has called ghc-pkg (until gcc is running, for example), ctrl+c works perfectly:

PS D:\dev\ws\haskell\cabal-test> Get-ChildProcesses $p.id

ProcessId Name         HandleCount WorkingSetSize VirtualSize
--------- ----         ----------- -------------- -----------
5724      conhost.exe  191         12402688       2203462492160
15396     gcc.exe      55          4173824        4380991488
5184      conhost.exe  106         6602752        2203415121920
2652      realgcc.exe  47          4087808        4396949504
8044      collect2.exe 50          3657728        4396445696
13532     ld.exe       68          6475776        4416765952

I am not able to make it hang with other executables like hpc or ghc, but I am not sure, because they run too fast. Also, I don't know how to reproduce ctrl+c programmatically: kill -id $cabalProcess.id and $cabalProcess.kill() always work.
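One possible way to approximate an interrupt programmatically is `interruptProcessGroupOf` from `System.Process`, sketched below. This is an approximation under stated assumptions: on Windows the function generates CTRL_BREAK_EVENT rather than CTRL_C, and it only works for a child created with `create_group = True`, so it may not exercise exactly the same code path as pressing ctrl+c in a console. The helper name `interruptAfter`, the five second delay and the `cabal build` command are illustrative.

```haskell
import Control.Concurrent (threadDelay)
import System.Exit (ExitCode (..))
import System.Process

-- | Hypothetical helper: start a command in its own process group, let it
-- run for the given number of microseconds, then interrupt the group.
-- On POSIX this sends SIGINT; on Windows it generates CTRL_BREAK_EVENT.
interruptAfter :: Int -> FilePath -> [String] -> IO ExitCode
interruptAfter delayMicros exe args = do
  let cp = (proc exe args) { create_group = True }
  (_, _, _, ph) <- createProcess cp
  threadDelay delayMicros
  interruptProcessGroupOf ph
  -- If the hang reproduces this way, this call is where it would block.
  waitForProcess ph

main :: IO ()
main = interruptAfter 5000000 "cabal" ["build"] >>= print
```

The delay would need tuning so that the interrupt lands while the Configuring step is running.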

Mistuke commented 2 years ago

In the new window i wait to see the message Configuring library for cabal-test-0.1.0.0.. and i press ctrl+c. That causes the execution hang indefinitely Main and child processes in that state:

Ok, ghc and ghc-pkg both use process jobs, so this ends up with a nested process jobs. However because ghc and ghc-pkg are already native processes we don't need a process job to call them. That said it should have worked (unless on windows 7 where the behavior of nested jobs is broken).

One thing to do is use process explorer to see what actually happens when you press ctrl-c. Inspect the state of ghc-pkg, is it doing anything? Then try manually killing ghc-pkg. If cabal then aborts then we know cabal isn't stuck, just waiting for the child to exit.

Does cabal at any point during the Ctrl-C handling on Windows call terminateProcess? This should send SIGTERM to all children in the jobs.

Mistuke commented 2 years ago

Btw, for this release, my suggestion is to just not use process jobs when calling ghc or ghc-pkg. Those two take care of their own house keeping. Keeping it for calls to configure, or anything that invokes sh.

I think (from what I recall) that is why we didn't add the process jobs everywhere in the beginning: we would have made cabal not work properly on Windows 7 as well, and some people really wanted it to work on Windows 7 at the time.

But the bug should be properly addressed for master.

jneira commented 2 years ago

One thing to do is use process explorer to see what actually happens when you press ctrl-c. Inspect the state of ghc-pkg, is it doing anything? Then try manually killing ghc-pkg. If cabal then aborts then we know cabal isn't stuck, just waiting for the child to exit.

I was just about to write that killing ghc-pkg makes the cabal process exit.

Mistuke commented 2 years ago

One thing to do is use process explorer to see what actually happens when you press ctrl-c. Inspect the state of ghc-pkg, is it doing anything? Then try manually killing ghc-pkg. If cabal then aborts then we know cabal isn't stuck, just waiting for the child to exit.

Just i was about to write that killing ghc-pkg makes cabal process go out

In that case, I wonder if this is a bug in process. My hypothesis is this: based on https://github.com/haskell/process/blob/master/cbits/win32/runProcess.c#L620 it looks like we only check if we can get the handle to a process, and assume that if we can that the process must be active.

However, the thing to note is that the Windows kernel uses reference counting to track if a kernel object is still required. So while the process may have been killed, the kernel object (and thus the handle) are still valid. During a nested process job both the parent, and the child see all descendants.

So in this case both cabal and ghc-pkg will hold a reference to whatever ghc-pkg called. So both will fall into https://github.com/haskell/process/blob/master/cbits/win32/runProcess.c#L635 which waits indefinitely on a process that is not really alive.

So I wonder if we don't need to call GetExitCodeProcess and check for STILL_ACTIVE explicitly before waiting.

There are a couple of ways to test this hypothesis.

1. In windbg use `!process 0 0` to list all zombie processes (the handlecount and objecttable for a zombie process are both 0).
2. Use process explorer and open ghc-pkg, go to threads and see if anything is waiting on `WaitForSingleObject` (if you don't see symbol names then you don't have Microsoft's symbol server set up).
3. Change the infinite wait in process for `WaitForSingleObject` to something other than infinite, perhaps 2secs or so. Both processes should let go of the kernel objects during the timeouts and the next round the Wait or OpenHandle should fail for one of them and break the stalemate.
4. Possibly easiest is to use something like https://github.com/zodiacon/ObjectExplorer and check if ghc-pkg has any zombie processes to its name.

But it's starting to feel like a problem with process that we didn't take into account zombies. Though this is just a hypothesis, I'm unable to collect data to back it up atm :)
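The idea of checking liveness before waiting, and of bounding the wait, can be sketched at the Haskell API level. This is only an analogy to the proposed change in `runProcess.c`, not that change itself: it polls `getProcessExitCode` (which does not block) instead of calling `waitForProcess`, and gives up after a time budget. The name `waitBounded`, the 100 ms step and the 2 second budget are illustrative assumptions.

```haskell
import Control.Concurrent (threadDelay)
import System.Exit (ExitCode (..))
import System.Process

-- | Hypothetical helper: poll for the child's exit code every 100 ms for at
-- most the given number of microseconds. Returns Nothing if the child is
-- still running when the budget is used up.
waitBounded :: Int -> ProcessHandle -> IO (Maybe ExitCode)
waitBounded budgetMicros ph = go budgetMicros
  where
    step = 100000
    go remaining = do
      status <- getProcessExitCode ph -- non-blocking check
      case status of
        Just code -> pure (Just code)
        Nothing
          | remaining <= 0 -> pure Nothing
          | otherwise -> threadDelay step >> go (remaining - step)

main :: IO ()
main = do
  (_, _, _, ph) <- createProcess (proc "ghc-pkg" ["list"]) { use_process_jobs = True }
  result <- waitBounded 2000000 ph
  case result of
    Just code -> print code
    Nothing -> putStrLn "still running after 2s; terminating" >> terminateProcess ph
```

A caller that hits the `Nothing` branch at least regains control and can decide whether to terminate the child or keep waiting.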

robx commented 2 years ago

@robx: what's your recommendation for 3.8? I'd like to tag the release on Monday. We don't have to do the same thing for 3.8 and master branches.

I have no recommendation. Do whatever you think is right.

jneira commented 2 years ago

well it seems the alternatives could be:

- revert the PR altogether; I tried to do it and it is not trivial (at least for me), my reverting PR did not pass CI
- try to apply process jobs to posix executables (again, it seems that was the original situation and we did too much cleaning)
- wait for an eventual process bug report and fix; I am afraid the release can't wait for it

Mikolaj commented 2 years ago

@robx: what's your recommendation for 3.8? I'd like to tag the release on Monday. We don't have to do the same thing for 3.8 and master branches.

I have no recommendation. Do whatever you think is right.

Thank you for your vote of confidence. Let's revert the #7995 cleanup, but only on branch 3.8. It's risky, because the cleanup does more than just refactor and also because it's used in subsequent commits. But we can't do better for the 3.8-final tag for the GHC people. Perhaps before the actual release in a week or so, or in the next point release, this is going to be fixed properly.

PR https://github.com/haskell/cabal/pull/8295 and the dumber and less subtle (it marks a test as known broken) https://github.com/haskell/cabal/pull/8319 are reverting #7995, so let's see which one is ready first (and which one makes the ctrl+c bug go away).

Mikolaj commented 2 years ago

Any progress with the hang on Windows with C-c? I'm miserable after reverting #7995 and the related PRs, because I need to press C-c dozens of times to stop the initial cabal build on a project with many deps (on Linux). Could we bring it back in 3.8.2.0?

jneira commented 2 years ago

no progress at my end, my plan was try to setup a minimal reproduction case to open a ticket against the process package, but I don't have much time

Mikolaj commented 1 year ago

@Mistuke: any chance you'd have some time to look at that again in the next couple of weeks? I'd hate to revert the original fixes in #7995 and #7921 for cabal 3.10 (soon to be branched off) just as I miserably did for 3.8.

One thing to do is use process explorer to see what actually happens when you press ctrl-c. Inspect the state of ghc-pkg, is it doing anything? Then try manually killing ghc-pkg. If cabal then aborts then we know cabal isn't stuck, just waiting for the child to exit.

Just i was about to write that killing ghc-pkg makes cabal process go out

In that case, I wonder if this is a bug in process. My hypothesis is this: based on https://github.com/haskell/process/blob/master/cbits/win32/runProcess.c#L620 it looks like we only check if we can get the handle to a process, and assume that if we can that the process must be active.

However, the thing to note is that the Windows kernel uses reference counting to track if a kernel object is still required. So while the process may have been killed, the kernel object (and thus the handle) are still valid. During a nested process job both the parent, and the child see all descendants.

So in this case both cabal and ghc-pkg will hold a reference to whatever ghc-pkg called. So both will fall into https://github.com/haskell/process/blob/master/cbits/win32/runProcess.c#L635 which waits indefinitely on a process that is not really alive.

So I wonder if we don't need to call GetExitCodeProcess and check for STILL_ACTIVE explicitly before waiting.

There are a couple of ways to test this hypothesis.
1. In windbg use `!process 0 0` to list all zombie processes (the handlecount and objecttable for a zombie process are both 0).

2. Use process explorer and open ghc-pkg, go to threads and see if anything is waiting on `WaitForSingleObject` (if you don't see symbol names then you don't have Microsoft's symbol server setup).

3. Change the infinite wait in process for `WaitForSingleObject` to something other than infinite, perhaps 2secs or so. Both processes should let go of the kernel objects during the timeouts and the next round the Wait or OpenHandle should fail for one of them and break the stalemate.

4. Possibly easiest is to use something like https://github.com/zodiacon/ObjectExplorer and check if ghc-pkg has any zombie processes to its name.

But it's starting to feel like a problem with process that we didn't take into account zombies. Though this is just a hypothesis, I'm unable to collect data to back it up atm :)

Mistuke commented 1 year ago

Sorry I had completely forgotten about this. Yes I'll debug this Saturday.


Mikolaj commented 1 year ago

Any preliminary results? :)

Mistuke commented 1 year ago

Just reading through the thread now to refresh my memory while I wait for a cabal build to finish

Mistuke commented 1 year ago

Hmm @jneira how do I find this cabal-7995 test?

Mistuke commented 1 year ago

I've tried building and cancelling some builds but the referenced commit ddf3ba2 seems to work fine. would appreciate some help with the repro.

jneira commented 1 year ago

Hi, I still have the cabal-7995 executable on my local machine, and I believe I built it from that commit. The hang is reproduced when the ghc-pkg subprocess is spawned; I trigger it by pressing ctrl+c when the message Configuring library for foo-package is shown.

EDIT: I just uploaded the executable to compare behaviour, after double-checking that the issue reproduces with it: cabal-7995.zip

Mistuke commented 1 year ago

Hmm I am still unable to reproduce it using that. I'm starting to think this is likely a race condition somewhere so could you give me a dump when the process hangs for you. you can do so easily using process explorer. http://live.sysinternals.com/procexp64.exe

After this I can create a trace file for you so I can get an API trace from the application which should allow me to find what's happening.

Mistuke commented 1 year ago

please note that this dump will contain your hostname and username in the contents. so if you don't wish to post it publicly you can send it to me privately.

jneira commented 1 year ago

Sure, I've just generated the dump: https://drive.google.com/file/d/1VsmB8ekLwGhE2bqVjFFLFYW2_QfPMjy3/view?usp=share_link

ghc already leaks local computer info in the executable i uploaded, so no problem 😆

Mistuke commented 1 year ago

Thanks, that was very helpful. Ah I forgot how much fun it is to debug Haskell programs without symbols 😓 It looks like there are 6 threads in use, 5 are interesting. 3 of them are blocked on an external event.. All 3 are RTS Capabilities. So basically the scheduler is waiting for work to do. 1 seems to be the GC which makes sense as well.

The last one is waiting on a synchronous file read. This seems to confirm that the deadlock is indeed happening because a read doesn't finish. This is most likely a read on the pipe for the I/O redirection.

My hypothesis so far is that without process_jobs the child program doesn't actually terminate when ctrl+c is called. The haskell side stopped listening but the child finished in the background. With process jobs we actually kill the process, but some handle is opened somewhere causing the pipe to stay open. process is supposed to close the unused ends of the pipes to prevent this.
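To illustrate why an open handle matters here: a read to end-of-file on a pipe only returns once every write end has been closed. The sketch below shows this inside a single process, using `createPipe` from `System.Process`; the helper name `roundTrip` is an assumption for illustration, and the real situation in this thread involves handles inherited across processes rather than within one.

```haskell
import System.IO (hClose, hGetContents, hPutStr)
import System.Process (createPipe)

-- | Hypothetical helper: write a message into a pipe, close the write end,
-- then read the pipe until EOF.
roundTrip :: String -> IO String
roundTrip msg = do
  (readEnd, writeEnd) <- createPipe
  hPutStr writeEnd msg
  -- Without this close, the read below would never see EOF and would block,
  -- which is the same shape as the hang hypothesised above.
  hClose writeEnd
  out <- hGetContents readEnd
  -- Force the lazy read to completion before returning.
  length out `seq` pure out

main :: IO ()
main = roundTrip "done\n" >>= putStr
```

If a child (or grandchild) process still holds a copy of the write end, the reader is in the position of this code with the `hClose` removed.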

To narrow it down I need to know what happened to the children. To do that I need an API trace. Could you download and install http://www.rohitab.com/apimonitor#Download and also save this as an xml file somewhere: https://gist.github.com/Mistuke/6aec7cf82e3e3b39184ac7c36258329d. That one contains a configuration to profile Haskell I/O operations.

Open API monitor 64-bit, go to the Filter menu and choose Load, then open the xml file you saved earlier. Then go to File -> Monitor new process.

In the dialog that opens, put the path to the program, the working directory and set Attach using to Static import. Run the program and make it hang by pressing ctrl+c in the dialog that opens.

Once it hangs, kill all the processes in Task Manager, which will give control back to API monitor. Go to File -> Save As and export the trace that was just captured and attach that here.

jneira commented 1 year ago

Wow, many thanks for the insights. I will try to follow your instructions. I wonder how the fact that you can't reproduce the hang fits in; it seems there must be some system factor in the equation.

EDIT: [image]

I was just wondering whether the difference could be in the way ghc is installed: I'm using ghcup, which uses shims to simulate hard links. I guess you are using chocolatey instead 😉. Maybe the culprit is that shim: it reads the actual location of the final executable from a file!

In the process explorer you can see the shim in the last position: it is the 32 bits ghc-pkg

jneira commented 1 year ago

I've just generated the api trace: cabal-7995.zip

jneira commented 1 year ago

I wonder if we could fix the shim C code, closing the handle to the file before it calls the real executable. At least the hang would be less probable :thinking:

Mistuke commented 1 year ago

In the process explorer you can see the shim in the last position: it is the 32 bits ghc-pkg

Lol what? It's certainly possible. A misbehaving child process can certainly keep the handle open and become a zombie. For the old ghc shims we had, we took great strides to pass signals on to the new callee by detaching from the terminal.

It should be easy to test, just use cabal with -w <fullpathtoghc>? Or I guess it uses the shims for every executable?

You can try manually extracting a ghc tarball and adding it to your path. That would be a good test. Avoid all the shims.

i've just genreated the api trace

Many thanks. Will take a look

jneira commented 1 year ago

I wonder if we could fix the shim C code, closing the handle to the file before it calls the real executable. At least the hang would be less probable :thinking:

It already does it iiuic. There is some code about ctrl+c handling:

https://github.com/71/scoop-better-shimexe/blob/bd14d36a7fd8af7bf8790dd409745e98567ae223/shim.c#L253

@Mistuke

It should be easy to test, just use cabal with -w ?

you read my mind, i had to do it tracing other shim problems in the past

jneira commented 1 year ago

Sooo cabal-7995 build -w D:\ghcup\ghc\8.10.7\bin\ghc.exe also reproduces the hang and this time the shim is not present so it seems it is not the cause 😞

jneira commented 1 year ago

Ok, I've generated two files using cabal-7995 build -w D:\ghcup\ghc\8.10.7\bin\ghc.exe:

- the dump of the ghc-pkg subprocess
- the api trace again

https://drive.google.com/file/d/1IVE6uGn0tNsiz0bFzr1eePQy4EyALuvE/view?usp=share_link

hasufell commented 1 year ago

Just for the record, the ghcup shim C code is here: https://github.com/haskell/ghcup-hs/blob/master/scoop-better-shimexe/shim.c

Mistuke commented 1 year ago

OK, so I need to dig deeper this weekend, but it looks like cabal has actually terminated:

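#      Time of Day     Thread  Module        API                                Return Value  Error  Duration
-----  --------------  ------  ------------  ---------------------------------  ------------  -----  --------
35895  9:11:43.967 PM  1       KERNEL32.DLL  RtlExitUserProcess ( 0x000000fc )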

It has a weird exit code, but that's not relevant. Most of the child processes except one terminated. But there's a ghc-pkg still lingering around, waiting on a file read that never finishes.

Cabal has exited cleanly through ExitProcess. The documentation of this states:

Exiting a process does not cause child processes to be terminated.

and since Cabal is the owner of the kernel object for the process job the child belongs to, the kernel makes the zombie process wait for the child. So there are two questions here. First off, why did the child never finish? The read is to a local file.

Regardless of that, the main question still is why was the process not forcibly terminated? The way this is supposed to work is that when the last handle to the job object is closed that this triggers the termination of all children. To do this job objects are created in process with JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE so that calling CloseHandle https://github.com/haskell/process/blob/af0614cc397231a19c152824e34ecc0bd2751bca/cbits/win32/runProcess.c#L220 kills all children.

To prevent resource leakage when the Haskell Handle is created by the job HANDLE we register a finalizer which calls CloseHandle when the haskell process terminates. So based on the fact that cabal has exited, the rts must have already run the finalizers.

So why did this not terminate the process? Well, it looks like no finalizers were run for it. The last job created was 0x274 and there's no CloseHandle call for it. This was supposed to happen through https://github.com/haskell/process/blob/eb451833f2853f49f715f6b1f639665f7be9c6c1/System/Process/Windows.hsc#L83

So it looks like, in this case the Rts didn't run the finalizers. Would you be able to compile the test program with -debug and run with +RTS -Ds and piping that to a file, trigger the hang and paste the log somewhere. This should tell us what the scheduler did as it was trying to shut down.

If it is an RTS bug, a workaround for Cabal could be to use onException on the part that reads from the child and call terminateJob on the process handle itself.
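A rough sketch of the suggested workaround shape, under stated assumptions: it uses `onException` around the part that reads from the child, but calls the portable `terminateProcess` instead of `terminateJob`, since the latter is a Windows-only internal of `process`. As far as I know `terminateProcess` also terminates the job for handles created with `use_process_jobs`, but treat that as an assumption to verify. The helper name `readChildOrKill` is hypothetical and this is not cabal's actual code.

```haskell
import Control.Exception (evaluate, onException)
import System.Exit (ExitCode (..))
import System.IO (hGetContents)
import System.Process

-- | Hypothetical helper: read all of a child's stdout and wait for it.
-- If the read or the wait is interrupted by an exception (for example the
-- UserInterrupt raised by Ctrl-C), terminate the child explicitly instead
-- of relying on finalizers, then let the exception propagate.
readChildOrKill :: FilePath -> [String] -> IO (ExitCode, String)
readChildOrKill exe args = do
  let cp = (proc exe args) { std_out = CreatePipe, use_process_jobs = True }
  (_, mOut, _, ph) <- createProcess cp
  case mOut of
    -- Should not happen when std_out is CreatePipe.
    Nothing -> ioError (userError "no stdout pipe")
    Just hOut -> do
      let readAll = do
            out <- hGetContents hOut
            _ <- evaluate (length out) -- force the whole read
            code <- waitForProcess ph
            pure (code, out)
      readAll `onException` terminateProcess ph

main :: IO ()
main = readChildOrKill "ghc-pkg" ["list"] >>= print
```

The point of the design is that cleanup happens on the exception path itself, so it does not depend on the RTS running finalizers during shutdown.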

Mikolaj commented 1 year ago

BTW, I vaguely remember there were some further bugs discovered in process package after this ticket has been opened. I think we reacted by bumping our dep on process. @jneira, are you testing current master branch of cabal? Would it fail in the same way? To clarify, current master branch of cabal is likely to exhibit the same problem as the commits flagged in this ticket, because the PR has been reverted only on 3.8 branch, not on master.

jneira commented 1 year ago

yeah, will check master again to confirm it has the same behaviour

Mikolaj commented 1 year ago

@jneira: any news?

Mikolaj commented 1 year ago

For transparency: I intend to include the PRs that have caused this issue in the cabal 3.10 branch that is about to be cut. However, if even partial fixes are available before 3.10 is released, we'd backport them to 3.10 and, OTOH, if there are no sufficient fixes, but the breakage proves more deadly than it seems to be right now, we'd have to revert (again, because we already reverted for 3.8 on the basis of avoiding unknown unknowns, because we had hints Windows is handled incorrectly, but had no idea how serious the breakage on Windows is).

Mistuke commented 1 year ago

So I need the scheduler output from @jneira to make any more progress. If it still happens on master I think I have a work around. Though I'd really want the scheduler trace to make sure.

That said it's my bday weekend so I'm off till monday :)

jneira commented 1 year ago

sorry, no time to do more tracing till (maybe) the weekend 🙃

Mikolaj commented 1 year ago

Happy Birthday @Mistuke! :D

Mistuke commented 1 year ago

Happy Birthday @Mistuke! :D

Thank you :D

jneira commented 1 year ago

yeah, will check master again to confirm it has the same behaviour

I am afraid that the bug continues to reproduce for me at bcfc79ce (https://github.com/haskell/cabal/commit/bcfc79ce1e286b1df1fef31139044c0e0503d5c7)

Mistuke commented 1 year ago

Note that I still require the scheduler trace to make progress here :)


Mikolaj commented 1 year ago

This is going to be a regression in cabal 3.10 now, not on master. However, only @jneira can reproduce it so far and, unfortunately, he is too busy, so let's wait for wider feedback with 3.10.

@Mistuke: thank you for spending the time and confirming it doesn't look as immediately and universally disastrous as I feared. Perhaps it's only a hang on ctrl-c after all and not a symptom of some more general and dangerous flaw.

Mistuke commented 1 year ago

There's an open process bug that looks remarkably similar to this one and I have been able to reproduce it and have a fix for it. So I'm hoping that fixes this problem as well.


Mistuke commented 1 year ago

https://github.com/haskell/process/pull/277 should fix this.
