Closed psibi closed 2 years ago
There is some more info about the bug here: https://github.com/commercialhaskell/stack/pull/5763#discussion_r899065794
Thanks for opening this!
Unfortunately, the dtruss
output isn't terribly helpful as it exhibits a much different failure mode from what you describe here:
...
Using runghc located at /Users/ec2-user/.stack/programs/x86_64-osx/ghc-9.2.3/bin/runghc
runghc 9.2.3
Initializing/updating the original Pantry store
Checking for project config at: /Users/ec2-user/stack/stack.yaml
Loading project config file stack.yaml
You are not the owner of '/Users/ec2-user/.stack/'. Aborting to protect file permissions.
Retry with '--allow-different-user' to disable this precaution.
stack-integration-test: Received ExitFailure 1 when running
Raw command: /Users/ec2-user/.local/bin/stack update
I suspect the sudo
is at fault here. Could you try again with --allow-different-user
? Also, do add -f
to your dtruss
invocation to ensure that forked processes are followed.
Could you try again with --allow-different-user?
Not sure what you mean by that since stack
accepts --allow-different-user
. Here I'm reproducing the issue with the executable named stack-integration-test
which is different than stack
and doesn't accept that flag:
ec2-user@ip-10-0-4-6 stack % ~/.local/bin/stack-integration-test --help
Initiating Stack integration test running
Stack integration tests
Usage: stack-integration-test [-s|--speed SPEED] [-m|--match STRING]
Available options:
-h,--help Show this help text
Dtruss logs for the following command:
sudo dtruss -f ~/.local/bin/stack-integration-test -m 111-
Log link: https://gist.github.com/psibi/dc56d065590253f033e1de6a0a820c4d
Not sure what you mean by that since stack accepts
--allow-different-user
@psibi, you may need to either:
stack-integration-test
to pass --allow-different-user
to stack
chown root -R /Users/ec2-user/.stack/
to circumvent the check, ordtruss
as an unprivileged userDtruss logs for the following command:
It looks like this failed for yet another reason:
stack-integration-test: Executable named runghc not found on path: ["/usr/local/bin","/usr/bin","/bin","/usr/sbin","/sbin"]
Are you sure PATH
is set properly?
@bgamari I did these things:
stack
to use --allow-different-user
I cannot allow Darwin to execute dtruss
as an unprivileged user since it seems it's not possible to do in AWS EC2's Mac instance. This is the command I ran:
sudo dtruss -f ~/.local/bin/stack-integration-test -m 111-custom-snapshot
Log link: https://gist.github.com/psibi/8c65361bab44fda64523e20d45790dc0
Unfortunately it appears that dtruss
isn't capturing calls to posix_spawnp
, likely because it is not a system call. I suppose I will need to try to reproduce this locally. Any hints for doing so, @psibi ?
@bgamari It's straightforward to reproduce it in MacOS with the stack codebase. You can follow these steps:
stack-integration-test
out of it.Once you have the executable ready, executing something like this will reproduce the bug:
$ stack-integration-test -m 111-custom-snapshot
The above command will run the integration test of 111-custom-snapshot
and that will lead straight into this bug.
Build the branch and create the executable stack-integration-test out of it.
Specifically how should I go about building it such that it links against my process
tree? Apologies for the silly question; I have relatively little experience with stack
.
@bgamari Sorry, I should been more clear. These are the steps:
stack
and stack-integration-test
binary into ~/.local/bin
:stack install --flag=stack:integration-tests stack
stack-integration-test
binary:stack-integration-test -m 111-custom-snapshot
Let me know if you are stuck or need any more help. :-) (I don't have a personal Mac and I did my entire testing using Amazon's EC2 Mac instance. I can create a new instance and help you further, if any steps are unclear)
Thanks @psibi! Very helpful.
@psibi I can't build with your instructions. Can you please help?
process > [1 of 2] Compiling Main ( /private/var/folders/zm/mbjlbmnj1bg9gnpsf60mwq7c0000gq/T/stack-31b596dfe164f4b4/process-1.6.14.0/Setup.hs, /private/var/folders/zm/mbjlbmnj1bg9gnpsf60mwq7c0000gq/T/stack-31b596dfe164f4b4/process-1.6.14.0/.stack-work/dist/x86_64-osx/Cabal-3.6.3.0/setup/Main.o )
process > [2 of 2] Compiling StackSetupShim ( /Users/matt/.stack/setup-exe-src/setup-shim-mPHDZzAJ.hs, /private/var/folders/zm/mbjlbmnj1bg9gnpsf60mwq7c0000gq/T/stack-31b596dfe164f4b4/process-1.6.14.0/.stack-work/dist/x86_64-osx/Cabal-3.6.3.0/setup/StackSetupShim.o )
process > Linking /private/var/folders/zm/mbjlbmnj1bg9gnpsf60mwq7c0000gq/T/stack-31b596dfe164f4b4/process-1.6.14.0/.stack-work/dist/x86_64-osx/Cabal-3.6.3.0/setup/setup ...
process > Configuring process-1.6.14.0...
process > Warning: The 'build-type' is 'Configure' but there is no 'configure' script.
process > You probably need to run 'autoreconf -i' to generate it.
process > setup: configure script not found.
Solved by adding autoreconf
to my path.
Seems that the error is due to using (setStdin closed pc)
in sinkProcessStdout
if you change it to use setStdin nullStream pc
then it works.
@mpickering You mean in the Stack codebase ?
Because the same codebase is working for both Linux & Windows OS (and the previous process version used to work for Mac too).
Also, just to add context: I currently have an workaround for this in Mac: https://github.com/commercialhaskell/stack/pull/5763#discussion_r898936062
But ideally it's good to avoid that workaround.
What I suggested is a workaround and highlights the cause of the issue. Your workaround also works for the same reason, because sinkProcessStderrStdout
doesn't try to close the StdIn handle.
In particular I can't reproduce this directly but I think the error is something to do with nested posix_spawnp, ie one Haskell executable calling another haskell executable, calling another executable.
Here's at least one test which does something different on Linux/Mac but I'm not sure it's the same issue.
https://gist.github.com/bd27f3db19ddbad480c8eae212058103
If you compile both of these files then run ./T2
, it fails on OSX for me with TestProcess: /Users/matt/.ghcup/bin/ghc-pkg: createProcess: posix_spawnp: failed (Undefined error: 0)
but works on linux (NixOS).
Here's at least one test which does something different on Linux/Mac but I'm not sure it's the same issue.
I tried reproducing this on a C level; I'm not getting quite the same errors as reported here so it's probably something slightly different, but the program below (compile with cc spawn.c
, then run e.g. ./a.out echo hello
) works on Linux, while it fails on macOS with posix_spawnp: No such file or directory
. So on macOS asking for an already-closed file descriptor to be closed fails.
Thanks for narrowing this down, @mpickering. The problem indeed appears to be that posix_spawnp
is failing due to close
being called on an already-closed fd. It's a bit unclear what the specification intends here. It would make sense if addclose
would "eat" errors since this would provide a race-free way of ensuring a subprocess executes with a closed fd. However, POSIX 1-2008 says of posix_spawn_file_actions_addclose
:
It shall not be considered an error for the fildes argument passed to these functions to specify a file descriptor for which the specified operation could not be performed at the time of the call. Any such error will be detected when the associated file actions object is later used during a posix_spawn() or posix_spawnp() operation.
Which suggests that the posix_spawnp
should fail if an addclose
action is added on a closed fd. Some people recommend avoiding addclose
for this and other reasons, instead advocating use of O_CLOEXEC
. However, this seems extremely racy in a multithreaded program (e.g. what if two threads attempt to spawn a subprocess, one with stdin
open and the other with it closed).
The semantics suggested above are quite unfortunate since addclose
throwing an error on a closed fd
precludes our one means of reliably starting a subprocess with a closed stdin/stdout/stderr. This seems like quite an oversight on the part of the Open Group and I can only imagine that this is why glibc
and musl
rather opted to implement the non-erroring semantics, despite the fact that POSIX specifies otherwise.
We can, however, work around this by using the fact that posix_spawn_file_actions_addopen
will close the fd
being opened if it is already open. For instance, if we want to ensure that stdin
is closed we first addopen(stdin, "/dev/null")
(serving to close the inherited stdin
, if it exists, and ensuring that there is a valid handle to close) and then call addclose(stdin)
. Given that this adds a bit of spawn overhead, we should probably guard this on a configure
check.
@mpickering has confirmed that #257 fixes the Stack reproducer.
Thanks everyone for fixing this!
Using the process shipped with GHC 9.2.3 (and even the latest master), I have been able to reproduce a bug while running integration tests for Stack. Following points about it:
For reproducing it via Mac, use this branch:
A quick way to reproduce it doing something like this
stack-integration-test -m 111-custom-snapshot
.Note that the above branch is sprinkled with lots of logs. It was done to help debugging. Let me know if you want me to remove it. I have also sprinkled some logs on my process fork and it indicates to me that the bug happens when creating the
ghc-pkg
process. Also the log indicates that I get the exit status code of 9.I invoked dtruss using the following command:
This is the dtruss logs for the above command: https://gist.github.com/psibi/0c3c89dd2b90012d7d9f3a64ceffb73a