haskell / process

Library for dealing with system processes
http://hackage.haskell.org/package/process
Other
87 stars 82 forks source link

MacOS: posix_spawnp: invalid argument (Bad file descriptor) #251

Closed psibi closed 2 years ago

psibi commented 2 years ago

Using the process shipped with GHC 9.2.3 (and even the latest master), I have been able to reproduce a bug while running integration tests for Stack. Following points about it:

For reproducing it via Mac, use this branch:

/Users/ec2-user/.stack/programs/x86_64-osx/ghc-9.2.3/bin/ghc-pkg-9.2.3: startProcess: posix_spawnp: invalid argument (Bad file descriptor)
Main.hs: Exited with exit code: ExitFailure 1
CallStack (from HasCallStack):
  error, called at /Users/ec2-user/stack/test/integration/lib/StackTest.hs:63:34 in main:StackTest
  stack, called at /Users/ec2-user/stack/test/integration/tests/script-extra-dep/Main.hs:4:8 in main:Main

End of log for script-extra-dep
Failed tests:
- script-extra-dep - ExitFailure 1

A quick way to reproduce it doing something like this stack-integration-test -m 111-custom-snapshot.

Note that the above branch is sprinkled with lots of logs. It was done to help debugging. Let me know if you want me to remove it. I have also sprinkled some logs on my process fork and it indicates to me that the bug happens when creating the ghc-pkg process. Also the log indicates that I get the exit status code of 9.

I invoked dtruss using the following command:

sudo dtruss ~/.local/bin/stack-integration-test -m 111-custom-snapshot 

This is the dtruss logs for the above command: https://gist.github.com/psibi/0c3c89dd2b90012d7d9f3a64ceffb73a

psibi commented 2 years ago

There is some more info about the bug here: https://github.com/commercialhaskell/stack/pull/5763#discussion_r899065794

bgamari commented 2 years ago

Thanks for opening this!

Unfortunately, the dtruss output isn't terribly helpful as it exhibits a much different failure mode from what you describe here:

...
Using runghc located at /Users/ec2-user/.stack/programs/x86_64-osx/ghc-9.2.3/bin/runghc
runghc 9.2.3
Initializing/updating the original Pantry store
Checking for project config at: /Users/ec2-user/stack/stack.yaml
Loading project config file stack.yaml
You are not the owner of '/Users/ec2-user/.stack/'. Aborting to protect file permissions.
Retry with '--allow-different-user' to disable this precaution.
stack-integration-test: Received ExitFailure 1 when running
Raw command: /Users/ec2-user/.local/bin/stack update

I suspect the sudo is at fault here. Could you try again with --allow-different-user? Also, do add -f to your dtruss invocation to ensure that forked processes are followed.

psibi commented 2 years ago

Could you try again with --allow-different-user?

Not sure what you mean by that since stack accepts --allow-different-user. Here I'm reproducing the issue with the executable named stack-integration-test which is different than stack and doesn't accept that flag:

ec2-user@ip-10-0-4-6 stack % ~/.local/bin/stack-integration-test --help
Initiating Stack integration test running
Stack integration tests

Usage: stack-integration-test [-s|--speed SPEED] [-m|--match STRING]

Available options:
  -h,--help                Show this help text

Dtruss logs for the following command:

sudo dtruss -f ~/.local/bin/stack-integration-test -m 111-  

Log link: https://gist.github.com/psibi/dc56d065590253f033e1de6a0a820c4d

bgamari commented 2 years ago

Not sure what you mean by that since stack accepts --allow-different-user

@psibi, you may need to either:

Dtruss logs for the following command:

It looks like this failed for yet another reason:

stack-integration-test: Executable named runghc not found on path: ["/usr/local/bin","/usr/bin","/bin","/usr/sbin","/sbin"]

Are you sure PATH is set properly?

psibi commented 2 years ago

@bgamari I did these things:

I cannot allow Darwin to execute dtruss as an unprivileged user since it seems it's not possible to do in AWS EC2's Mac instance. This is the command I ran:

sudo dtruss -f ~/.local/bin/stack-integration-test -m 111-custom-snapshot

Log link: https://gist.github.com/psibi/8c65361bab44fda64523e20d45790dc0

bgamari commented 2 years ago

Unfortunately it appears that dtruss isn't capturing calls to posix_spawnp, likely because it is not a system call. I suppose I will need to try to reproduce this locally. Any hints for doing so, @psibi ?

psibi commented 2 years ago

@bgamari It's straightforward to reproduce it in MacOS with the stack codebase. You can follow these steps:

Once you have the executable ready, executing something like this will reproduce the bug:

$ stack-integration-test -m 111-custom-snapshot

The above command will run the integration test of 111-custom-snapshot and that will lead straight into this bug.

bgamari commented 2 years ago

Build the branch and create the executable stack-integration-test out of it.

Specifically how should I go about building it such that it links against my process tree? Apologies for the silly question; I have relatively little experience with stack.

psibi commented 2 years ago

@bgamari Sorry, I should been more clear. These are the steps:

stack install --flag=stack:integration-tests stack
stack-integration-test -m 111-custom-snapshot

Let me know if you are stuck or need any more help. :-) (I don't have a personal Mac and I did my entire testing using Amazon's EC2 Mac instance. I can create a new instance and help you further, if any steps are unclear)

bgamari commented 2 years ago

Thanks @psibi! Very helpful.

mpickering commented 2 years ago

@psibi I can't build with your instructions. Can you please help?

process                      > [1 of 2] Compiling Main             ( /private/var/folders/zm/mbjlbmnj1bg9gnpsf60mwq7c0000gq/T/stack-31b596dfe164f4b4/process-1.6.14.0/Setup.hs, /private/var/folders/zm/mbjlbmnj1bg9gnpsf60mwq7c0000gq/T/stack-31b596dfe164f4b4/process-1.6.14.0/.stack-work/dist/x86_64-osx/Cabal-3.6.3.0/setup/Main.o )
process                      > [2 of 2] Compiling StackSetupShim   ( /Users/matt/.stack/setup-exe-src/setup-shim-mPHDZzAJ.hs, /private/var/folders/zm/mbjlbmnj1bg9gnpsf60mwq7c0000gq/T/stack-31b596dfe164f4b4/process-1.6.14.0/.stack-work/dist/x86_64-osx/Cabal-3.6.3.0/setup/StackSetupShim.o )
process                      > Linking /private/var/folders/zm/mbjlbmnj1bg9gnpsf60mwq7c0000gq/T/stack-31b596dfe164f4b4/process-1.6.14.0/.stack-work/dist/x86_64-osx/Cabal-3.6.3.0/setup/setup ...
process                      > Configuring process-1.6.14.0...
process                      > Warning: The 'build-type' is 'Configure' but there is no 'configure' script.
process                      > You probably need to run 'autoreconf -i' to generate it.
process                      > setup: configure script not found.
mpickering commented 2 years ago

Solved by adding autoreconf to my path.

mpickering commented 2 years ago

Seems that the error is due to using (setStdin closed pc) in sinkProcessStdout if you change it to use setStdin nullStream pc then it works.

psibi commented 2 years ago

@mpickering You mean in the Stack codebase ?

Because the same codebase is working for both Linux & Windows OS (and the previous process version used to work for Mac too).

psibi commented 2 years ago

Also, just to add context: I currently have an workaround for this in Mac: https://github.com/commercialhaskell/stack/pull/5763#discussion_r898936062

But ideally it's good to avoid that workaround.

mpickering commented 2 years ago

What I suggested is a workaround and highlights the cause of the issue. Your workaround also works for the same reason, because sinkProcessStderrStdout doesn't try to close the StdIn handle.

mpickering commented 2 years ago

In particular I can't reproduce this directly but I think the error is something to do with nested posix_spawnp, ie one Haskell executable calling another haskell executable, calling another executable.

mpickering commented 2 years ago

Here's at least one test which does something different on Linux/Mac but I'm not sure it's the same issue.

https://gist.github.com/bd27f3db19ddbad480c8eae212058103

If you compile both of these files then run ./T2, it fails on OSX for me with TestProcess: /Users/matt/.ghcup/bin/ghc-pkg: createProcess: posix_spawnp: failed (Undefined error: 0) but works on linux (NixOS).

robx commented 2 years ago

Here's at least one test which does something different on Linux/Mac but I'm not sure it's the same issue.

I tried reproducing this on a C level; I'm not getting quite the same errors as reported here so it's probably something slightly different, but the program below (compile with cc spawn.c, then run e.g. ./a.out echo hello) works on Linux, while it fails on macOS with posix_spawnp: No such file or directory. So on macOS asking for an already-closed file descriptor to be closed fails.

spawn.c ``` #include #include #include #include #include int main(int argc, char **argv) { pid_t pid; posix_spawn_file_actions_t fa; if (argc < 2) { exit(0); } if (close(STDIN_FILENO) != 0) { perror("close"); exit(1); } if (posix_spawn_file_actions_init(&fa) != 0) { perror("posix_spawn_file_actions_init"); exit(1); } if (posix_spawn_file_actions_addclose(&fa, STDIN_FILENO) != 0) { perror("posix_spawn_file_actions_addclose"); exit(1); } if (posix_spawnp(&pid, argv[1], &fa, NULL, &argv[1], NULL) != 0) { perror("posix_spawnp"); exit(1); } waitpid(pid, NULL, 0); } ```
bgamari commented 2 years ago

Thanks for narrowing this down, @mpickering. The problem indeed appears to be that posix_spawnp is failing due to close being called on an already-closed fd. It's a bit unclear what the specification intends here. It would make sense if addclose would "eat" errors since this would provide a race-free way of ensuring a subprocess executes with a closed fd. However, POSIX 1-2008 says of posix_spawn_file_actions_addclose:

It shall not be considered an error for the fildes argument passed to these functions to specify a file descriptor for which the specified operation could not be performed at the time of the call. Any such error will be detected when the associated file actions object is later used during a posix_spawn() or posix_spawnp() operation.

Which suggests that the posix_spawnp should fail if an addclose action is added on a closed fd. Some people recommend avoiding addclose for this and other reasons, instead advocating use of O_CLOEXEC. However, this seems extremely racy in a multithreaded program (e.g. what if two threads attempt to spawn a subprocess, one with stdin open and the other with it closed).

bgamari commented 2 years ago

The semantics suggested above are quite unfortunate since addclose throwing an error on a closed fd precludes our one means of reliably starting a subprocess with a closed stdin/stdout/stderr. This seems like quite an oversight on the part of the Open Group and I can only imagine that this is why glibc and musl rather opted to implement the non-erroring semantics, despite the fact that POSIX specifies otherwise.

We can, however, work around this by using the fact that posix_spawn_file_actions_addopen will close the fd being opened if it is already open. For instance, if we want to ensure that stdin is closed we first addopen(stdin, "/dev/null") (serving to close the inherited stdin, if it exists, and ensuring that there is a valid handle to close) and then call addclose(stdin). Given that this adds a bit of spawn overhead, we should probably guard this on a configure check.

bgamari commented 2 years ago

@mpickering has confirmed that #257 fixes the Stack reproducer.

psibi commented 2 years ago

Thanks everyone for fixing this!