Open rgooch opened 2 months ago
I think we'll need a reproducer to do anything about this report.
You didn't really say what the child process is doing, and in particular whether it might itself be spawning other processes. See the discussion at https://pkg.go.dev/os/exec#Cmd.WaitDelay. Could this be related to that?
The subprocess is quite simple: it's a BusyBox shell (ash) script:
#! /tmp/bb/ash
set -eu
ls -Fl /proc/$$/fd > /tmp/test.fds
sleep 0.05
exit 0
I know about WaitDelay, but that should not be needed because the process and all children exit promptly.
Thanks.
Both the read and write side of the stderr pipe are open, and the stdout pipe is nowhere to be found.
I may be misreading, but it seems to me that the pipes in the child are 61465 and 61466, whereas the pipe in the parent is 61446, which is different from both. That is, the parent doesn't seem to have either of the pipes in the child, though it does have a different unrelated pipe.
If you send the hanging process a SIGQUIT, or press the ^\ keys, do you get a stack trace of where the program is hanging? It might help to know exactly which system call is not returning. Thanks.
Ah, good eye. Here are extracts of the relevant stack trace:
goroutine 1 [chan receive, 1246 minutes]:
os/exec.(*Cmd).awaitGoroutines(0xc0003fc180, 0x0)
/usr/local/go1.22.6-amd64/src/os/exec/exec.go:957 +0x3b1
os/exec.(*Cmd).Wait(0xc0003fc180)
/usr/local/go1.22.6-amd64/src/os/exec/exec.go:924 +0x2d0
goroutine 181 [IO wait, 1246 minutes]:
internal/poll.runtime_pollWait(0x7fc1d4e7acc8, 0x72)
/usr/local/go1.22.6-amd64/src/runtime/netpoll.go:345 +0x85
internal/poll.(*pollDesc).wait(0xc000690140, 0x72, 0x1)
/usr/local/go1.22.6-amd64/src/internal/poll/fd_poll_runtime.go:84 +0xb1
internal/poll.(*pollDesc).waitRead(...)
/usr/local/go1.22.6-amd64/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000690120, {0xc00028c000, 0x8000, 0x8000})
/usr/local/go1.22.6-amd64/src/internal/poll/fd_unix.go:164 +0x466
os.(*File).read(...)
/usr/local/go1.22.6-amd64/src/os/file_posix.go:29
os.(*File).Read(0xc000398008, {0xc00028c000, 0x8000, 0x8000})
/usr/local/go1.22.6-amd64/src/os/file.go:118 +0xad
io.copyBuffer({0xcde0a0, 0xc0000fa000}, {0xcde5c0, 0xc00048e010}, {0x0, 0x0, 0x0})
/usr/local/go1.22.6-amd64/src/io/io.go:429 +0x29b
io.Copy(...)
/usr/local/go1.22.6-amd64/src/io/io.go:388
os.genericWriteTo(0xc000398008, {0xcde0a0, 0xc0000fa000})
/usr/local/go1.22.6-amd64/src/os/file.go:269 +0x70
os.(*File).WriteTo(0xc000398008, {0xcde0a0, 0xc0000fa000})
/usr/local/go1.22.6-amd64/src/os/file.go:247 +0xd5
io.copyBuffer({0xcde0a0, 0xc0000fa000}, {0xcddf00, 0xc000398008}, {0x0, 0x0, 0x0})
/usr/local/go1.22.6-amd64/src/io/io.go:411 +0xd4
io.Copy(...)
/usr/local/go1.22.6-amd64/src/io/io.go:388
os/exec.(*Cmd).writerDescriptor.func1()
/usr/local/go1.22.6-amd64/src/os/exec/exec.go:580 +0x5b
os/exec.(*Cmd).Start.func2(0xc0002640a0)
/usr/local/go1.22.6-amd64/src/os/exec/exec.go:733 +0x3d
created by os/exec.(*Cmd).Start in goroutine 1
/usr/local/go1.22.6-amd64/src/os/exec/exec.go:732 +0x11c5
goroutine 182 [IO wait, 1246 minutes]:
internal/poll.runtime_pollWait(0x7fc1d4e7aad8, 0x72)
/usr/local/go1.22.6-amd64/src/runtime/netpoll.go:345 +0x85
internal/poll.(*pollDesc).wait(0xc000690200, 0x72, 0x1)
/usr/local/go1.22.6-amd64/src/internal/poll/fd_poll_runtime.go:84 +0xb1
internal/poll.(*pollDesc).waitRead(...)
/usr/local/go1.22.6-amd64/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0006901e0, {0xc0001aa000, 0x8000, 0x8000})
/usr/local/go1.22.6-amd64/src/internal/poll/fd_unix.go:164 +0x466
os.(*File).read(...)
/usr/local/go1.22.6-amd64/src/os/file_posix.go:29
os.(*File).Read(0xc000398030, {0xc0001aa000, 0x8000, 0x8000})
/usr/local/go1.22.6-amd64/src/os/file.go:118 +0xad
io.copyBuffer({0xcde0a0, 0xc0000fa160}, {0xcde5c0, 0xc000124000}, {0x0, 0x0, 0x0})
/usr/local/go1.22.6-amd64/src/io/io.go:429 +0x29b
io.Copy(...)
/usr/local/go1.22.6-amd64/src/io/io.go:388
os.genericWriteTo(0xc000398030, {0xcde0a0, 0xc0000fa160})
/usr/local/go1.22.6-amd64/src/os/file.go:269 +0x70
os.(*File).WriteTo(0xc000398030, {0xcde0a0, 0xc0000fa160})
/usr/local/go1.22.6-amd64/src/os/file.go:247 +0xd5
io.copyBuffer({0xcde0a0, 0xc0000fa160}, {0xcddf00, 0xc000398030}, {0x0, 0x0, 0x0})
/usr/local/go1.22.6-amd64/src/io/io.go:411 +0xd4
io.Copy(...)
/usr/local/go1.22.6-amd64/src/io/io.go:388
os/exec.(*Cmd).writerDescriptor.func1()
/usr/local/go1.22.6-amd64/src/os/exec/exec.go:580 +0x5b
os/exec.(*Cmd).Start.func2(0xc0002640e0)
/usr/local/go1.22.6-amd64/src/os/exec/exec.go:733 +0x3d
created by os/exec.(*Cmd).Start in goroutine 1
/usr/local/go1.22.6-amd64/src/os/exec/exec.go:732 +0x11c5
Thanks. That definitely looks like the parent process is waiting for the pipes to be closed by the child. That is, the kind of thing affected by WaitDelay. I don't have an explanation for why the parent is still waiting, though.
Go version
go version go1.22.6 linux/amd64
Output of go env in your module/workspace:
What did you do?
This captures the essence of what I'm doing when starting a process. Note that there are other goroutines, in particular a goroutine that is accepting HTTP CONNECT requests concurrently. I have not yet been able to reproduce the problem with a minimal programme.
I've taken a look at the os/exec implementation and haven't yet seen a potential bug, but given the core of what I'm doing is so simple, I have to wonder if there is a bug lurking somewhere in the standard library. It may be a subtle interaction with other file descriptors being created concurrently.
What did you see happen?
The Start() succeeds, but sometimes the Wait() never returns, despite the process (and all its children) exiting. When Wait() never returns, I was able to capture the file descriptors in the child process before it exited:
As you can see, stdout and stderr each have the write-side of their own pipe, as expected. The child process (and any dependents) definitely exit shortly afterwards; they do not appear in the process table.
Inside the parent process (the one using os/exec), these are the file descriptors:
Both the read and write side of the stderr pipe are open, and the stdout pipe is nowhere to be found.
What did you expect to see?
The Wait() should return as soon as the process exits. Also, the read-side of the stdout and stderr pipes should be open in the parent process, but not the write-side.