golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.79k stars 17.64k forks source link

os/exec: possible race handling stdout&stderr pipes #69060

Open rgooch opened 2 months ago

rgooch commented 2 months ago

Go version

go version go1.22.6 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/rgooch/.cache/go-build'
GOENV='/home/rgooch/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/rgooch/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/rgooch/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go1.22.6-amd64'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go1.22.6-amd64/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.22.6'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/home/rgooch/git/Dominator/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build438858212=/tmp/go-build -gno-record-gcc-switches'

What did you do?

stderr := &strings.Builder{}
stdout := &strings.Builder{}
cmd := exec.Command(path, args...)
cmd.Env = make([]string, 0)
cmd.Stderr = stderr
cmd.Stdout = stdout
if err := cmd.Start(); err != nil {
    return err
}
if err := cmd.Wait(); err != nil {
    return err
}

This captures the essence of what I'm doing when starting a process. Note that there are other goroutines, in particular a goroutine that is accepting HTTP CONNECT requests concurrently. I have not yet been able to reproduce the problem with a minimal programme.

I've taken a look at the os/exec implementation and haven't yet seen a potential bug, but given the core of what I'm doing is so simple, I have to wonder if there is a bug lurking somewhere in the standard library. It may be a subtle interaction with other file descriptors being created concurrently.

What did you see happen?

The Start() succeeds, but sometimes the Wait() never returns, despite the process (and all its children) exiting. When Wait() never returns, I was able to capture the file descriptors in the child process before it exited:

lr-x------ 1 0 0 64 0 -> /dev/null
l-wx------ 1 0 0 64 1 -> pipe:[61465]
l-wx------ 1 0 0 64 2 -> pipe:[61466]
lr-x------ 1 0 0 64 10 -> /tmp/test.sh

As you can see, stdout and stderr each have the write-side of their own pipe, as expected. The child process (and any dependents) definitely exit shortly afterwards; they do not appear in the process table.

Inside the parent process (the one using os/exec), these are the file descriptors:

lrwx------    1 0        0               64 Aug 25 03:41 0 -> /dev/console
lrwx------    1 0        0               64 Aug 25 03:41 1 -> /dev/console
lrwx------    1 0        0               64 Aug 25 03:42 10 -> /dev/pts/0
lrwx------    1 0        0               64 Aug 25 03:42 11 -> socket:[23564]=
lrwx------    1 0        0               64 Aug 25 03:41 2 -> /dev/console
lrwx------    1 0        0               64 Aug 25 03:41 3 -> socket:[15381]=
lrwx------    1 0        0               64 Aug 25 03:41 4 -> anon_inode:[eventpoll]
lr-x------    1 0        0               64 Aug 25 03:41 5 -> pipe:[61446]|
l-wx------    1 0        0               64 Aug 25 03:41 6 -> pipe:[61446]|
l-wx------    1 0        0               64 Aug 25 03:42 7 -> /tmp/log
lrwx------    1 0        0               64 Aug 25 03:42 8 -> socket:[61457]=
lrwx------    1 0        0               64 Aug 25 03:41 9 -> /dev/ptmx

Both the read and write side of the stderr pipe are open, and the stdout pipe is nowhere to be found.

What did you expect to see?

The Wait() should return as soon as the process exists. Also, the read-side of the stdout and stderr pipes should be open in the parent process, but not the write-side.

gabyhelp commented 2 months ago

Related Issues and Documentation

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

seankhliao commented 2 months ago

I think we'll need a reproducer to do anything about this report.

ianlancetaylor commented 2 months ago

You didn't really say what the child process is doing, and in particular whether it might itself be spawing other processes. See the discussion at https://pkg.go.dev/os/exec#Cmd.WaitDelay. Could this be related to that?

rgooch commented 2 months ago

The subprocess is quite simple: it's a BusyBox shell (ash) script:

#! /tmp/bb/ash

set -eu

ls -Fl /proc/$$/fd > /tmp/test.fds

sleep 0.05

exit 0

I know about WaitDelay, but that should not be needed because the process and all children exit promptly.

ianlancetaylor commented 2 months ago

Thanks.

Both the read and write side of the stderr pipe are open, and the stdout pipe is nowhere to be found.

I may be misreading, but it seems to me that the pipes in the child are 61465 and 61466, whereas the pipe in the parent is 61446, which is different from both. That is, the parent doesn't seem to have either of the pipes in the child, though it does have a different unrelated pipe.

If you send the hanging process a SIGQUIT, or press the ^\ keys, do you get a stack trace of where the program is hanging? It might help to know exactly which system call is not returning. Thanks.

rgooch commented 2 months ago

Ah, good eye. Here are extracts of the relevant stack trace:

goroutine 1 [chan receive, 1246 minutes]:
os/exec.(*Cmd).awaitGoroutines(0xc0003fc180, 0x0)
    /usr/local/go1.22.6-amd64/src/os/exec/exec.go:957 +0x3b1
os/exec.(*Cmd).Wait(0xc0003fc180)
    /usr/local/go1.22.6-amd64/src/os/exec/exec.go:924 +0x2d0

goroutine 181 [IO wait, 1246 minutes]:
internal/poll.runtime_pollWait(0x7fc1d4e7acc8, 0x72)
    /usr/local/go1.22.6-amd64/src/runtime/netpoll.go:345 +0x85
internal/poll.(*pollDesc).wait(0xc000690140, 0x72, 0x1)
    /usr/local/go1.22.6-amd64/src/internal/poll/fd_poll_runtime.go:84 +0xb1
internal/poll.(*pollDesc).waitRead(...)
    /usr/local/go1.22.6-amd64/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000690120, {0xc00028c000, 0x8000, 0x8000})
    /usr/local/go1.22.6-amd64/src/internal/poll/fd_unix.go:164 +0x466
os.(*File).read(...)
    /usr/local/go1.22.6-amd64/src/os/file_posix.go:29
os.(*File).Read(0xc000398008, {0xc00028c000, 0x8000, 0x8000})
    /usr/local/go1.22.6-amd64/src/os/file.go:118 +0xad
io.copyBuffer({0xcde0a0, 0xc0000fa000}, {0xcde5c0, 0xc00048e010}, {0x0, 0x0, 0x0})
    /usr/local/go1.22.6-amd64/src/io/io.go:429 +0x29b
io.Copy(...)
    /usr/local/go1.22.6-amd64/src/io/io.go:388
os.genericWriteTo(0xc000398008, {0xcde0a0, 0xc0000fa000})
    /usr/local/go1.22.6-amd64/src/os/file.go:269 +0x70
os.(*File).WriteTo(0xc000398008, {0xcde0a0, 0xc0000fa000})
    /usr/local/go1.22.6-amd64/src/os/file.go:247 +0xd5
io.copyBuffer({0xcde0a0, 0xc0000fa000}, {0xcddf00, 0xc000398008}, {0x0, 0x0, 0x0})
    /usr/local/go1.22.6-amd64/src/io/io.go:411 +0xd4
io.Copy(...)
    /usr/local/go1.22.6-amd64/src/io/io.go:388
os/exec.(*Cmd).writerDescriptor.func1()
    /usr/local/go1.22.6-amd64/src/os/exec/exec.go:580 +0x5b
os/exec.(*Cmd).Start.func2(0xc0002640a0)
    /usr/local/go1.22.6-amd64/src/os/exec/exec.go:733 +0x3d
created by os/exec.(*Cmd).Start in goroutine 1
    /usr/local/go1.22.6-amd64/src/os/exec/exec.go:732 +0x11c5

goroutine 182 [IO wait, 1246 minutes]:
internal/poll.runtime_pollWait(0x7fc1d4e7aad8, 0x72)
    /usr/local/go1.22.6-amd64/src/runtime/netpoll.go:345 +0x85
internal/poll.(*pollDesc).wait(0xc000690200, 0x72, 0x1)
    /usr/local/go1.22.6-amd64/src/internal/poll/fd_poll_runtime.go:84 +0xb1
internal/poll.(*pollDesc).waitRead(...)
    /usr/local/go1.22.6-amd64/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0006901e0, {0xc0001aa000, 0x8000, 0x8000})
    /usr/local/go1.22.6-amd64/src/internal/poll/fd_unix.go:164 +0x466
os.(*File).read(...)
    /usr/local/go1.22.6-amd64/src/os/file_posix.go:29
os.(*File).Read(0xc000398030, {0xc0001aa000, 0x8000, 0x8000})
    /usr/local/go1.22.6-amd64/src/os/file.go:118 +0xad
io.copyBuffer({0xcde0a0, 0xc0000fa160}, {0xcde5c0, 0xc000124000}, {0x0, 0x0, 0x0})
    /usr/local/go1.22.6-amd64/src/io/io.go:429 +0x29b
io.Copy(...)
    /usr/local/go1.22.6-amd64/src/io/io.go:388
os.genericWriteTo(0xc000398030, {0xcde0a0, 0xc0000fa160})
    /usr/local/go1.22.6-amd64/src/os/file.go:269 +0x70
os.(*File).WriteTo(0xc000398030, {0xcde0a0, 0xc0000fa160})
    /usr/local/go1.22.6-amd64/src/os/file.go:247 +0xd5
io.copyBuffer({0xcde0a0, 0xc0000fa160}, {0xcddf00, 0xc000398030}, {0x0, 0x0, 0x0})
    /usr/local/go1.22.6-amd64/src/io/io.go:411 +0xd4
io.Copy(...)
    /usr/local/go1.22.6-amd64/src/io/io.go:388
os/exec.(*Cmd).writerDescriptor.func1()
    /usr/local/go1.22.6-amd64/src/os/exec/exec.go:580 +0x5b
os/exec.(*Cmd).Start.func2(0xc0002640e0)
    /usr/local/go1.22.6-amd64/src/os/exec/exec.go:733 +0x3d
created by os/exec.(*Cmd).Start in goroutine 1
    /usr/local/go1.22.6-amd64/src/os/exec/exec.go:732 +0x11c5
ianlancetaylor commented 2 months ago

Thanks. That definitely looks like the parent process is waiting for the pipes to be closed by the child. That is, the kind of thing affected by WaitDelay. I don't have an explanation for why the parent is still waiting, though.