findleyr opened 1 year ago

We recently started waiting for all go command invocations when shutting down gopls regtests. It appears that sometimes we kill the go command and still don't get a result from `cmd.Wait()`. For example, here: https://build.golang.org/log/00046e0b005c7660d676a3a415561950048f756a

In that failure, the test runner looks otherwise healthy (other tests ran fast), and yet the goroutine stack clearly shows a go command hanging for 9 minutes here: https://cs.opensource.google/go/x/tools/+/master:internal/gocommand/invoke.go;l=260;drc=f38573358cbedf46d64c5759ef41b72afcf0c5c0

@bcmills do you happen to have any idea of what might cause this?
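For orientation, here is a minimal self-contained sketch of what such kill-then-wait instrumentation looks like. This is an assumption based on the linked invoke.go, not gopls's actual code; the panic message and the 10-second grace period are taken from later in this thread.

```go
package main

import (
	"context"
	"log"
	"os/exec"
	"time"
)

// runCmdContext is a simplified sketch of the instrumentation under
// discussion: kill the child when the context is cancelled, then panic
// if Wait still hasn't returned after a grace period.
func runCmdContext(ctx context.Context, cmd *exec.Cmd) error {
	if err := cmd.Start(); err != nil {
		return err
	}
	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		if err := cmd.Process.Kill(); err != nil {
			log.Printf("error killing the Go command: %v", err)
		}
		// After a successful kill, Wait should return promptly; if it
		// does not, something still holds the child's I/O pipes open.
		select {
		case err := <-done:
			return err
		case <-time.After(10 * time.Second):
			panic("DETECTED A HANGING GO COMMAND")
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	log.Println(runCmdContext(ctx, exec.Command("go", "list", "./...")))
}
```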
Two observations:

> `Kill`. If killing failed, I don't think we can expect `Wait` to return.

The only reason the kill system call can fail (at least in this situation) is when the child process has already exited, so failure of kill is unlikely to be the culprit. More likely, kill terminated the go process itself, but not the tree of processes rooted at it. If one of them (a test?) retains an open file descriptor to the stdout pipe created by os/exec, then the `cmd.Run` operation will hang indefinitely.
To dig further, we could add logic that runs during the failure (on Linux) and executes `ps -e f` and `ls -l /proc/*/fd/*` to show the process tree and its open files.
Please do check the error from `Kill`; that would at least give a useful data point for diagnostics. (This may be closely related to #53400 and #50187.)
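To illustrate the process-tree observation, here is a minimal sketch (not from the thread; Unix-only, with `sh` and `sleep` standing in for the go command and its subprocesses) of how a grandchild that inherits the stdout pipe keeps `cmd.Output`/`cmd.Wait` from returning even after the direct child has exited:

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	// The shell exits almost immediately, but the background sleep
	// inherits the stdout pipe created by os/exec and keeps its write
	// end open, so Wait cannot observe EOF until sleep exits too.
	cmd := exec.Command("sh", "-c", "sleep 60 & echo started")

	done := make(chan error, 1)
	go func() {
		out, err := cmd.Output() // Start + drain stdout + Wait
		fmt.Printf("output: %q\n", out)
		done <- err
	}()

	select {
	case err := <-done:
		fmt.Println("returned:", err)
	case <-time.After(2 * time.Second):
		fmt.Println("still blocked: a grandchild holds the stdout pipe open")
	}
}
```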
greplogs -l -e 'panic: test timed out(?:.*\n)*goroutine \d+ \[.*, \d+ minutes\]:\n.*runCmdContext' --since=2022-08-15 --details
(66 matching logs)
Interesting, I was debugging this in https://go.dev/cl/424075. On Windows, our call to `Process.Kill()` is failing with "invalid argument": https://storage.googleapis.com/go-build-log/e2f60200/windows-386-2008_71ad7007.log
A bit of googling suggests that this is because we can't kill subprocesses on Windows.
@bcmills any advice for how to properly kill the go command on Windows?
After reading the source a bit more: this is EINVAL, which appears to mean that `Process.wait()` has completed and the handle has been released, so this is a race, although it is surprising that we hit it so reliably.
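A minimal sketch of the suspected race (an assumed reproduction, not the CL's code): `Wait` reaps the process, releasing its handle on Windows, before `Kill` gets a chance to run.

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

func main() {
	cmd := exec.Command("go", "version")
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}

	// Wait runs concurrently and may reap the process (and, on
	// Windows, release its handle) before the Kill below runs.
	go func() { _ = cmd.Wait() }()

	time.Sleep(time.Second) // the child is long gone by now
	if err := cmd.Process.Kill(); err != nil {
		// Depending on timing and platform this surfaces as
		// "os: process already finished" or, as in the log above,
		// EINVAL ("invalid argument").
		log.Printf("kill failed: %v", err)
	}
}
```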
Change https://go.dev/cl/424075 mentions this issue: internal/gocommand: add instrumentation for hanging go commands
> any advice for how to properly kill the go command on Windows?

Can't be done without creating a whole extra process group, unfortunately. (Probably we should add a side channel, perhaps an open file descriptor or a pidfile, to request clean shutdown on Windows.)
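For contrast, on Unix the whole process tree can be signalled via a process group. A minimal sketch (POSIX-only; this is exactly the mechanism that isn't available on Windows, hence the side-channel suggestion above):

```go
package main

import (
	"os/exec"
	"syscall"
	"time"
)

func main() {
	cmd := exec.Command("sh", "-c", "sleep 60 & wait")
	// Start the child in its own process group so that its
	// descendants can be signalled together with it.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		panic(err)
	}

	time.Sleep(time.Second)
	// Signalling the negative pid targets the whole process group, so
	// grandchildren die too and the I/O pipes actually get closed.
	_ = syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
	_ = cmd.Wait()
}
```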
Change https://go.dev/cl/431075 mentions this issue: internal/gocommand: tweak debugging for hanging go commands
Nice. Well, that test process seems very much alive, falsifying my hypothesis.
13032 12910 (compile)
9802 13032 go list -modfile=/tmp/workdir/tmp/go.8f5d4cc245da9790bda529e64ae3e7
Looks like the hanging go command is in the middle of a compile. I wish we had the full subprocess command line; I'll look into that.
Not sure how to interpret the fstat output.
That's a dead cmd/compile process: there's no command because argv has been destroyed along with the rest of the address space. Perhaps the go list parent simply hasn't called waitpid yet, so the process table entry has to be retained. I suspect the problem is in go list.
Aha, thanks (excuse my ps noobness).
Note that we instrumented this panic in two places: once before `Kill()` and once after. This is the "before" one, so I don't think it falsifies your hypothesis, if I'm understanding correctly.
That one is netbsd, which could plausibly be explained as #50138. (I'd rather we wait for one of these on a platform without a known issue for `Wait` before we do too much digging.)
greplogs --dashboard -md -l -e 'HANGING GO' --since=2022-09-15
2022-09-17T02:56:51-4d18923-cc1b20e/netbsd-amd64-9_0
2022-09-16T17:59:31-a61f20e-a84f46a/netbsd-amd64-9_0
2022-09-16T16:33:11-a61f20e-8451529/netbsd-386-9_0
2022-09-16T14:49:13-a61f20e-b35b4a3/netbsd-amd64-9_0
Still only netbsd. Posting the greplogs query here for future reference.
Ooh, nice! https://go.dev/issue/55323#issuecomment-1254107802 has a darwin/arm64 hang.
As expected, `cmd/go` itself has already exited at the point of the hang:
2022/09/20 08:42:37 error killing the Go command: os: process already finished
But there are an awful lot of processes in that dump, and nothing that obviously stands out as a subprocess of `cmd/go` that would have access to stdout or stderr.

Here's an off-the-wall theory: maybe the file descriptors actually are closed, and something in the parent Go process (i.e. gopls) isn't waking up from the end-of-stream event?
To check that hypothesis, two suggestions:

1. gopls should log exactly which PID was the stuck go command, just in case it left a remnant in the `ps` or `lsof` output that we can track down.
2. gopls should invoke `debug.SetTraceback("system")` before the terminal panic here.

A simpler explanation is that gopls sent SIGINT, waited for 1s, then sent SIGKILL, and the child (go) exited during the 1s grace period, causing kill(2) to fail with ESRCH (mapped to `ErrProcessDone`, "process already finished"). That seems to be expected if a busy machine causes a delay in responding to SIGINT. Perhaps the debugging logic should treat `ErrProcessDone` as success, not a reason to debug?
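A sketch of that suggested handling (a hypothetical helper; the SIGINT-then-SIGKILL sequence and the 1-second grace period mirror the description above):

```go
package main

import (
	"errors"
	"log"
	"os"
	"os/exec"
	"time"
)

// killGracefully interrupts the child, allows a grace period, and then
// kills it, treating ErrProcessDone as a normal outcome rather than a
// reason to debug.
func killGracefully(cmd *exec.Cmd) {
	_ = cmd.Process.Signal(os.Interrupt)
	time.Sleep(1 * time.Second) // grace period

	if err := cmd.Process.Kill(); err != nil && !errors.Is(err, os.ErrProcessDone) {
		// Only unexpected kill failures are worth reporting; the
		// child exiting during the grace period is not.
		log.Printf("unexpected error killing process: %v", err)
	}
}

func main() {
	cmd := exec.Command("sleep", "10")
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	killGracefully(cmd)
	_ = cmd.Wait()
}
```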
@adonovan but in this case the panic indicates that we waited up to 10 seconds after that kill. The error from `Kill` is merely logged. Per @bcmills's comment on the CL, we should expect that `cmd.Wait` returns quickly once the process is done.

> the panic indicates that we waited up to 10 seconds after that kill. The error from Kill is merely logged.
Good point.
So the sequence of events that we know must have happened from the log is: the child process terminated, `Kill` returned `ErrProcessDone`, and yet `Wait` still did not return within the following 10 seconds.

`ErrProcessDone` means that either the kill system call returned ESRCH, or this was the second call to `Process.signal` after a first one set the 'done' event at the end of a successful call to `Process.wait`. In either case, the process must already have terminated.
Is it possible that the dump shows a different go process? Seems improbable, but I feel like we've eliminated the impossible. We should log the pid of cmd.Process, like @bcmills suggested.
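A sketch of that pid logging (hypothetical code; the actual change is in the CL mentioned below): record the child's pid at start so a later hang can be correlated with `ps`/`fstat` output.

```go
package main

import (
	"log"
	"os/exec"
)

func main() {
	cmd := exec.Command("go", "list", "./...")
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	// Logging the pid up front lets a later "hanging go command" dump
	// be matched against the process table.
	log.Printf("started go command (pid %d): %v", cmd.Process.Pid, cmd.Args)
	if err := cmd.Wait(); err != nil {
		log.Printf("go command failed: %v", err)
	}
}
```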
Found new dashboard test flakes for:
#!watchflakes
post <- pkg ~ `^golang.org/x/tools/gopls` &&
`DETECTED A HANGING GO COMMAND`
- watchflakes
Change https://go.dev/cl/434637 mentions this issue: internal/gocommand: show pid of process
> Is it possible that the dump shows a different go process? Seems improbable, but I feel like we've eliminated the impossible. We should log the pid of cmd.Process, like @bcmills suggested.
The pid in the last log above clearly shows that the go list process is the child of the test, not some other process in the container. Huh...
We still haven't seen a single log with both (a) the "error killing the Go command" message and (b) a pid showing that 'go' is the child of the test. So it's possible that the log with just (a) was merely a very slow (netbsd) machine and the log with just (b) was showing a different go process in the ps output. If both were to occur in the same log it would indicate a bug in the system call.
> merely a very slow (netbsd) machine
The NetBSD machines aren't particularly slow; they're buggy (#50138 up until yesterday, and #56180 after that).
Found new dashboard test flakes for:
#!watchflakes
post <- pkg ~ `^golang.org/x/tools/gopls` &&
`DETECTED A HANGING GO COMMAND`
- watchflakes
From the last log, we're seeing that `go list` (pid 11323) is slow to complete (more than 1 minute) on the netbsd builder. We don't have the complete go list command, but we can tell from the output of ps that it is definitely running the compiler, which is surprising, as that should only be needed in -export mode, which gopls has no need for. Notice the three "(compile)" child processes in ps, plus all of gopls's in/out/err pipes in fstat. The compile processes have all terminated (ps cannot display their argv arrays because the address spaces have been destroyed), yet the go list process has not closed its pipes.
ps axo ppid,pid,command:
-------------------------
PPID PID COMMAND
...
314 6958 /tmp/workdir/go/bin/go test -short ./...
6958 10212 /tmp/workdir/tmp/go-build957980754/b501/completion.test -test.testl
11323 10917 (compile)
10212 11323 go list -modfile=/tmp/workdir/tmp/go.e8b79c677cbf58581b70654d24944f [truncated]
11323 11598 (compile)
11323 11901 (compile)
...
fstat:
-----
USER CMD PID FD MOUNT INUM MODE SZ|DV R/W
...
root go 11323 wd / 1622282 drwxr-xr-x 512 r
root go 11323 0 / 1713612 crw-rw-rw- null r
root go 11323 1* pipe 0xffff80320e6911a0 -> 0xffff803233ae22c0 w
root go 11323 2* pipe 0xffff8031e6711d98 -> 0xffff803233ae2ea0 w
root go 11323 3 / 1400146 -rw-r--r-- 86547 r
root go 11323 4 / 1442046 -rw-r--r-- 1115 r
root go 11323 5 / 1448217 -rw-r--r-- 186 r
root go 11323 6 / 1442053 -rw-r--r-- 3407 r
root go 11323 7 / 1442035 -rw-r--r-- 195 r
root go 11323 8 / 1442034 -rw-r--r-- 5573 r
root go 11323 9 / 1442142 -rw-r--r-- 3069 r
root go 11323 10 / 1442405 -rw-r--r-- 133 r
root go 11323 11 / 1442062 -rw-r--r-- 2864 r
root go 11323 12 / 1442418 -rw-r--r-- 1972 r
root go 11323 13 / 1442427 -rw-r--r-- 3924 r
root go 11323 14 / 1348570 -rw-r--r-- 393 r
root go 11323 15 / 1442421 -rw-r--r-- 1036 r
root go 11323 16* pipe 0xffff8031a85218d0 <- 0x0 rn
root go 11323 17* kqueue pending 0
root go 11323 18* pipe 0xffff803253411638 <- 0xffff80325f6a22a8 rn
root go 11323 19* pipe 0xffff80325f6a22a8 -> 0xffff803253411638 wn
root go 11323 20* pipe 0xffff8032539119d0 <- 0x0 rn
root go 11323 21* pipe 0xffff8030f7b30c70 <- 0x0 rn
root go 11323 22* pipe 0xffff803283de7e80 <- 0x0 rn
root go 11323 23* pipe 0xffff8031d28892d8 <- 0x0 rn
root go 11323 24 / 1399130 -rw-r--r-- 329 r
root go 11323 27* pipe 0xffff8030f7b30b40 <- 0x0 rn
root go 11323 28* pipe 0xffff80319abc1c68 <- 0x0 rn
root go 11323 29* pipe 0xffff8030db831e80 <- 0x0 rn
root go 11323 34* pipe 0xffff803233637c48 <- 0x0 rn
root go 11323 36 / 1713612 crw-rw-rw- null r
root go 11323 37* pipe 0xffff8030f7b30090 <- 0xffff80304c451af0 rn
root go 11323 38* pipe 0xffff80304c451af0 -> 0xffff8030f7b30090 wn
root go 11323 39* pipe 0xffff8031a8521670 <- 0xffff80304c451170 rn
root go 11323 40* pipe 0xffff80304c451170 -> 0xffff8031a8521670 wn
...
The NetBSD hangs at this point are almost certainly a manifestation of #56180.
One of the compile subprocesses could plausibly be just `go tool compile -V=full`, used to obtain the cache ID for the compiler:
https://cs.opensource.google/go/go/+/master:src/cmd/go/internal/work/buildid.go;l=162;drc=85b02de083b5e337bcb581005d0efca883fdb3d9

Unfortunately, without a more complete command line (or a SIGQUIT dump) I don't have a theory for what the other two compile processes might be.
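For reference, a minimal sketch of obtaining that version string the same way (`go tool compile -V=full` is the command named above; the surrounding program is illustrative):

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// cmd/go runs the compiler with -V=full and hashes the output to
	// key its build cache; this just prints that version string.
	out, err := exec.Command("go", "tool", "compile", "-V=full").Output()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s", out)
}
```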
Found new dashboard test flakes for:
#!watchflakes
post <- pkg ~ `^golang.org/x/tools/gopls` &&
`DETECTED A HANGING GO COMMAND`
- watchflakes
Yes, that does look like the explanation. (Nice work tracking that one down, BTW.)
I'll limit my attention to non-netbsd platforms going forward.
Found new dashboard test flakes for:
#!watchflakes
post <- pkg ~ `^golang.org/x/tools/gopls` &&
`DETECTED A HANGING GO COMMAND`
- watchflakes