golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
122.7k stars 17.5k forks source link

testing: writeProfiles is not called after panic #65129

Open felixge opened 7 months ago

felixge commented 7 months ago

Go version

go1.21.4

Output of go env in your module/workspace:

GO111MODULE='auto'
GOARCH='arm64'
GOBIN=''
GOCACHE='/Users/user/Library/Caches/go-build'
GOENV='/Users/user/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS='-count=1'
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/felixge/go/pkg/mod'
GONOPROXY='github.com/Org'
GONOSUMDB='github.com/Org,go.orgbuild.io'
GOOS='darwin'
GOPATH='/Users/user/go'
GOPRIVATE='github.com/Org'
GOPROXY='binaries.orgbuild.io,proxy.golang.org,direct'
GOROOT='/opt/homebrew/Cellar/go/1.21.4/libexec'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/opt/homebrew/Cellar/go/1.21.4/libexec/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.21.4'
GCCGO='gccgo'
AR='ar'
CC='clang'
CXX='c++'
CGO_ENABLED='1'
GOMOD='/Users/user/go/src/github.com/felixge/dump/trace-panic/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/0t/nwrdnfnd2mjdgr0z_p_f6xww0000gq/T/go-build2519625148=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

Debug a test that hit a panic using go test -trace. Below is a greatly simplified example that reproduces the problem. Run it with go test -trace go.trace.

package main

import (
    "context"
    "fmt"
    "runtime/trace"
    "testing"
)

func TestTracePanic(t *testing.T) {
    // This simulates some trace buffers getting flushed before the test panics.
    for i := 0; i < 100000; i++ {
        trace.Log(context.Background(), "log", fmt.Sprintf("%d", i))
    }
    panic("panic")
}

What did you see happen?

$ go tool trace go.trace 
2024/01/17 10:56:02 Parsing trace...
failed to parse trace: no EvFrequency event

What did you expect to see?

A valid trace file that I can open.

Additional Thoughts

This reproduces with tip.

The problem seems to be that the after() func that is supposed to call writeProfiles doesn't get called when a test panics because tRunner() runs in a different goroutine.

gopherbot commented 7 months ago

Change https://go.dev/cl/556255 mentions this issue: testing: call writeProfile after panic

felixge commented 7 months ago

cc @prattmic

cagedmantis commented 7 months ago

@bcmills @golang/runtime

bcmills commented 7 months ago

I suspect that this could be fixed by calling m.after within tRunner when it is exiting by panic, more-or-less here: https://cs.opensource.google/go/go/+/master:src/testing/testing.go;l=1601-1606;drc=97daa6e94296980b4aa2dac93a938a5edd95ce93

But then the question is: how is tRunner supposed to get the M? It is normally at a higher level than a T or even a common. I suppose we could stick it in a func() field in one of those structs, but the need for that suggests a deeper dependency-inversion problem that we ought to think about how to solve.

felixge commented 7 months ago

@bcmills i implemented your suggestion (not at exactly at the same line, but that was a mistake done in haste) and it seems to work well. PTAL and let me know if you think the func() field is good enough or if the dependency inversion problem has you concerned.

felixge commented 6 months ago

I just noticed that the fix for #65319 has fixed the problem reported in this issue for execution traces . In fact, it even fixed the problem for the case where an arbitrary goroutine panics. My approach in CL 556255 would have only worked for panics in the goroutine calling the TestXXX function.

This raises the question if CL 556255 should proceed as is. I see three options:

  1. Modify the CL to take a similar approach where the runtime flushes the profiles on panics.
  2. Abandon it.
  3. Continue it as-is.

I don't think option 1 is workable. At the moment of a panic, the runtime is only aware of the fact that the CPU profile could be flushed. It has no idea that the testing package would also like to take snapshots of the alloc/heap, mutex, block and goroutine profiles.

Option 2 is worth considering. Execution traces are by far the most valuable data source you'd like to have when a Go program crashes. And arguably they also contain almost everything one would hope to find in a CPU, block, mutex or goroutine profile. However, they don't have any information about memory beyond heap size and heap target. So IMO there is still value in getting the alloc/heap profile written out when a test crashes.

Based on the above, I will go with option 3. However, I would not be heart broken if we decide to go with option 2 in order to keep the testing package simpler and b/c my CL doesn't work for arbitrary goroutines panics.

bcmills commented 6 months ago

In the event of an uncontrolled crash, I would expect that the most interesting case for a heap profile would be an OOM, but in case of an OOM failure I think the runtime throws instead of panicking.

Similarly, a mutex/blocking profile would be most useful if the test times out, but that doesn't result in a panic in the test either.

So I think in the interest of avoiding unnecessary complexity in the testing package (which it already has more than enough of!), I would prefer to abandon the CL and close this as a duplicate of #65319.

felixge commented 6 months ago

In the event of an uncontrolled crash, I would expect that the most interesting case for a heap profile would be an OOM, but in case of an OOM failure I think the runtime throws instead of panicking.

Linux uses the KILL signal to implement OOM killing. It gives the process no chance to dump any kind of diagnostics data. It's not even possible to get a core dump. It's a sad state of affairs 😞.

Similarly, a mutex/blocking profile would be most useful if the test times out, but that doesn't result in a panic in the test either.

Funny you mention this. The original problem that caused me to file this issue was a test that was timing out. This test didn't properly clean up after itself ... causing another test in the package to panic further down the road. This was all happening in CI and not locally reproducible. Hence the need for execution traces and other diagnostics data from the runtime.

So I think in the interest of avoiding unnecessary complexity in the testing package (which it already has more than enough of!), I would prefer to abandon the CL and close this as a duplicate of #65319.

My original issue was solved using execution tracing. We modified CI to only run the test that was timing out, which allowed us to capture a trace. This workaround is no longer needed, thanks to #65319. The remaining problems here are hypothetical for me, so I don't feel strongly enough about them to lobby you into accepting more complexity in the testing package.

That being said, if we close this, let's make it clear that this is only a partial duplicate of #65319. Writing out cpu/heap/mutex/goroutine profiles on test panics is still unique to this issue and doesn't overlap with #65319.