Open felixge opened 7 months ago
Change https://go.dev/cl/556255 mentions this issue: testing: call writeProfile after panic
cc @prattmic
@bcmills @golang/runtime
I suspect that this could be fixed by calling m.after
within tRunner
when it is exiting by panic, more-or-less here:
https://cs.opensource.google/go/go/+/master:src/testing/testing.go;l=1601-1606;drc=97daa6e94296980b4aa2dac93a938a5edd95ce93
But then the question is: how is tRunner
supposed to get the M
? It is normally at a higher level than a T
or even a common
. I suppose we could stick it in a func()
field in one of those structs, but the need for that suggests a deeper dependency-inversion problem that we ought to think about how to solve.
@bcmills i implemented your suggestion (not at exactly at the same line, but that was a mistake done in haste) and it seems to work well. PTAL and let me know if you think the func()
field is good enough or if the dependency inversion problem has you concerned.
I just noticed that the fix for #65319 has fixed the problem reported in this issue for execution traces . In fact, it even fixed the problem for the case where an arbitrary goroutine panics. My approach in CL 556255 would have only worked for panics in the goroutine calling the TestXXX
function.
This raises the question if CL 556255 should proceed as is. I see three options:
I don't think option 1 is workable. At the moment of a panic, the runtime is only aware of the fact that the CPU profile could be flushed. It has no idea that the testing package would also like to take snapshots of the alloc/heap, mutex, block and goroutine profiles.
Option 2 is worth considering. Execution traces are by far the most valuable data source you'd like to have when a Go program crashes. And arguably they also contain almost everything one would hope to find in a CPU, block, mutex or goroutine profile. However, they don't have any information about memory beyond heap size and heap target. So IMO there is still value in getting the alloc/heap profile written out when a test crashes.
Based on the above, I will go with option 3. However, I would not be heart broken if we decide to go with option 2 in order to keep the testing package simpler and b/c my CL doesn't work for arbitrary goroutines panics.
In the event of an uncontrolled crash, I would expect that the most interesting case for a heap profile would be an OOM, but in case of an OOM failure I think the runtime throws instead of panicking.
Similarly, a mutex/blocking profile would be most useful if the test times out, but that doesn't result in a panic in the test either.
So I think in the interest of avoiding unnecessary complexity in the testing
package (which it already has more than enough of!), I would prefer to abandon the CL and close this as a duplicate of #65319.
In the event of an uncontrolled crash, I would expect that the most interesting case for a heap profile would be an OOM, but in case of an OOM failure I think the runtime throws instead of panicking.
Linux uses the KILL signal to implement OOM killing. It gives the process no chance to dump any kind of diagnostics data. It's not even possible to get a core dump. It's a sad state of affairs 😞.
Similarly, a mutex/blocking profile would be most useful if the test times out, but that doesn't result in a panic in the test either.
Funny you mention this. The original problem that caused me to file this issue was a test that was timing out. This test didn't properly clean up after itself ... causing another test in the package to panic further down the road. This was all happening in CI and not locally reproducible. Hence the need for execution traces and other diagnostics data from the runtime.
So I think in the interest of avoiding unnecessary complexity in the
testing
package (which it already has more than enough of!), I would prefer to abandon the CL and close this as a duplicate of #65319.
My original issue was solved using execution tracing. We modified CI to only run the test that was timing out, which allowed us to capture a trace. This workaround is no longer needed, thanks to #65319. The remaining problems here are hypothetical for me, so I don't feel strongly enough about them to lobby you into accepting more complexity in the testing
package.
That being said, if we close this, let's make it clear that this is only a partial duplicate of #65319. Writing out cpu/heap/mutex/goroutine profiles on test panics is still unique to this issue and doesn't overlap with #65319.
Go version
go1.21.4
Output of
go env
in your module/workspace:What did you do?
Debug a test that hit a panic using
go test -trace
. Below is a greatly simplified example that reproduces the problem. Run it withgo test -trace go.trace
.What did you see happen?
What did you expect to see?
A valid trace file that I can open.
Additional Thoughts
This reproduces with tip.
The problem seems to be that the
after()
func that is supposed to callwriteProfiles
doesn't get called when a test panics becausetRunner()
runs in a different goroutine.