Allow re-running `panic`'ed tests through a runtime configuration flag

ddelnano commented 1 year ago

I'm using gotestsum to test a terraform provider that frequently has a very flaky acceptance test build (context in https://github.com/terra-farm/terraform-provider-xenorchestra/issues/188). The flakiness is due to tests becoming hung and eventually reaching go test's timeout value. Re-running these tests results in a successful build, but requires a significant amount of manual work to identify what should be rerun.

I've tested a fork of gotestsum that allows rerunning panic'ed tests (https://github.com/ddelnano/gotestsum/commit/8c70ff05e83bc93e5e042151a4d25343f1dfe4f8) and it appears to work the way I intend it to -- tests that timeout are re-run and result in a passing test build. I created a dummy test package to test this. The test package includes 3 tests and will allow one of the tests to proceed with execution every 10 seconds. If go test is given a timeout greater than 30s all three tests pass. If the timeout is between 20s and 30s, then 1 of the tests will timeout and panic.

I used this package to test the behavior of my fork and verified that the result is expected.

``` # Verify that gotestsum re-runs the panic'ed tests and reports a successful build (exit code 0) ddelnano@ddelnano-desktop:~/go/src/github.com/ddelnano/terraform-provider-xenorchestra$ ~/code/gotestsum/gotestsum --rerun-fails=2 --packages='github.com/ddelnano/terraform-provider-xenorchestra/cmd/testing/parallel' -- -v -count=1 -timeout=21s ✖ cmd/testing/parallel (21.016s) DONE 3 tests, 1 failure in 21.846s ✓ cmd/testing/parallel (10.031s) === Failed === FAIL: cmd/testing/parallel TestParallelHangging (unknown) DONE 2 runs, 4 tests, 1 failure in 32.883s ddelnano@ddelnano-desktop:~/go/src/github.com/ddelnano/terraform-provider-xenorchestra$ echo $? 0 # Force test to always fail and verify that gotestsum reports a failure ddelnano@ddelnano-desktop:~/go/src/github.com/ddelnano/terraform-provider-xenorchestra$ git diff diff --git a/cmd/testing/parallel/main_test.go b/cmd/testing/parallel/main_test.go index b47d341..1c9cfc9 100644 --- a/cmd/testing/parallel/main_test.go +++ b/cmd/testing/parallel/main_test.go @@ -38,4 +38,5 @@ func TestParallelHangging(t *testing.T) { func TestParallelSlightHangging(t *testing.T) { t.Parallel() waitForChannel(c) + t.Errorf("Failed") } ddelnano@ddelnano-desktop:~/go/src/github.com/ddelnano/terraform-provider-xenorchestra$ ~/code/gotestsum/gotestsum --rerun-fails=2 --packages='github.com/ddelnano/terraform-provider-xenorchestra/cmd/testing/parallel' -- -v -count=1 -timeout=21s || echo $? ✖ cmd/testing/parallel (21.008s) DONE 3 tests, 2 failures in 21.384s ✖ cmd/testing/parallel (10.014s) ✓ cmd/testing/parallel (10.012s) DONE 2 runs, 5 tests, 3 failures in 42.515s ✖ cmd/testing/parallel (10.011s) === Failed === FAIL: cmd/testing/parallel TestParallelSlightHangging (20.01s) main_test.go:41: Failed panic: test timed out after 21s goroutine 19 [running]: testing.(*M).startAlarm.func1() /usr/lib/go/src/testing/testing.go:2036 +0x8e created by time.goFunc /usr/lib/go/src/time/sleep.go:176 +0x32 goroutine 1 [chan receive]: testing.tRunner.func1() /usr/lib/go/src/testing/testing.go:1412 +0x4a5 testing.tRunner(0xc000116680, 0xc000106c78) /usr/lib/go/src/testing/testing.go:1452 +0x144 testing.runTests(0xc0001300a0?, {0x605400, 0x3, 0x3}, {0x4a6be9?, 0x7f497f542fff?, 0x609ca0?}) /usr/lib/go/src/testing/testing.go:1844 +0x456 testing.(*M).Run(0xc0001300a0) /usr/lib/go/src/testing/testing.go:1726 +0x5d9 github.com/ddelnano/terraform-provider-xenorchestra/cmd/testing/parallel.TestMain(0x7f49a6674a68?) /home/ddelnano/go/src/github.com/ddelnano/terraform-provider-xenorchestra/cmd/testing/parallel/main_test.go:21 +0x5e main.main() _testmain.go:53 +0x1d3 goroutine 34 [sleep]: time.Sleep(0x2540be400) /usr/lib/go/src/runtime/time.go:195 +0x135 github.com/ddelnano/terraform-provider-xenorchestra/cmd/testing/parallel.TestMain.func1() /home/ddelnano/go/src/github.com/ddelnano/terraform-provider-xenorchestra/cmd/testing/parallel/main_test.go:16 +0x2c created by github.com/ddelnano/terraform-provider-xenorchestra/cmd/testing/parallel.TestMain /home/ddelnano/go/src/github.com/ddelnano/terraform-provider-xenorchestra/cmd/testing/parallel/main_test.go:14 +0x54 goroutine 36 [chan receive]: github.com/ddelnano/terraform-provider-xenorchestra/cmd/testing/parallel.waitForChannel(...) /home/ddelnano/go/src/github.com/ddelnano/terraform-provider-xenorchestra/cmd/testing/parallel/main_test.go:25 github.com/ddelnano/terraform-provider-xenorchestra/cmd/testing/parallel.TestParallelHangging(0x0?) /home/ddelnano/go/src/github.com/ddelnano/terraform-provider-xenorchestra/cmd/testing/parallel/main_test.go:35 +0x27 testing.tRunner(0xc000116b60, 0x531050) /usr/lib/go/src/testing/testing.go:1446 +0x10b created by testing.(*T).Run /usr/lib/go/src/testing/testing.go:1493 +0x35f === FAIL: cmd/testing/parallel TestParallelHangging (unknown) === FAIL: cmd/testing/parallel TestParallelSlightHangging (re-run 1) (10.01s) main_test.go:41: Failed === FAIL: cmd/testing/parallel TestParallelSlightHangging (re-run 2) (10.01s) main_test.go:41: Failed DONE 3 runs, 6 tests, 4 failures in 53.302s 1 ```

While working on this, I realized that there is a history on handling this situation (https://github.com/gotestyourself/gotestsum/pull/192 and https://github.com/golang/go/issues/45508), so I know it was intentional that these tests aren't rerun but I'd like to make it possible to enable this behavior behind a new cli flag (--rerun-fails-rerun-panics or something similar).

I'm opening this issue to discuss if a PR that implements this would be accepted. If that is the case, I'm happy to take on this work myself.

dnephin commented 1 year ago

Thanks for opening this issue! The reason gotestsum can't re-run tests after a panic is because it gets the list of tests to re-run from the list of tests that failed on the first run. When there is a panic in one test go test stops running tests. There may be one or more tests that never ran, so the list of test failures is not reliable. Re-running the ones that failed up to that point makes it look like it was successful, but some tests may never have run.

Do you have a way to address that problem? I had a quick read over the linked issue and did not see it.

Another approach to re-running tests would be to go test -list, and use that to determine which tests to re-run. Running go test -list can be kind of slow, and is more work to manage the list of tests, so I've avoided that approach so far.

Would it be possible to get the tests themselves to fail instead of hitting the -timeout that causes a panic? If the tests use os/exec.Cmd and context.Context it should be possible to set a timeout for each test.

dnephin commented 1 year ago

Looking at https://github.com/hashicorp/terraform-plugin-sdk/blob/v2.24.1/helper/resource/testing.go#L713-L717, it seems like it would be pretty easy to add a Timeout field to TestCase, and use that to set a deadline on the context.Context. That seems like it should allow for individual tests to timeout as test failures.

ddelnano commented 1 year ago

There may be one or more tests that never ran, so the list of test failures is not reliable. Re-running the ones that failed up to that point makes it look like it was successful, but some tests may never have run.

ah, I see the issue now. From the GitHub issues I shared about, it seemed that supporting this had correctness concerns but it wasn't until your most recent comment that I understood it.

Another approach to re-running tests would be to go test -list, and use that to determine which tests to re-run. Running go test -list can be kind of slow, and is more work to manage the list of tests, so I've avoided that approach so far.

I tried to write my own test runner and was using go test -list as an input (via stdin) to my test running go cmd. I ran into some issues with the development of that and decided it would be quicker to override the gotestsum behavior.

Maybe gotestsum could have an optional setting to collect tests via go test -list instead of the first test run if this panic setting were enabled? When you say "is more work to manage the list of tests", would gotestsum need large changes to support this?

Would it be possible to get the tests themselves to fail instead of hitting the -timeout that causes a panic? If the tests use os/exec.Cmd and context.Context it should be possible to set a timeout for each test.

This seems like a great option to evaluate and was something I hadn't considered before. I'll look into that and see if I can provide that change upstream.

dnephin commented 1 year ago

I think it would be interesting to implement the go test -list approach in gotestsum, since https://github.com/golang/go/issues/45508 does not appear to be going anywhere, and we now have a use case that isn't solve by that change anyway.

I guess it shouldn't be too much work to implement. Maybe something like this:

add a new --rerun-fails-list to enable this new mode. In the future maybe we can make this the default, but initially I think we should make it a separate option in case the list does slow down runs for some use cases.
in cmd/main.go:Run handle the --rerun-fails-list flag like we do watch and version, in the switch/case call a new function for runRerunFromList.
runRerunFromList would need to call opts.Validate then do the go test -list with appropriate args (this is likely the most challenging bit)
Once we have the full list of tests, I guess we need a function similar to rerunFailed, that will mark successful tests as completed, and start a new run for all the failures. This is likely the most work. It's not possible to run all the tests at once, so the re-run needs to group tests by package, and run the tests for each package.

I'm not sure which args can be passed to go test when -list is used. At the very least we need to make sure the -list argument is injected before the list of packages. This mode is also going to be incompatible with --raw-command, because we won't know how to build the go test -list command line in that case. We'll have to make it an error to use --raw-command with --rerun-fails-list.

I likely won't have a lot of time to work on this feature. I'm happy to review an early draft and help get things working, but it could be a while before I'm able to get things started. If you are interested in implementing the feature, maybe you can re-use some of the pieces you built for your own test runner?

nickh-stripe commented 1 year ago

it would be great to have some more controls of this, esp considering 'HasPanic' is driven by a condition on the log output containing panic:, which is a fairly indirect method of detection. In my example the test forks a child process per test, so a child process panicing has no impact on the correctness of the list of tests and shouldn't really effect reruns.. but it does

dnephin commented 1 year ago

I would love to have a more reliable way to detect a panic. I'm definitely not happy with having to look at the string prefix. I attempted to fix this in https://github.com/golang/go/issues/45508 (https://go-review.googlesource.com/c/go/+/309250), but I never got a response to my explanation. I still believe my proposed fix is consistent with go run and other commands in the go toolchain. If go test receives an exit code 2 from any test binary, it should also exit with code 2, which would let us detect the panic here from the exit code.

Maybe if others comment on that thread someone will notice and we can make progress on fixing this.

nickh-stripe commented 1 year ago

yep its a difficult problem to solve completely, requiring upstream support as you've noted. My ask for this is just to have more control of the gotestsum behaviour, a setting to ignore panics and do reruns (because in my case its safe..) seems possible?

dnephin commented 1 year ago

Is it always safe in your case? If your test suite uses os/exec.Cmd to start the processes I assume there is a non-zero amount of code that runs in the regular test goroutine before starting the sub-process. If any of that code panic'ed I think you would have the same problem. Is that how you're achieving the process-per-test?

The code to check panic: is here:

https://github.com/gotestyourself/gotestsum/blob/049fc26eed877666573a323de5917cf3365e0866/testjson/execution.go#L199-L200

It only matches when a line of output starts with panic:, so if you have any control over the output it would be easy enough to prefix the output with a tab or a space to prevent prevent the output from being detected as a panic.

I'm not happy with having to string match on panic: as it is, so I'd much rather find a better mechanism to detect panics than to add a new flag to work around it.

nickh-stripe commented 12 months ago

what about an opt-in option to run -test.list to enumerate the full list of tests before execution? so a slower but more correct impl that ensures you have the full list of tests available. So execution would have two phases rather than currently just one, an initial -test.list to find all the tests, then the as-is behaviour to execute the tests, with the test.list step informing rerun behaviour?

Could have some weird interactions with the go test args passed after -- 🤔

dnephin commented 12 months ago

I'd be happy to review a -test.list implementation, but I'm still not convinced it would be more correct.

For one, go test -list only prints root test names, so there's no way it could implement the behaviour that exists today that allows re-running sub-tests. Also, as you mentioned, getting the right args and build flags to ensure the correct test list is not easy.

A fix in the stdlib I think would be the most reliable, and prefixing the output from a subprocess seems like it would be an easier fix.

I found a bunch of packages that can do that. Assuming you're using os/exec.Cmd to run the subprocess any of these io.Writer can be attached to the cmd.Stdout and should work to hide the panic from gotestsum. Or you could use one of their implementations for inspiration and write your own version. A couple of these implementations are only ~50 lines.

I don't know much about your setup, or how you're accomplishing the "run tests in subprocesses", but if you want to share more about that maybe we can find other solutions that are easier than re-implementing all the rerun logic.

gotestyourself / gotestsum

Allow re-running `panic`'ed tests through a runtime configuration flag #290