gotestyourself / gotestsum

'go test' runner with output optimized for humans, JUnit XML for CI integration, and a summary of the test results.
Apache License 2.0

Support running tests using RR #365

Open firelizzard18 opened 1 year ago

firelizzard18 commented 1 year ago

It would be amazing if gotestsum supported running tests using rr. I've hacked something together to use with `--raw-command`, but it's ugly. I am imagining something like `gotestsum --engine rr` (with the default being `--engine go`).

  1. Update gotestsum to process packages separately
    • Detect if the caller passed a specific package, or something ending in `/...`
    • In the latter case, turn that into a list of packages
    • I used `go list -f '{{if .TestGoFiles}}{{.ImportPath}}{{end}}' $1` to list only the packages containing test files
  2. Run the test with rr
    • Add `-exec 'rr record -o $TRACE'` to the go test invocation
    • This will tell go to execute the test binary using rr
    • The trace location should be derived from the package path
    • Overriding the output location makes it far easier to determine which trace belongs to which package
    • Otherwise the trace for `pkg/foo/bar` will be written to `~/.local/share/rr/bar.test-0`
firelizzard18 commented 1 year ago

I updated the description to reflect the fact that go test's `-exec` flag makes this much easier than the kludge I had put together. This is what I'm using now:

go-test-rr.go:

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

// fatalf prints an error to stderr and exits.
func fatalf(format string, args ...interface{}) {
	fmt.Fprintf(os.Stderr, "Error: "+format+"\n", args...)
	os.Exit(1)
}

func check(err error) {
	if err != nil {
		fatalf("%v", err)
	}
}

func main() {
	// Make the trace directory
	check(os.MkdirAll("trace", 0700))

	// List all packages that contain test files
	cmd := exec.Command("go", "list", "-f", "{{if .TestGoFiles}}{{.ImportPath}}{{end}}", "./...")
	out, err := cmd.CombinedOutput()
	check(err)

	// Test each one
	for _, pkg := range strings.Split(string(out), "\n") {
		pkg = strings.TrimSpace(pkg)
		if pkg == "" {
			continue // skip the blank line after the last package
		}
		pkg = "." + strings.TrimPrefix(pkg, "gitlab.com/accumulatenetwork/accumulate")
		run(pkg, os.Args[1:])
	}
}

func run(pkg string, args []string) {
	// Clear the trace directory
	cwd, err := os.Getwd()
	check(err)
	dir := filepath.Join(cwd, "trace", pkg)
	check(os.RemoveAll(dir))
	check(os.MkdirAll(filepath.Dir(dir), 0700))

	// Run the tests under rr, with optimizations disabled so the
	// recording is debuggable
	cmd := exec.Command("go", append([]string{"test", "-json", "-gcflags", "all=-N -l", "-exec", "rr record -o " + dir, pkg}, args...)...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	check(cmd.Run())
}
```
dnephin commented 1 year ago

Thank you for suggesting this feature! rr looks very useful. The gotestsum `--watch` mode (docs) has a key shortcut for rerunning the failed package and dropping into delve. Integrating that with rr seems like it would be easy, because watch mode already runs a single package at a time. `d` could drop into either rr or delve, but that is missing one piece: recording the original execution before dropping into `rr replay`.

How would you generally use this? Would you run the entire test suite of a project then use rr to replay any failures?

Would you use it in CI and save the trace files as artifacts? I see under limitations that it emulates a single core, so I imagine if you did run it in CI you'd want to do that as a completely separate job and not the primary CI test run. Any interesting CI failures could probably be reproduced locally by rerunning that package individually a large number of times. That might be easier than duplicating the test runs, and saving many trace artifacts, for the rare chance you need the debugger.

Maybe this could be integrated with a `--record` flag. In that mode gotestsum would list the packages to run, run each individually with `-exec rr record ...`, and save the trace files. When used with `--watch`, that could indicate that `d` should use `rr replay` instead of `dlv`.

In both cases, saving the file under the project directory sounds great. We could include a date and time in the filename instead of deleting it each time, though that would produce a lot of files. Is there any reason to keep record files for successful runs? If we only keep the traces for packages with test failures, that might mitigate the problem of too many files. If someone were trying to track down flaky behaviour, they could run the tests in a loop many times and end up with only a few files, for the failed runs. Maybe `--record=all` could be used to keep the successful runs in the rare case they are needed, and the default value for `--record` would be `failures`.

How would that work for your usage of rr?

firelizzard18 commented 1 year ago

> How would you generally use this? Would you run the entire test suite of a project then use rr to replay any failures?

My motivation is flaky tests that rarely fail when run on a developer's computer but sporadically fail in CI. I've had these kinds of failures show up periodically, and they're a huge pain to debug because they're often almost impossible to reproduce reliably. When one of these tests fails, I want to download the trace and step through it to debug the failure.

> I see under limitations that it emulates a single core, so I imagine if you did run it in CI you'd want to do that as a completely separate job and not the primary CI test run.

The way that limitation is phrased is a bit misleading, I think. I had also assumed that rr would be useless for reproducing concurrency issues.

A better way of putting it is that only one instruction executes at any given time, but the code can still be concurrent. It is true that I've had trouble reproducing concurrency issues when running in normal mode. However, rr also has 'chaos mode', which is very helpful for reproducing races.

> Is there any reason to keep record files for successful runs?

I only have a use for records of failed runs. I could invent a reason for keeping the record of a successful run, but I think in the end that would essentially mean a test was succeeding when it really should be failing.

What you suggest would work perfectly for me, especially with a flag for chaos mode (or just a pass-through `-rrflags`). It would be particularly helpful if I could selectively enable chaos mode for specific packages (via a config file?). I'm working on distributed systems and I think my simulator has numerous concurrency bugs, so being able to selectively enable chaos mode would let me work on those bugs incrementally.
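Per-package chaos mode could be selected along these lines; the config shape and package names are hypothetical, while `--chaos` is rr's own flag for chaos-mode recording:

```go
package main

import (
	"fmt"
	"strings"
)

// rrCommand picks the rr invocation for one package: packages listed
// in the (hypothetical) config, or nested under a listed prefix, are
// recorded with rr's chaos mode to shake out races.
func rrCommand(pkg string, chaosPkgs []string) string {
	for _, p := range chaosPkgs {
		if pkg == p || strings.HasPrefix(pkg, p+"/") {
			return "rr record --chaos -o "
		}
	}
	return "rr record -o "
}

func main() {
	// Hypothetical config: enable chaos mode only for the simulator.
	cfg := []string{"example.com/project/test/simulator"}
	fmt.Println(rrCommand("example.com/project/test/simulator/consensus", cfg))
	fmt.Println(rrCommand("example.com/project/pkg/api", cfg))
}
```

The selected string would then be combined with the per-package trace directory and passed to `go test -exec`, as in the earlier script.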