golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
124.34k stars 17.7k forks source link

testing: shuffle seed should be different when -shuffle=on and -count flag is set #61256

Open cristaloleg opened 1 year ago

cristaloleg commented 1 year ago

This proposal is a clarification of the original test shuffle flag proposal https://github.com/golang/go/issues/28592

Today I was debugging tests that were rarely failing due to cross-dependency between test functions. To verify the fix I noticed that go test -shuffle=on -count=100 runs tests in the same order as the first iteration. In other words -test.shuffle value is generated once and is reused for the other 99 test runs.

This makes sense when the shuffle seed is set explicitly (-shuffle=X ) but for random value (-shuffle=on) we should generate a new one to increase test order randomness to cover more combinations.

Current documentation for count and shuffle flags doesn't mention their relation, so the proposal (if accepted) will not break go test behaviour (docs from current master branch):

//  -count n
//      Run each test, benchmark, and fuzz seed n times (default 1).
//      If -cpu is set, run n times for each GOMAXPROCS value.
//      Examples are always run once. -count does not apply to
//      fuzz tests matched by -fuzz.
//

...

//
//  -shuffle off,on,N
//      Randomize the execution order of tests and benchmarks.
//      It is off by default. If -shuffle is set to on, then it will seed
//      the randomizer using the system clock. If -shuffle is set to an
//      integer N, then N will be used as the seed value. In both cases,
//      the seed will be reported for reproducibility.
//

Thank you.

cristaloleg commented 1 year ago

@rsc forgive me for a ping but can this be included into review meetings?

rsc commented 1 year ago

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group

rsc commented 1 year ago

There is also a strange detail that -count=3 with tests A,B,C runs A,B,C,A,B,C,A,B,C, but with benchmarks X,Y,Z you get X,X,X,Y,Y,Y,Z,Z,Z. I wonder why we do that, and whether we should run the tests A,A,A,B,B,B,C,C,C too.

I agree that if we keep the tests A,B,C,A,B,C,A,B,C then we should at least shuffle the rounds independently.

But maybe for -shuffle=on -count=N we should make the full list of N*M things that will run and shuffle that entire list, so there is no longer a "round" boundary anywhere.

cristaloleg commented 1 year ago

tests A,B,C runs A,B,C,A,B,C,A,B,C, but with benchmarks X,Y,Z you get X,X,X,Y,Y,Y,Z,Z,Z.

For me this sounds like a separate proposal (however, running benchmark N times and switching to another sounds quite natural).

the full list of N*M things that will run and shuffle that entire list

Great point, this might also catch cases when running test twice in a row is buggy (pre/post-test initialisation contains a bug)

rsc commented 1 year ago

It sounds like maybe the proposal should be that -shuffle=on -count=N handles tests and benchmarks the same way, which is make a list of all the tests (alternately, benchmarks) with N copies each, and shuffle that entire list, and then run them in that order.

If you're running tests and benchmarks, they still run as two separate phases: first tests, then benchmarks.

If -cpu is involved, each cpu set still runs as separate phases around the shuffled sets. So -cpu=1,2,4 runs a shuffled set with cpu=1, a shuffled set (probably differently shuffled) with cpu=2, and then another with cpu=4.

We should probably check whether the current tests -count=3 doing A,B,C,A,B,C,A,B,C runs the parallel tests after each A,B,C or at the end of all of them. If the former then the shuffle will be a bit different, maybe too different.

cristaloleg commented 1 year ago

I haven't thought about -cpu flag. Don't have strong cons/pros why it should/not be involved here. Same for parallel tests.

However, I really like that tests gonna be ran in a more arbitrary order than before.

(I assume -p flag should not change the behaviour in any way)

TheCoreMan commented 1 year ago

Just putting in my 2 cents: whenever I used shuffle I expect the order to change between runs, even (or especially) when using -count. That's why I use "shuffle" 🎴

aclements commented 1 year ago

We should probably check whether the current tests -count=3 doing A,B,C,A,B,C,A,B,C runs the parallel tests after each A,B,C or at the end of all of them.

The answer appears to be that it runs A,B,C,Parallel,A,B,C,Parallel,etc. That is, parallel tests run after each A,B,C, not at the end of all of them.

That's a bit unfortunate for shuffling the whole list. I think that means for tests, we either need to shuffle within each "A,B,C,Parallel" (but independently for each count), or we need to put all of the parallel tests at the end (whether or not we're shuffling).

We could still do a whole-list shuffle for benchmarks. The behavior would be different between tests and benchmarks, but that's already true today, and beyond some pedagogical purity, I'm not sure it matters if they're different.

bcmills commented 1 year ago

One more option for shuffling would be to periodically toggle between Parallel and not. For example, we could shuffle in one “release parallel” event per -count, and then tack on one more “release parallel” event at the end of everything.

(When we hit a `“release parallel” event we would unblock all of the parallel tests buffered so far, wait for them to finish, and then resume running the next possibly-non-parallel test in shuffled order.)

rsc commented 1 year ago

Sounds like the simplest thing to do is say for tests we only shuffle between the individual sections and for benchmarks we shuffle the whole thing.

rsc commented 1 year ago

Based on the discussion above, this proposal seems like a likely accept. — rsc for the proposal review group

rsc commented 1 year ago

No change in consensus, so accepted. 🎉 This issue now tracks the work of implementing the proposal. — rsc for the proposal review group

gopherbot commented 1 week ago

Change https://go.dev/cl/631016 mentions this issue: testing: increase shuffling

neild commented 4 days ago

Sounds like the simplest thing to do is say for tests we only shuffle between the individual sections and for benchmarks we shuffle the whole thing.

I'm curious about the motivation for the distinction. Why not shuffle sections for both? Shuffling the whole thing requires building a complete N*M list (I think, perhaps there's a nice algorithm for streaming permutations I don't know about). That's probably not a problem, but I wince a bit at -count=N allocating an slice of arbitrarily large size.