Original comment by vadimsh@chromium.org
on 6 Feb 2014 at 7:28
So I've got a rough idea of how I can implement this (and I don't think it
would be too much work).
Basically, when we request a job, Swarm checks the test_case_name and the config
name to see if we already have that in the system. If we don't, then it is business
as usual.
If we do, Swarm checks that it is requesting the same number of instances as we
currently have. If the counts differ, we abort. (This could cause some trouble if
we change the shard count, and I'm not sure how to fix that easily, but I think
it is ok to just not allow it for now and add support for it later.)
When looking at the old runs, if they passed or haven't finished yet, we do
nothing. If they failed, then we trigger a new run for that shard. This means
we will only rerun the individual shards that fail. We could add something in the
future to randomly retry passed runs (or let users say not to use cached
results) if we have problems with flaky tests passing and not running again.
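A minimal sketch of that retry decision, purely for illustration (ShardRun and
shards_to_retrigger are made-up names, not the actual Swarm data model or API):

```python
# Illustrative sketch only; ShardRun and shards_to_retrigger are hypothetical
# names, not the real Swarm code.
import collections

ShardRun = collections.namedtuple('ShardRun', ['shard_index', 'finished', 'passed'])


def shards_to_retrigger(existing_runs, requested_shard_count):
  """Decides which shards to rerun for a (test_case_name, config name) pair."""
  if not existing_runs:
    # Nothing cached for this test case and config: business as usual.
    return list(range(requested_shard_count))
  if len(existing_runs) != requested_shard_count:
    # The shard count changed since the cached runs; not supported yet, abort.
    raise ValueError('shard count mismatch: %d cached vs %d requested' %
                     (len(existing_runs), requested_shard_count))
  # Passed or still-running shards are reused as-is; only failed shards rerun.
  return [r.shard_index for r in existing_runs if r.finished and not r.passed]


# Shard 0 passed, shard 1 failed, shard 2 is still running -> only rerun shard 1.
runs = [ShardRun(0, True, True), ShardRun(1, True, False), ShardRun(2, False, False)]
assert shards_to_retrigger(runs, 3) == [1]
```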
The other half of this change would be to modify get_matching_test_cases so it
can be told to return only the latest shards. The default would still return all
the test cases, but with some parameter (or maybe a new url) it would ignore
shards that failed and were then rerun later.
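As an illustration of that filtering (again with hypothetical names; shard_index
and created_ts are assumptions about how runs are stored, not the real schema),
keeping only the newest run per shard could look something like:

```python
# Hypothetical sketch of "latest shards only" filtering; the field names
# (shard_index, created_ts) are assumptions about how runs are stored.
import collections

Run = collections.namedtuple('Run', ['shard_index', 'created_ts', 'passed'])


def latest_runs_only(all_runs):
  """Keeps only the most recent run of each shard, hiding older failed runs
  that were retriggered later."""
  latest = {}
  for run in all_runs:
    current = latest.get(run.shard_index)
    if current is None or run.created_ts > current.created_ts:
      latest[run.shard_index] = run
  return [latest[i] for i in sorted(latest)]


# Shard 1 failed at t=1 and was rerun at t=5; only the rerun is returned.
runs = [Run(0, 1, True), Run(1, 1, False), Run(1, 5, True)]
assert latest_runs_only(runs) == [Run(0, 1, True), Run(1, 5, True)]
```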
This does create a situation where:
- Bot A triggers a test
- The test fails
- Bot B triggers the same test
- Bot A calls get_matching_test_cases and has to wait for the newer version of
the test to finish (instead of the original test)
The gap between when our bots trigger tests and when they try to collect them is
pretty small though (I think), so this doesn't seem like a big issue. Plus, if
the failure was a flaky one, Swarm is accidentally retrying the flake for the
bot, potentially reducing visible flakes (but again, small window, so probably
unlikely).
Thoughts?
Original comment by csharp@chromium.org
on 7 Feb 2014 at 3:53
(Just adding here so it is on this bug as well)
If we start storing output files on the isolate server, we need to be careful
that when we reuse a test, its output is still stored on the isolate server.
Otherwise Swarm won't start a new version of the test, since it thinks the
results are cached, but we would actually be missing some of the cached results.
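A rough sketch of that safety check (purely illustrative; run.output_hashes and
the isolate_contains callback are assumptions, not the real isolate server
client API):

```python
# Illustrative only: before reusing a cached run, verify its outputs are still
# on the isolate server. 'run.output_hashes' and 'isolate_contains' are
# hypothetical; the real check would go through the isolate server's API.
def can_reuse_cached_run(run, isolate_contains):
  if not (run.finished and run.passed):
    return False
  # If any output hash was evicted from the isolate server, the cached result
  # is incomplete, so Swarm must rerun the test instead of reusing it.
  return all(isolate_contains(h) for h in run.output_hashes)
```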
Original comment by csharp@chromium.org
on 7 Feb 2014 at 3:56
I prefer a "re-run everything or run nothing" approach. I think running only
the subset of shards that failed is going to be very non-intuitive for users.
In practice, this is also easier to implement (I think).
Original comment by maruel@chromium.org
on 11 Feb 2014 at 3:52
Fixed with 8af24ab04a8f.
I'll work on a brief design doc to explain this feature (and the reasoning for
it).
Original comment by csharp@chromium.org
on 28 Feb 2014 at 2:17
Original issue reported on code.google.com by
vadi...@google.com
on 6 Feb 2014 at 7:28