madecoste / swarming

Automatically exported from code.google.com/p/swarming
Apache License 2.0

linux_swarm_triggered triggers multiple copies of same job and waits for all instances of it, failing in assert #73

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Different builds of 'base_unittests' on linux_rel sometimes end up having the 
exact same isolate hashes (due to incremental builds and CLs not touching base?). 
That causes linux_swarm_triggered to trigger multiple copies of the corresponding 
swarm job.

Later, 'swarming collect' gets the results of ALL copies of a job. Over time the 
number of copies grows beyond 256 and 'swarming collect' fails with an assertion.
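The failure mode can be sketched roughly as follows. This is a hypothetical illustration, not the real swarming client code: the constant name `MAX_SHARDS`, the 256 limit's location, and the `collect` signature are all assumptions based on the description above.

```python
# Hypothetical illustration of the failure mode: each build that produces
# the same isolate hash triggers another copy of the job, and 'swarming
# collect' later gathers results for every copy of that job.

MAX_SHARDS = 256  # assumed limit; the actual constant lives in the swarming client


def collect(job_copies):
    """Gathers the results of every copy of a job sharing one isolate hash."""
    results = [copy['result'] for copy in job_copies]
    # Once the number of accumulated copies exceeds the limit, this
    # assertion fires -- the crash reported in this issue.
    assert len(results) <= MAX_SHARDS, 'too many result shards'
    return results


# One copy per build that re-used the same isolate hash over time:
copies = [{'result': 'PASS'} for _ in range(300)]
try:
    collect(copies)
except AssertionError:
    print('collect failed: more than 256 copies of the same job')
```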

See two different builds that triggered same job:
http://build.chromium.org/p/tryserver.chromium/builders/linux_swarm_triggered/builds/22209/steps/base_unittests/logs/stdio
http://build.chromium.org/p/tryserver.chromium/builders/linux_swarm_triggered/builds/22208/steps/base_unittests/logs/stdio

Original issue reported on code.google.com by vadi...@google.com on 6 Feb 2014 at 7:28

GoogleCodeExporter commented 9 years ago

Original comment by vadimsh@chromium.org on 6 Feb 2014 at 7:28

GoogleCodeExporter commented 9 years ago
So I've got a rough idea of how I can implement this (and I don't think it 
would be too much work).

Basically, when we request a job, Swarm checks the test_case_name and the config 
name to see if we already have that pair in the system. If we don't, then it is 
business as usual.

If we do, Swarm checks that the request is for the same number of instances as we 
currently have. If the counts differ, we abort. (This could cause some trouble if 
we change the shard count, and I'm not sure how to fix that easily, but I think 
it is OK to just disallow it for now and add support for it later.)
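The request-time check described above could look something like this. This is only a sketch of the proposal: `pending_jobs`, `request_job`, and `DuplicateCountError` are hypothetical names, and the real Swarm server keys and stores requests differently.

```python
# Hypothetical sketch of the proposed request-time dedup check.

class DuplicateCountError(Exception):
    """Raised when a duplicate request asks for a different shard count."""


# Maps (test_case_name, config_name) -> number of instances requested.
pending_jobs = {}


def request_job(test_case_name, config_name, num_instances):
    key = (test_case_name, config_name)
    existing = pending_jobs.get(key)
    if existing is None:
        # No prior request with this identity: business as usual.
        pending_jobs[key] = num_instances
        return 'triggered'
    if existing != num_instances:
        # Shard count changed between requests: abort for now, as
        # proposed, until re-sharding support is added.
        raise DuplicateCountError(
            'requested %d instances but %d already exist'
            % (num_instances, existing))
    return 'reused'


print(request_job('base_unittests', 'linux_rel', 4))  # triggered
print(request_job('base_unittests', 'linux_rel', 4))  # reused
```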

When looking at the old runs, if they passed or haven't finished yet we do 
nothing. If they failed, then we trigger a new run for that shard. This means we 
will only rerun individual shards that fail. We could add something in the 
future to randomly retry passed runs (or let users say "don't use cached 
results") if we have problems with flaky tests passing and never running again.
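The per-shard decision above is simple to express. A minimal sketch, assuming shard states are plain strings (the actual Swarm data model differs):

```python
# Hedged sketch of the per-shard retry decision: reuse passed or
# still-running shards, and re-trigger only the failed ones.

def shards_to_retrigger(old_shards):
    """Return the indices of shards whose previous run failed."""
    return [i for i, state in enumerate(old_shards) if state == 'FAILED']


print(shards_to_retrigger(['PASSED', 'FAILED', 'RUNNING', 'FAILED']))  # [1, 3]
```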

The other half of this change would be to modify get_matching_test_cases to 
specify that it only wants the latest shards. The default would still return all 
the test cases, but with some parameter (or maybe a new URL) it would ignore 
shards that failed and were then run again later.
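That filtering could be sketched as follows. The `latest_only` parameter name and the run-record shape are assumptions for illustration, not the real get_matching_test_cases signature:

```python
# Hypothetical sketch of the proposed filtering: by default return every
# run, but with latest_only=True keep only the most recent run per shard,
# dropping earlier (failed) runs that were re-triggered.

def get_matching_test_cases(runs, latest_only=False):
    """`runs` is a list of dicts: {'shard': int, 'state': str, 'timestamp': int}."""
    if not latest_only:
        return runs
    latest = {}
    for run in sorted(runs, key=lambda r: r['timestamp']):
        latest[run['shard']] = run  # later runs supersede earlier ones
    return list(latest.values())


runs = [
    {'shard': 0, 'state': 'FAILED', 'timestamp': 1},
    {'shard': 0, 'state': 'PASSED', 'timestamp': 2},  # re-triggered run
    {'shard': 1, 'state': 'PASSED', 'timestamp': 1},
]
print(len(get_matching_test_cases(runs)))                     # 3
print(len(get_matching_test_cases(runs, latest_only=True)))   # 2
```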

This does create a situation where:
- Bot A triggers a test
- The test fails
- Bot B triggers the same test
- Bot A calls get_matching_test_cases and has to wait for the newer version of the test to finish (instead of the original test)

The gap between when our bots trigger tests and then try to collect them is 
pretty small though (I think), so this doesn't seem like a big issue. Plus, if 
the failure was a flake, Swarm is accidentally retrying it for the bot, 
potentially reducing visible flakes (but again, small window, so probably 
unlikely).

Thoughts?

Original comment by csharp@chromium.org on 7 Feb 2014 at 3:53

GoogleCodeExporter commented 9 years ago
(Just adding here so it is on this bug as well)

If we start storing output files on the isolate server, we need to be careful 
that when we reuse a test, its output is still stored on the isolate server. 
Otherwise Swarm won't start a new version of the test, since it thinks the 
results are cached, but we would actually be missing some of the cached results.
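The guard described above amounts to checking that every output file a prior run produced is still present before treating it as a cache hit. A hypothetical sketch: `can_reuse_cached_run`, the `output_hashes` field, and the `isolate_server_contains` lookup are all stand-ins for whatever the real isolate server API provides.

```python
# Hypothetical guard: only reuse a prior run's cached results if every
# output file it produced still exists on the isolate server.

def can_reuse_cached_run(run, isolate_server_contains):
    """`isolate_server_contains` is a callable hash -> bool (stand-in
    for a real isolate server existence check)."""
    return all(isolate_server_contains(h) for h in run['output_hashes'])


available = {'deadbeef', 'cafef00d'}  # hashes still stored on the server
run = {'output_hashes': ['deadbeef', 'cafef00d']}
print(can_reuse_cached_run(run, available.__contains__))          # True
run_evicted = {'output_hashes': ['deadbeef', 'abadcafe']}
print(can_reuse_cached_run(run_evicted, available.__contains__))  # False
```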

Original comment by csharp@chromium.org on 7 Feb 2014 at 3:56

GoogleCodeExporter commented 9 years ago
I prefer a "re-run everything or run nothing" approach. I think running only 
the subset of shards that failed is going to be very non-intuitive for users. 
In practice, it is also easier to implement (I think).

Original comment by maruel@chromium.org on 11 Feb 2014 at 3:52

GoogleCodeExporter commented 9 years ago
Fixed with 8af24ab04a8f.

I'll work on a brief design doc to explain this feature (and the reasoning for 
it).

Original comment by csharp@chromium.org on 28 Feb 2014 at 2:17