2009 and 2012 have archives with *NO* unique tests

Part of epic #123

Many dates in 2009 and 2012 include archives that disappear when the data is deduplicated, because every test in them also appears in another archive. The dedup prefers the row with the latest parse time, and archives are parsed in arbitrary order, so the set of archives (task_filename) that survive differs from one parsing run to the next.

We run a sanity check to ensure that we haven't "lost" archives in the reprocessing. Since the original processing did not deduplicate the data, this redundancy is causing updates to fail.
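To make the problem concrete, here is a minimal sketch of how such archives could be identified: an archive is flagged when every test in it also appears in at least one other archive. The `test_id` field and the archive names are hypothetical placeholders, not the actual schema; only `task_filename` comes from the tables.

```python
from collections import defaultdict


def redundant_archives(rows):
    """Return archives whose every test also appears in another archive.

    rows: iterable of (task_filename, test_id) pairs.
    test_id is a hypothetical unique key for a test.
    """
    archives_by_test = defaultdict(set)
    tests_by_archive = defaultdict(set)
    for task_filename, test_id in rows:
        archives_by_test[test_id].add(task_filename)
        tests_by_archive[task_filename].add(test_id)
    return sorted(
        archive
        for archive, tests in tests_by_archive.items()
        if all(len(archives_by_test[t]) > 1 for t in tests)
    )


# Hypothetical data: b.tgz only repeats a test that a.tgz already has.
example_rows = [
    ("a.tgz", "t1"), ("a.tgz", "t2"),
    ("b.tgz", "t2"),
    ("c.tgz", "t3"),
]
flagged = redundant_archives(example_rows)
```

Note that if two archives are exact copies of each other, this sketch flags both of them, so a list built this way would need a tie-break before being used to delete rows.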

We should do three things:

1. Change the dedup to prefer:
   a. test rows that have more snapshots (in case an earlier version of the file was truncated)
   b. tests that include metadata (in case some archives don't include the .meta file)
   c. tests from earlier-dated archives (to make the choice stable)
   d. tests with later parse_time (to choose tests processed in a successful task, rather than a terminated task)
2. Proactively remove tests from the existing prod tables, using a black-list of archive names that contain only redundant tests (or a similar deduplication strategy).
3. Document this in a clear way so that other archive users are aware of the problem.
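The proposed dedup preference can be sketched as a sort key, assuming the four criteria are applied in the listed order, each breaking ties left by the one before. The field names `test_id`, `num_snapshots`, `has_meta`, and `archive_date` are hypothetical, `parse_time` is treated as a plain number for simplicity, and the real dedup would presumably be expressed in the query that builds the tables rather than in Python.

```python
def preference(row):
    """Sort key for duplicate rows: smaller tuples are preferred."""
    return (
        -row["num_snapshots"],  # a. more snapshots first
        not row["has_meta"],    # b. rows with metadata first (False < True)
        row["archive_date"],    # c. earlier archive date first (ISO strings)
        -row["parse_time"],     # d. later parse time first
    )


def dedup(rows):
    """Keep one row per test_id, chosen by preference()."""
    best = {}
    for row in rows:
        current = best.get(row["test_id"])
        if current is None or preference(row) < preference(current):
            best[row["test_id"]] = row
    return best


# Hypothetical duplicates of one test in two archives.
example_rows = [
    {"test_id": "t1", "task_filename": "20090102.tgz",
     "archive_date": "2009-01-02", "num_snapshots": 10,
     "has_meta": True, "parse_time": 200},
    {"test_id": "t1", "task_filename": "20090101.tgz",
     "archive_date": "2009-01-01", "num_snapshots": 10,
     "has_meta": True, "parse_time": 100},
]
winner = dedup(example_rows)["t1"]["task_filename"]
```

Because the archive date is compared before the parse time, the surviving `task_filename` here does not depend on the order in which the archives were parsed.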

m-lab / etl-gardener

2009 and 2012 have archives with NO unique tests #138
