m-lab / etl-gardener

Gardener provides services for maintaining and reprocessing mlab data.
Apache License 2.0

2009 and 2012 have archives with *NO* unique tests #138

Open gfr10598 opened 5 years ago

gfr10598 commented 5 years ago

Part of epic #123

Many dates in 2009 and 2012 include archives that disappear when deduplicated, because all of their tests also appear in other archives. Because the dedup prefers the row with the latest parse time, and parsing is done in arbitrary order, the archives (task_filename) that survive differ from one parsing run to the next.

We run a sanity check to ensure that we haven't "lost" archives in the reprocessing. Since the original processing did not deduplicate the data, this redundancy is causing failed updates: archives present in the original tables are missing after reprocessing.
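To make the failure mode concrete, here is a minimal sketch of that kind of check. It is not Gardener's actual code: the row layout, the `test_id` field, and the archive names are assumptions for illustration; only `task_filename` comes from the tables discussed here.

```python
def missing_archives(original_rows, deduped_rows):
    """Return archive names present before dedup but absent afterwards."""
    before = {row["task_filename"] for row in original_rows}
    after = {row["task_filename"] for row in deduped_rows}
    return sorted(before - after)


# Hypothetical data: test t1 appears in two archives, and archive-B
# contains nothing that archive-A does not also contain.
original = [
    {"test_id": "t1", "task_filename": "archive-A.tgz"},
    {"test_id": "t1", "task_filename": "archive-B.tgz"},
    {"test_id": "t2", "task_filename": "archive-A.tgz"},
]

# After dedup, only one row per test survives, so archive-B vanishes.
deduped = [
    {"test_id": "t1", "task_filename": "archive-A.tgz"},
    {"test_id": "t2", "task_filename": "archive-A.tgz"},
]

lost = missing_archives(original, deduped)
```

Any archive reported here would trip the sanity check even though no test data was actually lost.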

We should do three things:

  1. Change the dedup to prefer, in order:
     a. test rows that have more snapshots (in case an earlier version of the file was truncated)
     b. tests that include metadata (in case some archives don't include the .meta file)
     c. tests from earlier-dated archives (to make the choice stable)
     d. tests with later parse_time (to choose tests processed in a successful task, rather than a terminated task)

  2. Proactively remove tests from the existing prod tables, using a black-list of archive names that contain only redundant tests (or a similar deduplication strategy).

  3. Document this in a clear way so that other archive users are aware of the problem.
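The preference order in item 1 can be sketched as a sort key. This is an illustration only, not the production dedup query: `test_id`, `num_snapshots`, and `has_meta` are hypothetical field names, `parse_time` is assumed to be a numeric timestamp, and archive names are assumed to sort chronologically.

```python
def preference_key(row):
    """Smaller key means more preferred."""
    return (
        -row["num_snapshots"],   # a. more snapshots first
        not row["has_meta"],     # b. rows with metadata first
        row["task_filename"],    # c. earlier archive first (assumes names sort by date)
        -row["parse_time"],      # d. later parse_time first
    )


def dedup(rows):
    """Keep the most preferred row for each test, independent of input order."""
    best = {}
    for row in rows:
        current = best.get(row["test_id"])
        if current is None or preference_key(row) < preference_key(current):
            best[row["test_id"]] = row
    return best


# Hypothetical rows: each test appears in two archives.
rows = [
    {"test_id": "t1", "task_filename": "20090101-a.tgz", "num_snapshots": 10,
     "has_meta": False, "parse_time": 100},
    {"test_id": "t1", "task_filename": "20090102-a.tgz", "num_snapshots": 10,
     "has_meta": True, "parse_time": 50},
    {"test_id": "t2", "task_filename": "20090101-a.tgz", "num_snapshots": 5,
     "has_meta": False, "parse_time": 100},
    {"test_id": "t2", "task_filename": "20090102-a.tgz", "num_snapshots": 5,
     "has_meta": False, "parse_time": 200},
]

chosen = dedup(rows)
chosen_reversed = dedup(list(reversed(rows)))
```

Because parse_time is only the final tie-breaker, the surviving archive for each test no longer depends on the order in which tasks happened to be parsed.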

gfr10598 commented 5 years ago

An interesting nuance... If we also prefer tests that include metadata, then we end up including many more tasks. We only get about 0.05% more tests with metadata, but this is likely still preferable.

So, ordering first by metadata, then by task filename, then by whether files are gzipped, we end up with about 3000 more task filenames and 3000 more tests with metadata. This suggests that, of the 14K redundant archives, about 3000 are useful because they include metadata that is missing from other archives.
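A toy example of why putting metadata first increases the number of surviving task filenames. The field names `test_id`, `has_meta`, and `gzipped` are hypothetical, and the sketch assumes gzipped files are the preferred ones, which the comment above does not state.

```python
def surviving_archives(rows, key):
    """Dedup by test under the given ordering and return the archives that remain."""
    best = {}
    for row in rows:
        current = best.get(row["test_id"])
        if current is None or key(row) < key(current):
            best[row["test_id"]] = row
    return {row["task_filename"] for row in best.values()}


def by_filename(row):
    # Ordering by task filename only.
    return (row["task_filename"],)


def by_meta_then_filename(row):
    # Metadata first, then task filename, then gzipped (assumed preferred).
    return (not row["has_meta"], row["task_filename"], not row["gzipped"])


# Hypothetical rows: arch-2 is fully redundant, but only it has metadata for t1.
rows = [
    {"test_id": "t1", "task_filename": "arch-1.tgz", "has_meta": False, "gzipped": True},
    {"test_id": "t1", "task_filename": "arch-2.tgz", "has_meta": True, "gzipped": True},
    {"test_id": "t2", "task_filename": "arch-1.tgz", "has_meta": False, "gzipped": True},
]

without_meta = surviving_archives(rows, by_filename)
with_meta = surviving_archives(rows, by_meta_then_filename)
```

Here the redundant archive survives only under the metadata-first ordering, which is the same effect as the roughly 3000 extra task filenames described above.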