m-lab / etl-gardener

Gardener provides services for maintaining and reprocessing mlab data.
Apache License 2.0

Feature: Audit table, and versioned job summary files in etl-mlab-* type-date paths #332

Open gfr10598 opened 3 years ago

gfr10598 commented 3 years ago
  1. At the end of a parse cycle, collect the stats from the metadata that the parser provides for all JSON files.
  2. Create a JSON summary file.
  3. Compare it to a summary of the data archives, obtained by querying the objects in gs://archive-mlab-.../exp/type/yyyy/mm/dd
  4. Save a new version of the summary file, with a lifecycle hold so that it isn't automatically cleaned up.
  5. Compare it to the previous summary file to identify any missing files or files with fewer rows.
  6. Trigger additional parsing attempts to recover lost files or rows.
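
As a rough illustration of steps 1, 2, and 5, here is a minimal sketch in Python. It assumes the parser metadata can be reduced to a file name and a row count per JSON file; `FileStats`, `build_manifest`, `find_regressions`, and the manifest field names are all hypothetical and not part of Gardener. Reading and writing the objects in GCS is deliberately left out.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Per-file stats as parser metadata might report them (hypothetical fields)."""
    name: str
    rows: int


def build_manifest(prefix: str, files: list[FileStats]) -> dict:
    """Summarize one type/date prefix into a JSON-serializable manifest."""
    return {
        "prefix": prefix,
        "file_count": len(files),
        "total_rows": sum(f.rows for f in files),
        "files": {f.name: f.rows for f in files},
    }


def find_regressions(previous: dict, current: dict) -> dict:
    """Return files that are missing from, or have fewer rows in, the current manifest."""
    prev_files = previous["files"]
    cur_files = current["files"]
    missing = sorted(name for name in prev_files if name not in cur_files)
    fewer_rows = {
        name: (prev_files[name], cur_files[name])  # (previous rows, current rows)
        for name in sorted(prev_files)
        if name in cur_files and cur_files[name] < prev_files[name]
    }
    return {"missing": missing, "fewer_rows": fewer_rows}


if __name__ == "__main__":
    # Placeholder prefix and file names, for illustration only.
    prefix = "exp/type/2020/01/01"
    previous = build_manifest(prefix, [FileStats("a.json", 10), FileStats("b.json", 5)])
    current = build_manifest(prefix, [FileStats("a.json", 7)])
    print(json.dumps(current, indent=2))
    print(find_regressions(previous, current))
```

Any file listed under `missing` or `fewer_rows` would be a candidate for the additional parsing attempts in step 6.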

The summary file (perhaps call it a manifest?) would persist as a record of what BQ load would then load, while the JSON files would be regarded as temporary and could be cleaned up by lifecycle rules to minimize storage costs.

Gardener might also add a corresponding row to a BQ audit table, including the date/time, the prefix, file stats (number of files and bytes in both archive-mlab-... and json-mlab-...), a summary of responses from the parser, the row count from the metadata, and the row count from the actual BQ load.

This audit table would be maintained indefinitely, and would be useful for tracking very coarse-grained information about every Gardener reprocessing (and daily) job.

If for some reason a Gardener job is abandoned, the versioned summary file might or might not be written, but Gardener might still write an audit table entry to indicate that the job was attempted but not completed.
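
To make the proposed audit row more concrete, here is a minimal Python sketch of one possible shape for it. The `AuditRow` class and every field name are assumptions chosen for illustration; the issue does not define a schema. An abandoned job is modeled here by leaving `loaded_rows` unset and `completed` false.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class AuditRow:
    """One hypothetical audit table row per Gardener job."""
    job_time: str                      # date/time of the job, ISO 8601
    prefix: str                        # type/date prefix that was processed
    archive_files: int                 # file count in the archive bucket
    archive_bytes: int
    json_files: int                    # file count in the JSON output bucket
    json_bytes: int
    parser_responses: dict[str, int]   # summary of parser responses, e.g. counts by status
    metadata_rows: int                 # row count reported by parser metadata
    loaded_rows: Optional[int] = None  # row count from the BQ load, if it ran
    completed: bool = False            # False for an attempted but abandoned job

    def rows_lost(self) -> Optional[int]:
        """Rows reported by the parser but not loaded, or None if no load happened."""
        if self.loaded_rows is None:
            return None
        return self.metadata_rows - self.loaded_rows


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    abandoned = AuditRow(
        job_time="2020-01-02T00:00:00+00:00",
        prefix="exp/type/2020/01/01",
        archive_files=3,
        archive_bytes=3000,
        json_files=3,
        json_bytes=9000,
        parser_responses={"ok": 3},
        metadata_rows=100,
    )
    print(asdict(abandoned))  # a plain dict, suitable for serializing as a table row
```

Keeping `loaded_rows` optional lets the same row type record both completed and abandoned jobs, so the audit table would have an entry for every attempt.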