m-lab / etl-gardener

Gardener provides services for maintaining and reprocessing mlab data.
Apache License 2.0
13 stars 5 forks source link

Add target Datasets to gardener configuration #390

Closed stephen-soltesz closed 1 year ago

stephen-soltesz commented 1 year ago

This change adds target datasets to the gardener configuration, making the target output datasets locations controlled entirely by configuration rather than static constants in the gardener code and query templates.

With this change, it will be possible to configure both the gs://archive-measurement-lab source as well as project local archive buckets in sandbox and staging environments (if desired), or to add multiple output locations for development. It will also be possible to output "versioned tables" for breaking schema changes, either with the version in the dataset, e.g. v2_ndt.ndt7 or with a future change that adds the version as a suffix to the base table, e.g. ndt.ndt7_v2.

This change makes the set of tables "joined" with annotation data implicit with the Datasets.Join field. Datatypes without this field will not be joined.

Part of:


This change is Reviewable

coveralls commented 1 year ago

Pull Request Test Coverage Report for Build 3618


Changes Missing Coverage Covered Lines Changed/Added Lines %
cloud/bq/ops.go 1 6 16.67%
<!-- Total: 23 28 82.14% -->
Totals Coverage Status
Change from base Build 3603: 0.8%
Covered Lines: 1259
Relevant Lines: 1605

💛 - Coveralls
stephen-soltesz commented 1 year ago

@SaiedKazemi this PR is still a draft, but I would welcome your early feedback.

SaiedKazemi commented 1 year ago

I am not familiar with this repo and do not have much context but the changes looked good to me. Can you please specify how you tested the changes in the commit message?

:lgtm:
stephen-soltesz commented 1 year ago

@cristinaleonr I would welcome your feedback as well.

stephen-soltesz commented 1 year ago

Also FYI: @mattmathis - this is one way of supporting "versioned tables" through gardener configuration.

stephen-soltesz commented 1 year ago

Okay, I've left this PR as draft b/c there are two down stream dependencies to use this feature:

mattmathis commented 1 year ago

Are you using data set table names like tmp_ndt or ndt_tmp? The documentation and views currency use the latter, but point to the former. Is this OK?

stephen-soltesz commented 1 year ago

@cristinaleonr after a long interlude of dependencies now completed, this change is much more modest than how it started. When you have time, PTAL?