m-lab / etl-gardener

Gardener provides services for maintaining and reprocessing mlab data.
Apache License 2.0

Support flexible target dataset configurations #350

Closed stephen-soltesz closed 1 year ago

stephen-soltesz commented 2 years ago

This change starts to support flexible dataset configurations first noted in https://github.com/m-lab/etl-gardener/issues/349

This change would obsolete the Target field in the config.SourceConfig and tracker.Job structures. In its place we add a Datasets record with three fields: Temp for the temporary table; Raw for the raw, deduplicated table, which is 1:1 with GCS files; and Join for the results of joining the raw table with other datatypes. Previously these prefixes were statically defined within the gardener templates. Now, the configuration may specify alternate output target locations.
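As a rough illustration of the shape described above, here is a minimal sketch in Go. Only the Datasets record and its Temp, Raw, and Join fields come from the description. The string field types, the other SourceConfig fields, the withDefaults helper, and the "tmp_"/"raw_" default naming are assumptions made for the example, not the actual gardener code.

```go
package main

import "fmt"

// Datasets names the output locations for one datatype.
// The three fields come from the PR description; the string types are assumed.
type Datasets struct {
	Temp string // temporary table
	Raw  string // raw, deduplicated table, 1:1 with GCS files
	Join string // raw table joined with other datatypes
}

// SourceConfig is a simplified, hypothetical stand-in for config.SourceConfig,
// with Datasets taking the place of the old Target field.
type SourceConfig struct {
	Experiment string
	Datatype   string
	Datasets   Datasets
}

// withDefaults is a hypothetical helper that fills any unset dataset name
// with a default derived from the experiment name. The default naming here
// is illustrative only.
func withDefaults(d Datasets, experiment string) Datasets {
	if d.Temp == "" {
		d.Temp = "tmp_" + experiment
	}
	if d.Raw == "" {
		d.Raw = "raw_" + experiment
	}
	if d.Join == "" {
		d.Join = experiment
	}
	return d
}

func main() {
	cfg := SourceConfig{
		Experiment: "ndt",
		Datatype:   "ndt7",
		// Override only the raw dataset; the others fall back to defaults.
		Datasets: Datasets{Raw: "custom_raw"},
	}
	cfg.Datasets = withDefaults(cfg.Datasets, cfg.Experiment)
	fmt.Printf("%+v\n", cfg.Datasets)
}
```

Keeping the three names in one record means a template can reference a single configured value per output rather than a hard-coded prefix.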



stephen-soltesz commented 2 years ago

@SaiedKazemi FYI

coveralls commented 2 years ago


Coverage increased (+0.2%) to 60.651% when pulling 16f51eb7347d3e5e79a49d23dced54008cc92b9e on sandbox-soltesz-flexible-config into 06fab084fb8fc54a26a42ed6a2ab1e3ff22dcd1c on master.

stephen-soltesz commented 2 years ago

On its own, this change does not work as intended. The tracker.Job struct is used as a map key, and because etl has not been updated with the new structure, the Job value passed to etl and returned in updates differs from the original, so progress is not recorded.

Ideally, there would be separation between the API structure used to communicate with ETL and that used for internal tracking so the two systems could be upgraded separately.
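The failure mode and the proposed separation can be sketched in Go as follows. This is an illustration under assumptions, not the gardener implementation: the Job fields other than Datasets, the Key type, the Key method, and the roundTripOld helper (which simulates a peer that drops the field it does not know about) are all hypothetical.

```go
package main

import "fmt"

// Datasets mirrors the record described in this PR (field types assumed).
type Datasets struct {
	Temp, Raw, Join string
}

// Job is a simplified, hypothetical stand-in for tracker.Job.
type Job struct {
	Experiment string
	Datatype   string
	Date       string
	Datasets   Datasets
}

// Key is a hypothetical internal tracking key. It identifies a job without
// including output configuration, so API changes do not alter it.
type Key struct {
	Experiment string
	Datatype   string
	Date       string
}

// Key derives the internal tracking key from the API-facing Job.
func (j Job) Key() Key {
	return Key{Experiment: j.Experiment, Datatype: j.Datatype, Date: j.Date}
}

// roundTripOld simulates (as an assumption) sending a Job to a peer that
// predates the Datasets field: the field is lost on the way back.
func roundTripOld(j Job) Job {
	j.Datasets = Datasets{}
	return j
}

func main() {
	sent := Job{
		Experiment: "ndt", Datatype: "ndt7", Date: "2021-01-01",
		Datasets: Datasets{Temp: "tmp_ndt", Raw: "raw_ndt", Join: "ndt"},
	}
	byJob := map[Job]string{sent: "Parsing"}       // whole struct as key
	byKey := map[Key]string{sent.Key(): "Parsing"} // separate internal key

	returned := roundTripOld(sent)
	_, okJob := byJob[returned]
	_, okKey := byKey[returned.Key()]
	fmt.Println(okJob, okKey) // prints "false true"
}
```

Go compares struct map keys field by field, so any field that changes in transit makes the lookup miss. A key built only from identifying fields is unaffected, which is one way the two systems could be upgraded independently.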

stephen-soltesz commented 1 year ago

Obsolete