m-lab / etl-gardener

Gardener provides services for maintaining and reprocessing mlab data.
Apache License 2.0

Flexible configuration - specify output tables through configuration - versioned tables #349

Closed stephen-soltesz closed 11 months ago

stephen-soltesz commented 2 years ago

Today, gardener accepts a configuration that specifies a start date, source buckets, experiment and datatype names, and the target BigQuery dataset and table name.

For example:

- bucket: archive-measurement-lab
  experiment: ndt
  datatype: pcap
  target: raw_ndt.pcap

However, the internal logic of the v2 pipeline ignores the target field, uses static tmp_ and raw_ dataset prefixes, and performs steps that are not configurable (JOINs). Together, these constrain which target table names can be used in practice.

This has impaired our ability to be agile in at least two cases, and more will come in time:

  1. Running an experimental pcap parser with new schemas without interfering with other sandbox deployments.
  2. Running experimental annotation parsing from the synthetic annotation export process.

In practice, we used the standard configuration. Ideally, we would have been able to specify an alternate target table and gardener would have "just worked" with that. For example:

- bucket: archive-measurement-lab
  experiment: ndt
  datatype: pcap
  datasets:
      temp: tmp_gfr_
      raw: raw_gfr_
- bucket: archive-mlab-sandbox
  experiment: ndt
  datatype: annotation
  datasets:
      temp: tmp_soltesz_
      raw: raw_soltesz_

This cannot work today because gardener hard-codes the dataset prefix (e.g. "raw_*"): https://github.com/m-lab/etl-gardener/blob/master/cloud/bq/ops.go#L162

These should be inferred from the configuration instead.
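
As a rough illustration of what "inferred" could mean, here is a minimal Go sketch that derives the temp and raw table names from the proposed `datasets:` block and falls back to today's `tmp_` and `raw_` prefixes when the block is omitted. The `Source` and `Datasets` types and the `TempTable`/`RawTable` helpers are hypothetical and are not gardener's actual API. The naming rule `<prefix><experiment>.<datatype>` is an assumption, extrapolated from the existing `target: raw_ndt.pcap` example.

```go
package main

import "fmt"

// Datasets holds the dataset prefixes from the proposed `datasets:` block.
// Hypothetical type for illustration; not part of gardener.
type Datasets struct {
	Temp string
	Raw  string
}

// Source mirrors one entry of the configuration list shown in this issue.
// Hypothetical type for illustration; not part of gardener.
type Source struct {
	Bucket     string
	Experiment string
	Datatype   string
	Datasets   Datasets
}

// withDefaults fills in the static prefixes the pipeline uses today
// whenever the configuration leaves a prefix empty.
func (d Datasets) withDefaults() Datasets {
	if d.Temp == "" {
		d.Temp = "tmp_"
	}
	if d.Raw == "" {
		d.Raw = "raw_"
	}
	return d
}

// TempTable returns "<temp prefix><experiment>.<datatype>".
// The naming rule is an assumption based on `target: raw_ndt.pcap`.
func (s Source) TempTable() string {
	return s.Datasets.withDefaults().Temp + s.Experiment + "." + s.Datatype
}

// RawTable returns "<raw prefix><experiment>.<datatype>" under the same assumption.
func (s Source) RawTable() string {
	return s.Datasets.withDefaults().Raw + s.Experiment + "." + s.Datatype
}

func main() {
	sources := []Source{
		// No datasets block: falls back to the current static prefixes.
		{Bucket: "archive-measurement-lab", Experiment: "ndt", Datatype: "pcap"},
		// Alternate prefixes, as in the example configuration above.
		{Bucket: "archive-measurement-lab", Experiment: "ndt", Datatype: "pcap",
			Datasets: Datasets{Temp: "tmp_gfr_", Raw: "raw_gfr_"}},
		{Bucket: "archive-mlab-sandbox", Experiment: "ndt", Datatype: "annotation",
			Datasets: Datasets{Temp: "tmp_soltesz_", Raw: "raw_soltesz_"}},
	}
	for _, s := range sources {
		fmt.Println(s.TempTable(), s.RawTable())
	}
}
```

With a helper along these lines, code that currently builds names from a literal "raw_" would ask the job's configuration for the table name, so an entry without a `datasets:` block would keep its current behavior.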

If gardener configuration allowed this degree of flexibility, and the parsers honored the output target for jobs sent by the gardener, then "versioned tables" could be implemented simply as a configuration here. For example:

- bucket: archive-measurement-lab
  experiment: ndt
  datatype: pcap
  datasets:
      temp: v1_tmp_
      raw: v1_raw_
stephen-soltesz commented 11 months ago

Long completed.