m-lab / etl-gardener

Gardener provides services for maintaining and reprocessing M-Lab data.
Apache License 2.0

Deploy new gardener and k8s etl parser to prod #305

Open gfr10598 opened 3 years ago

gfr10598 commented 3 years ago

Looks like prod mostly runs in us-central rather than the east region. So the new k8s cluster should probably be there too.

There is some documentation in the README.md file from January.

Steps:

  1. Create the data-processing cluster, with appropriate networking options.
  2. Create node-pools for etl and gardener.
  3. Add a Cloud Build trigger for etl `prod-*` tags.
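The steps above could look roughly like the following gcloud commands (a sketch only: the cluster name, region, node counts, and pool names here are assumptions, not confirmed settings):

```shell
# Sketch: region, sizes, and names below are assumptions.
# Step 1: create the data-processing cluster in us-central1.
gcloud container clusters create data-processing \
  --project=mlab-oti \
  --region=us-central1 \
  --enable-ip-alias \
  --num-nodes=1

# Step 2: dedicated node-pools for the etl parser and gardener,
# labeled so each deployment can target its own pool.
gcloud container node-pools create etl-parser-pool \
  --cluster=data-processing \
  --region=us-central1 \
  --num-nodes=3 \
  --node-labels=workload=etl-parser

gcloud container node-pools create gardener-pool \
  --cluster=data-processing \
  --region=us-central1 \
  --num-nodes=1 \
  --node-labels=workload=gardener
```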
gfr10598 commented 3 years ago

Based on the info in README.md, I added create-cluster.sh in a new branch, which has all the gcloud commands to set up the network, subnet, firewall rules, cluster, and node-pools.

gfr10598 commented 3 years ago

Manually added a Cloud Build trigger. Note that `gcloud beta builds` now supports creating triggers, too:

```shell
gcloud beta builds triggers create github \
  --repo-name=[REPO_NAME] \
  --repo-owner=[REPO_OWNER] \
  --branch-pattern=".*" \
  --build-config=[BUILD_CONFIG_FILE]
```

gfr10598 commented 3 years ago

```shell
bq --project_id=mlab-oti mk tmp_ndt
bq --project_id=mlab-oti mk raw_ndt
```

Need to add the table creation and schema updates to etl-schema.
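Once the schemas land in etl-schema, table creation with the same partitioning and clustering could be scripted along these lines (a sketch; the schema file path and flag values are hypothetical):

```shell
# Hypothetical: create a date-partitioned, metro-clustered table
# from a JSON schema file maintained in etl-schema.
bq --project_id=mlab-oti mk --table \
  --time_partitioning_field=date \
  --clustering_fields=metro \
  raw_ndt.ndt7 ./schema/ndt7.json
```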

gfr10598 commented 3 years ago

```sql
CREATE OR REPLACE TABLE `mlab-oti.raw_ndt.ndt7`
PARTITION BY date
CLUSTER BY metro AS
SELECT
  date,
  REGEXP_EXTRACT(parser.ArchiveURL, ".-mlab[1-4]-([a-z]{3})[0-9]{2}.") AS metro,
  id,
  * EXCEPT (date, id)
FROM `mlab-sandbox.tmp_ndt.ndt7`
WHERE date > CURRENT_DATE()
```

gfr10598 commented 3 years ago

```sql
CREATE OR REPLACE TABLE `mlab-oti.raw_ndt.annotation`
PARTITION BY date
CLUSTER BY metro AS
SELECT
  date,
  REGEXP_EXTRACT(parser.ArchiveURL, ".-mlab[1-4]-([a-z]{3})[0-9]{2}.") AS metro,
  id,
  * EXCEPT (date, id)
FROM `mlab-sandbox.tmp_ndt.annotation`
WHERE date > CURRENT_DATE()
```

gfr10598 commented 3 years ago

NOTE: BigQuery does not store data in us-central1. This may mean that we will get network egress charges for the BQ loads.

We should probably specify `data_location=US` for the BQ datasets to make them multi-regional. See https://cloud.google.com/bigquery/docs/locations#multi-regional-locations

The documentation is not crystal clear, so we should probably just look for these charges in billing.
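Creating the datasets with an explicit multi-regional location might look like this (a sketch using the dataset names from the comments above; whether this actually avoids the egress charges would still need to be confirmed in billing):

```shell
# Create the datasets in the US multi-region so BQ loads from
# us-central1 stay within the multi-regional boundary.
bq --project_id=mlab-oti mk --data_location=US tmp_ndt
bq --project_id=mlab-oti mk --data_location=US raw_ndt
```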