Migrate data-processing clusters to us-central1

stephen-soltesz commented 2 years ago

The data-processing clusters in mlab-sandbox & mlab-staging are in us-east1, while the archive-measurement-lab bucket is in us-central1. These clusters should be redeployed to us-central1, and their output buckets recreated there. Since we want the GKE cluster to be managed by Terraform, we will recreate the production cluster as well.

[x] create new data-processing cluster in us-central1 for sandbox & staging
[x] create new etl-$PROJECT replacement bucket in us-central
[x] create new etl-$PROJECT-us-central1 buckets & update etl & gardener configuration to use them

Production deployment

[x] Tag terraform-support repo to create data-pipeline cluster in production

[x] Import service account into TF by hand.

terraform import module.data-pipeline.google_service_account.stats_pipeline  \
    projects/mlab-oti/serviceAccounts/stats-pipeline@mlab-oti.iam.gserviceaccount.com

[x] Add role binding to new GKE cluster:

kubectl create clusterrolebinding additional-cluster-admins  --clusterrole=cluster-admin  \
    --user=<id>@cloudbuild.gserviceaccount.com

[x] Update CB substitutions for the six data pipeline service repos.
[x] Tag all six data pipeline service repos to deploy to data-pipeline cluster
[x] Create DNS record for prometheus-data-pipeline.mlab-oti.measurementlab.net using Cluster LB address
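
For the service account import step above, the import ID follows the pattern projects/<project>/serviceAccounts/<email>. Below is a minimal sketch that builds that ID so the same step could be repeated for another project. The helper name sa_import_id is hypothetical, and the script only prints the command rather than running Terraform.

```shell
#!/bin/sh
# sa_import_id is a hypothetical helper (not part of terraform or gcloud).
# It builds the import ID for a google_service_account resource:
#   projects/<project>/serviceAccounts/<account>@<project>.iam.gserviceaccount.com
sa_import_id() {
  project="$1"
  account="$2"
  printf 'projects/%s/serviceAccounts/%s@%s.iam.gserviceaccount.com\n' \
    "$project" "$account" "$project"
}

# Print (do not execute) the import command used for production.
echo "terraform import module.data-pipeline.google_service_account.stats_pipeline" \
  "$(sa_import_id mlab-oti stats-pipeline)"
```

Printing the command first makes it easy to review the resource address and ID before touching Terraform state.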

Cleanup tasks after deployments:

[x] Remove services from sandbox & staging data-processing cluster
[x] Remove services from production data-processing cluster
[x] Remove prometheus-data-processing.$PROJECT.* DNS records
[ ] Remove old data sources from prometheus-support & Grafana
[ ] Remove etl-$PROJECT intermediate buckets
[x] Remove data-processing clusters

Consider

[ ] recreating etl-$PROJECT bucket in us-central & update etl parser to use the short name again
[ ] recreating the archive-$PROJECT buckets to be single-region (not multi-region) in us-central

stephen-soltesz commented 1 year ago

Due to the v2 data pipeline cluster location in some projects, data must be transferred between regions in the sandbox and staging projects. This can be eliminated by placing these projects' clusters in the us-central1 region.

mlab-oti     archive-measurement-lab us-central1 to data-processing us-central1
mlab-staging archive-measurement-lab us-central1 to data-processing us-east1
mlab-sandbox archive-measurement-lab us-central1 to data-processing us-east1
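
The cross-region pairs in the listing above can also be picked out mechanically. This is a rough sketch, assuming the rows are reduced to three whitespace-separated fields (project, bucket region, cluster region); find_cross_region is a hypothetical helper name.

```shell
#!/bin/sh
# find_cross_region is a hypothetical helper. It reads rows of
#   <project> <bucket-region> <cluster-region>
# on stdin and prints only the projects whose two regions differ.
find_cross_region() {
  while read -r project bucket_region cluster_region; do
    if [ "$bucket_region" != "$cluster_region" ]; then
      echo "$project: $bucket_region -> $cluster_region"
    fi
  done
}

# Regions copied from the listing above.
find_cross_region <<'EOF'
mlab-oti us-central1 us-central1
mlab-staging us-central1 us-east1
mlab-sandbox us-central1 us-east1
EOF
```

With the regions above, only mlab-staging and mlab-sandbox are reported, which matches the two projects that need to move.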

etl-mlab-sandbox    Jun 13, 2017, 3:22:04 PM    Region  us-east1
etl-mlab-staging    Jul 31, 2020, 4:03:17 PM    Region  us-east1
etl-mlab-oti        Aug  6, 2020, 7:48:10 PM    Region  us-central1

Since this requires updates only to the sandbox and staging projects, the disruption will be minimal.

Changing the data-processing cluster locations will be easy. Changing the output target buckets may not be.

stephen-soltesz commented 1 year ago

The data-processing cluster includes multiple node pools for service-specific workloads:

default-pool         Ok  1.21.12-gke.2200  1 (0 - 1 per zone)  n1-standard-4
downloader-pool      Ok  1.21.12-gke.2200  3 (1 per zone)      n1-standard-2
parser-pool          Ok  1.21.12-gke.2200  8 (2 - 3 per zone)  n1-standard-16
prometheus-pool      Ok  1.21.12-gke.2200  3 (1 per zone)      n1-standard-4
stats-pipeline-pool  Ok  1.21.12-gke.2200  3 (1 per zone)      n2-standard-8
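
For sizing the replacement cluster, it may help to total the vCPUs in the listing above. The sketch below assumes the fourth field is the current node count and relies on the trailing number of an n1-standard or n2-standard machine type being its vCPU count. pool_vcpus is a hypothetical helper name.

```shell
#!/bin/sh
# pool_vcpus is a hypothetical helper. For each node pool row on stdin it
# prints "<pool> <vcpus>", then a final "total <vcpus>" line.
# Assumes field 4 is the current node count and the last field is a
# *-standard-N machine type, where N is the vCPU count per node.
pool_vcpus() {
  awk '{
    n = split($NF, parts, "-")   # e.g. n1-standard-16 -> parts[n] = 16
    vcpus = $4 * parts[n]        # node count x vCPUs per node
    total += vcpus
    printf "%s %d\n", $1, vcpus
  }
  END { printf "total %d\n", total }'
}

# Rows copied from the node pool listing above.
pool_vcpus <<'EOF'
default-pool Ok 1.21.12-gke.2200 1 (0 - 1 per zone) n1-standard-4
downloader-pool Ok 1.21.12-gke.2200 3 (1 per zone) n1-standard-2
parser-pool Ok 1.21.12-gke.2200 8 (2 - 3 per zone) n1-standard-16
prometheus-pool Ok 1.21.12-gke.2200 3 (1 per zone) n1-standard-4
stats-pipeline-pool Ok 1.21.12-gke.2200 3 (1 per zone) n2-standard-8
EOF
```

The totals reflect node counts at the time of the listing only; autoscaled pools such as parser-pool can differ at other times.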

The commands used to create these node pools are scattered across several repositories (and are likely dated or incomplete):

data-processing cluster (default pool) - https://github.com/m-lab/etl-gardener/blob/main/create-pipeline-cluster.sh
parser-pool - https://github.com/m-lab/etl-gardener/blob/main/create-parser-pool.sh
downloader - https://github.com/m-lab/downloader/blob/main/README.md#cluster-creation
prometheus - a variant of https://github.com/m-lab/prometheus-support/blob/main/README.md#within-an-existing-cluster
stats-pipeline-pool - appears to have been ad-hoc and undocumented.

stephen-soltesz commented 1 year ago

Repositories with services on the data-processing cluster (one per node pool):

etl
etl-gardener
prometheus-support
stats-pipeline
downloader
autoloader

stephen-soltesz commented 1 year ago

This should be completed using Terraform, not manual, ad hoc recreations.