m-lab / etl

M-Lab ingestion pipeline
Apache License 2.0

Migrate data-processing clusters to us-central1 #1092

Open stephen-soltesz opened 2 years ago

stephen-soltesz commented 2 years ago

The data-processing clusters in mlab-sandbox and mlab-staging are in us-east1, while the archive-measurement-lab bucket is in us-central1. These clusters should be redeployed to us-central1, and their output buckets recreated there. Since we want the GKE cluster to be managed by Terraform, we will recreate the production cluster as well.

Production deployment

Clean up tasks after deployments:

Consider

stephen-soltesz commented 1 year ago

Because of where the v2 data pipeline cluster is located in some projects, data must be transferred between regions in the sandbox and staging projects. This can be eliminated by placing the clusters for these projects in the us-central1 region.

mlab-oti     archive-measurement-lab us-central1 to data-processing us-central1
mlab-staging archive-measurement-lab us-central1 to data-processing us-east1
mlab-sandbox archive-measurement-lab us-central1 to data-processing us-east1

etl-mlab-sandbox    Jun 13, 2017, 3:22:04 PM    Region  us-east1
etl-mlab-staging    Jul 31, 2020, 4:03:17 PM    Region  us-east1
etl-mlab-oti        Aug  6, 2020, 7:48:10 PM    Region  us-central1
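
As a quick check of which projects pay for cross-region transfer, the listing above can be reduced to a comparison of the two region columns. This is a minimal sketch: the `regions` variable and the `cross_region_projects` function are hypothetical names for illustration, and the data is copied by hand from the listing rather than queried from GCP.

```shell
#!/bin/sh
# Region data copied from the listing above (not queried live):
# project, archive-measurement-lab region, data-processing region.
regions='mlab-oti us-central1 us-central1
mlab-staging us-central1 us-east1
mlab-sandbox us-central1 us-east1'

# Print each project whose data-processing cluster sits outside
# the region of the archive bucket it reads from.
cross_region_projects() {
  printf '%s\n' "$regions" | awk '$2 != $3 { print $1 }'
}

cross_region_projects
```

With the values above, this prints mlab-staging and mlab-sandbox, the two projects this issue proposes to move.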

Since this requires updates only to the sandbox and staging projects, the disruption will be minimal.

Changing the data-processing cluster locations will be easy. Changing the output target buckets may not be.

stephen-soltesz commented 1 year ago

The data-processing cluster includes multiple node pools for service-specific workloads:

The commands used to create these node pools vary (and are likely dated or incomplete):

stephen-soltesz commented 1 year ago

Repositories with services on the data-processing cluster (one per node pool):

stephen-soltesz commented 1 year ago

This should be completed using Terraform, not manual, ad hoc recreation.

stephen-soltesz commented 10 months ago

Evidently, while gcloud supports bulk-export for some resource types, GKE is not yet one of them.

Documentation on the Terraform gke module

stephen-soltesz commented 10 months ago

The GKE resources are called something else in this context: ContainerCluster and ContainerNodePool.

Running this command requires permissions beyond the basic roles alone. https://cloud.google.com/asset-inventory/docs/access-control#required_permissions

gcloud beta resource-config bulk-export \
   --resource-types=ContainerCluster,ContainerNodePool \
   --project=mlab-sandbox --resource-format=terraform \
   --path=output

Additional types are ComputeNetwork and ComputeSubnetwork for declaring the VPC networks over which the cluster communicates.

gcloud beta resource-config list-resource-types
gcloud beta resource-config bulk-export  \
    --resource-types=ComputeNetwork,ComputeSubnetwork \
    --project=mlab-sandbox --resource-format=terraform --path=output
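
The same export could be repeated for the other projects. The following is only a sketch: it reuses the flags from the commands above unchanged, takes the project names from the earlier comment, and assumes a per-project output directory (`output/<project>`), which is not from the original commands. The `export_cmd` helper is hypothetical and only prints the commands; nothing is executed.

```shell
#!/bin/sh
# Sketch: print the bulk-export command shown above for a given project
# and resource-type list. The output/<project> path is an assumption.
export_cmd() {
  project=$1
  types=$2
  printf 'gcloud beta resource-config bulk-export --resource-types=%s --project=%s --resource-format=terraform --path=output/%s\n' \
    "$types" "$project" "$project"
}

for project in mlab-sandbox mlab-staging mlab-oti; do
  # Cluster and node pool definitions, then the VPC networks they use.
  export_cmd "$project" ContainerCluster,ContainerNodePool
  export_cmd "$project" ComputeNetwork,ComputeSubnetwork
done
```

Review the printed commands first; piping them to `sh` would run them, which requires the permissions linked above in each project.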

stephen-soltesz commented 10 months ago

Current data processing cluster workloads are using deprecated APIs.

(Screenshot, 2023-08-22 12:36:30 PM)

stephen-soltesz commented 10 months ago

The deprecated APIs appear to come from kube-state-metrics (v2.2.4) in the prometheus-support configuration. Attempting to update to v2.9.2.

stephen-soltesz commented 10 months ago

The archive-* buckets are "Multi-region" buckets:

It is unclear whether this significantly affects costs when the bucket is not explicitly in the cluster's region.

stephen-soltesz commented 10 months ago

Grafana must be restarted in each project to pick up the new datasources for the data-pipeline cluster.

stephen-soltesz commented 10 months ago

Egress traffic from measurement-lab to sandbox/staging appears to have decreased significantly over the weekend, after the data-processing cluster in us-east1 was stopped last week.

(Screenshot, 2023-08-28 10:53:48 AM)

stephen-soltesz commented 10 months ago

The gardener and autoloader also appear to have been working as intended (WAI) in staging over the weekend.

(Screenshots, 2023-08-28 10:56:50 AM and 10:58:41 AM)