m-lab / etl

M-Lab ingestion pipeline
Apache License 2.0

Support writing rows to GCS JSONL files, for BQ Load. #838

Open gfr10598 opened 4 years ago

gfr10598 commented 4 years ago

Streaming inserts are expensive and have assorted other headaches. We should move to BQ Load of JSON data. The first step in this is allowing parsers to export to JSONL files in GCS instead of to BQ inserts.
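For context, the eventual load step this enables might look like the following sketch. The bucket path, dataset, table, and schema file here are illustrative assumptions, not the pipeline's real names:

```shell
# Hypothetical example: load newline-delimited JSON (JSONL) from GCS into
# BigQuery with a batch load job instead of streaming inserts.
# Bucket, dataset, table, and schema names are illustrative only.
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  mlab-sandbox:base_tables.ndt \
  'gs://json-mlab-sandbox/ndt/2020/06/15/*.jsonl' \
  ./ndt_schema.json
```

Batch loads are free of streaming-insert charges, which is the cost motivation above.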

gfr10598 commented 4 years ago

The code for this was completed with #923. However, it also requires that the k8s node pool has adequate permissions on the target GCS bucket.

This has been working properly in sandbox since mid-June, but it is not clear what actions were taken to give the node pool write access to json-mlab-sandbox. Surprisingly, the node pool seems to have read-only storage access.
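One way to confirm the node's access scopes from the CLI (the instance name below is hypothetical; the default GKE scope set includes only devstorage.read_only, which would explain read-only storage access):

```shell
# Inspect the OAuth scopes attached to a node VM.
# The instance name here is hypothetical -- list nodes first with
# `gcloud compute instances list` to find the real one.
gcloud compute instances describe gke-data-processing-default-pool-abc123 \
  --zone=us-east1-b --project=mlab-sandbox \
  --format='value(serviceAccounts[].scopes)'
```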

(screenshot: Access scopes)

gfr10598 commented 4 years ago

PROBABLY not useful.

The service account is determined by a secret in the k8s config.

gcloud container clusters get-credentials data-processing --region us-east1 --project mlab-sandbox 
kubectl get pod etl-parser-85dc68bd86-pcqzk -o yaml | grep secretName
> default-token-f4pmb

kubectl get secret default-token-f4pmb -o yaml | less

gcloud container clusters get-credentials data-processing --region us-east1 --project mlab-sandbox && kubectl get secret default-token-f4pmb -o yaml | less

Next: How does one find the SA and associated ACLs? The token name and the UID don't seem to appear on the SA console page.
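One way to answer this from the CLI rather than the console (a sketch; the node-pool name "default-pool" is an assumption):

```shell
# Show which service account the node pool's VMs run as
# (node-pool name "default-pool" is an assumption).
gcloud container node-pools describe default-pool \
  --cluster=data-processing --region=us-east1 --project=mlab-sandbox \
  --format='value(config.serviceAccount)'

# List the project-level IAM roles bound to that service account.
gcloud projects get-iam-policy mlab-sandbox \
  --flatten='bindings[].members' \
  --filter='bindings.members:etl-k8s-parser@mlab-sandbox.iam.gserviceaccount.com' \
  --format='table(bindings.role)'
```

Bucket-level ACLs won't show up here; those live on the bucket itself (`gsutil iam get gs://BUCKET`).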

gfr10598 commented 4 years ago

metadata:
  annotations:
    kubernetes.io/service-account.name: default
    kubernetes.io/service-account.uid: babbb228-fb69-11e9-933c-42010a8e020c
  creationTimestamp: 2019-10-30T23:05:13Z
  name: default-token-f4pmb
  namespace: default
  resourceVersion: "280"
  selfLink: /api/v1/namespaces/default/secrets/default-token-f4pmb
  uid: bac39308-fb69-11e9-933c-42010a8e020c

gfr10598 commented 4 years ago

Create new node-pools with custom service account.

gcloud --project=mlab-sandbox container node-pools create parser-pool-2 --cluster=data-processing \
--num-nodes=3 --region=us-east1 --scopes storage-rw,compute-rw,bigquery,datastore \
--node-labels=parser-node2=true --enable-autorepair --enable-autoupgrade \
--machine-type=n1-standard-8 --service-account=etl-k8s-parser@mlab-sandbox.iam.gserviceaccount.com

gcloud --project=mlab-staging container node-pools create parser-pool-2 --cluster=data-processing \
--num-nodes=3 --region=us-east1 --scopes storage-rw,compute-rw,bigquery,datastore \
--node-labels=parser-node2=true --enable-autorepair --enable-autoupgrade \
--machine-type=n1-standard-8 --service-account=etl-k8s-parser@mlab-staging.iam.gserviceaccount.com
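For the new pool's SA to actually write JSONL output, the target bucket likely also needs a binding. A sketch of that grant (the objectAdmin role choice is an assumption):

```shell
# Grant the custom node-pool SA write access to the output bucket.
# The objectAdmin role is an assumption; objectCreator may suffice
# if the parser never overwrites or deletes objects.
gsutil iam ch \
  serviceAccount:etl-k8s-parser@mlab-sandbox.iam.gserviceaccount.com:objectAdmin \
  gs://json-mlab-sandbox
```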

gfr10598 commented 4 years ago

Using the new parser-pool-2 (parser-node2), GKE is unable to fetch the container image from gcr.io.

Messing about with the etl-k8s-parser SA, using a local machine with gcloud auth and docker:

gcloud auth configure-docker
gcloud auth activate-service-account etl-k8s-parser@mlab-sandbox.iam.gserviceaccount.com --key-file /Users/gfr/Downloads/mlab-sandbox-fde10b933796.json 
docker pull gcr.io/mlab-sandbox/github.com/m-lab/etl:8f0f7ae9e3ec9e51c11c146a82c1601672521d9d

8f0f7ae9e3ec9e51c11c146a82c1601672521d9d: Pulling from mlab-sandbox/github.com/m-lab/etl
Digest: sha256:0dc53a3fe84b546b347f8a372b37f64e18100ca2b93e2e890c8870938cc11061
Status: Image is up to date for gcr.io/mlab-sandbox/github.com/m-lab/etl:8f0f7ae9e3ec9e51c11c146a82c1601672521d9d
gcr.io/mlab-sandbox/github.com/m-lab/etl:8f0f7ae9e3ec9e51c11c146a82c1601672521d9d

This works. It also works with the default Compute Engine SA, and with gfr@
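Since the pull works locally with the same SA, the GKE failure is likely a missing read grant on GCR's backing storage bucket rather than a bad key. A sketch of the usual fix (the bucket name follows GCR's `artifacts.PROJECT.appspot.com` convention for gcr.io-hosted images):

```shell
# gcr.io images are stored in a GCS bucket; granting the node-pool SA
# read access there is the common fix for image-pull failures on node
# pools that run as a custom service account.
# (Assumes the image lives in gcr.io; regional registries use
# eu.artifacts.* / us.artifacts.* bucket names instead.)
gsutil iam ch \
  serviceAccount:etl-k8s-parser@mlab-sandbox.iam.gserviceaccount.com:objectViewer \
  gs://artifacts.mlab-sandbox.appspot.com
```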