materialized tables must have schema updates also

m-lab / etl-gardener

Gardener provides services for maintaining and reprocessing mlab data.

Apache License 2.0


materialized tables must have schema updates also #325

Closed stephen-soltesz closed 2 years ago

stephen-soltesz commented 3 years ago

An undetected failure, present in staging since XXX and in production since 01/25, results from additional fields that were added to tcpinfo but are missing from the materialized ndt7 tables. The net result is a failure to join the annotation and raw_ndt.ndt7 data because the destination table schemas did not match the source schemas.

Example error log:

2021/03/02 18:12:27 actions.go:106: googleapi: Error 400: Invalid schema update. Cannot add fields (field: raw.Upload.ServerMeasurements.TCPInfo.RcvOooPack), invalid 400
2021/03/02 18:12:27 actions.go:111: 20200316:ndt/ndt7 Join googleapi: Error 400: Invalid schema update. Cannot add fields (field: raw.Upload.ServerMeasurements.TCPInfo.RcvOooPack), invalid

This is a fundamental problem. The etl/cmd/update-schema command operates only on the raw tables for primary datatypes. The joined tables have derived schemas that combine the schemas of their input tables, so update-schema does not update them.

stephen-soltesz commented 3 years ago

2020-09-10
- Original update to the tcp-info struct: https://github.com/m-lab/tcp-info/pull/123
2020-09-16
- Merging the next change to etl caused staging to rebuild with the new tcp-info struct: https://github.com/m-lab/etl/pull/956
- Last data joined in staging: 2020-09-15
2021-01-25
- The first production tag on gardener since 2020-10-14 broke production joins: https://github.com/m-lab/etl-gardener/releases/tag/prod-v2.4.1
- The "prod-*" tag deploys both the legacy and universal gardeners: https://github.com/m-lab/etl-gardener/blob/master/.travis.yml#L164

stephen-soltesz commented 3 years ago

Manual repair using bq: the sandbox table was recreated with the correct schema. Export that schema and update the staging and prod tables to match.

# Guarantee that the sandbox table has been recreated by first removing it and allowing gardener to recreate it.
bq rm mlab-sandbox:ndt.ndt7

# After the table exists again, use it as a reference for updating later tables.
bq show --format=prettyjson mlab-sandbox:ndt.ndt7 | jq .schema.fields > ndt7.schema
bq update mlab-staging:ndt.ndt7 ndt7.schema
bq update mlab-oti:ndt.ndt7 ndt7.schema

stephen-soltesz commented 3 years ago

In progress: create alerts that compare metrics such as:

bq_daily_archive_count{datatype=~"ndt7|annotation"}

and,

increase(gcs_archive_files_total{bucket="archive-measurement-lab", experiment="ndt", datatype=~"ndt7|annotation"}[1d] offset 2d)
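One way such a comparison could work is sketched below in Go. It assumes that both expressions can be read as per-datatype daily counts covering the same archives, and that an alert should fire when the BigQuery count drops below some fraction of the GCS count. The `Sample` type, the `Stalled` function, the 0.9 threshold, and the sample values are all hypothetical and are not taken from the actual alert rules.

```go
package main

import "fmt"

// Sample pairs the two assumed daily counts for one datatype.
type Sample struct {
	Datatype string
	BQCount  float64 // hypothetical reading of bq_daily_archive_count
	GCSCount float64 // hypothetical reading of increase(gcs_archive_files_total{...}[1d] offset 2d)
}

// Stalled reports whether the BigQuery count covers less than minRatio of
// the archives seen in GCS. With no archives in GCS there is nothing to
// compare, so it reports false.
func Stalled(s Sample, minRatio float64) bool {
	if s.GCSCount <= 0 {
		return false
	}
	return s.BQCount/s.GCSCount < minRatio
}

func main() {
	// Made-up values for illustration only.
	samples := []Sample{
		{Datatype: "annotation", BQCount: 980, GCSCount: 1000},
		{Datatype: "ndt7", BQCount: 0, GCSCount: 1000},
	}
	for _, s := range samples {
		fmt.Printf("%s stalled=%v\n", s.Datatype, Stalled(s, 0.9))
	}
}
```

In practice the same ratio would more likely be expressed directly as an alerting rule over the two metrics; the sketch only shows the comparison logic.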

stephen-soltesz commented 3 years ago

An additional QueryConfig option in https://github.com/m-lab/etl-gardener/blob/d9582f1131b5d978adbe64ffd341bfd47fda8718/cloud/bq/ops.go#L260-L274 can automatically allow field addition:

SchemaUpdateOptions: []string{"ALLOW_FIELD_ADDITION", "ALLOW_FIELD_RELAXATION"},