mozilla-services / mozilla-pipeline-schemas

Schemas for Mozilla's data ingestion pipeline and data lake outputs
https://protosaur.dev/mps-deploys/

DENG-476 - Configure main ping to split out use counters #776

Closed relud closed 1 year ago

relud commented 1 year ago

see also https://github.com/mozilla/gcp-ingestion/pull/2380 and https://github.com/mozilla/mozilla-schema-generator/pull/244

dataops-ci-bot commented 1 year ago

Integration report for "Configure main ping to split out use counters"

bq_schema_331c174c-ad19adc8.diff

```diff
diff --new-file --exclude '*.txt' /app/integration/331c174c/metadata.metaschema.1.bq /app/integration/ad19adc8/metadata.metaschema.1.bq
88a89,153
>   "description": "Configuration for splitting a ping into multiple pings by field",
>   "fields": [
>     {
>       "description": "Whether or not to output the unmodified original ping in addition to any generated pings.",
>       "mode": "NULLABLE",
>       "name": "preserve_original",
>       "type": "BOOL"
>     },
>     {
>       "description": "If present, generate a ping containing all fields not included in any subset ping.",
>       "fields": [
>         {
>           "mode": "NULLABLE",
>           "name": "document_namespace",
>           "type": "STRING"
>         },
>         {
>           "mode": "NULLABLE",
>           "name": "document_type",
>           "type": "STRING"
>         },
>         {
>           "mode": "NULLABLE",
>           "name": "document_version",
>           "type": "STRING"
>         }
>       ],
>       "mode": "NULLABLE",
>       "name": "remainder",
>       "type": "RECORD"
>     },
>     {
>       "fields": [
>         {
>           "mode": "NULLABLE",
>           "name": "document_namespace",
>           "type": "STRING"
>         },
>         {
>           "mode": "NULLABLE",
>           "name": "document_type",
>           "type": "STRING"
>         },
>         {
>           "mode": "NULLABLE",
>           "name": "document_version",
>           "type": "STRING"
>         },
>         {
>           "description": "Regular expression matching .-delimited property names that should be moved to this subset ping. Only properties explictly defined in the non-generic json schema of the original ping are supported, because property names are matched during schema generation.",
>           "mode": "NULLABLE",
>           "name": "pattern",
>           "type": "STRING"
>         }
>       ],
>       "mode": "REPEATED",
>       "name": "subsets",
>       "type": "RECORD"
>     }
>   ],
>   "mode": "NULLABLE",
>   "name": "split_config",
>   "type": "RECORD"
> },
> {
```

compact_schema_331c174c-ad19adc8.diff

```diff
diff --new-file --exclude '*.bq' /app/integration/331c174c/metadata.metaschema.1.txt /app/integration/ad19adc8/metadata.metaschema.1.txt
11a12,19
> root.moz_pipeline_metadata.split_config.preserve_original BOOL
> root.moz_pipeline_metadata.split_config.remainder.document_namespace STRING
> root.moz_pipeline_metadata.split_config.remainder.document_type STRING
> root.moz_pipeline_metadata.split_config.remainder.document_version STRING
> root.moz_pipeline_metadata.split_config.subsets.[].document_namespace STRING
> root.moz_pipeline_metadata.split_config.subsets.[].document_type STRING
> root.moz_pipeline_metadata.split_config.subsets.[].document_version STRING
> root.moz_pipeline_metadata.split_config.subsets.[].pattern STRING
```

dataops-ci-bot commented 1 year ago

Integration report for "Configure main ping to split out use counters"

bq_schema_331c174c-3ecbb307.diff

```diff
diff --new-file --exclude '*.txt' /app/integration/331c174c/metadata.metaschema.1.bq /app/integration/3ecbb307/metadata.metaschema.1.bq
88a89,153
>   "description": "Configuration for splitting a ping into multiple pings by field",
>   "fields": [
>     {
>       "description": "Whether or not to output the unmodified original ping in addition to any generated pings.",
>       "mode": "NULLABLE",
>       "name": "preserve_original",
>       "type": "BOOL"
>     },
>     {
>       "description": "If present, generate a ping containing all fields not included in any subset ping.",
>       "fields": [
>         {
>           "mode": "NULLABLE",
>           "name": "document_namespace",
>           "type": "STRING"
>         },
>         {
>           "mode": "NULLABLE",
>           "name": "document_type",
>           "type": "STRING"
>         },
>         {
>           "mode": "NULLABLE",
>           "name": "document_version",
>           "type": "STRING"
>         }
>       ],
>       "mode": "NULLABLE",
>       "name": "remainder",
>       "type": "RECORD"
>     },
>     {
>       "fields": [
>         {
>           "mode": "NULLABLE",
>           "name": "document_namespace",
>           "type": "STRING"
>         },
>         {
>           "mode": "NULLABLE",
>           "name": "document_type",
>           "type": "STRING"
>         },
>         {
>           "mode": "NULLABLE",
>           "name": "document_version",
>           "type": "STRING"
>         },
>         {
>           "description": "Regular expression matching .-delimited property names that should be moved to this subset ping. Only properties explictly defined in the non-generic json schema of the original ping are supported, because property names are matched during schema generation.",
>           "mode": "NULLABLE",
>           "name": "pattern",
>           "type": "STRING"
>         }
>       ],
>       "mode": "REPEATED",
>       "name": "subsets",
>       "type": "RECORD"
>     }
>   ],
>   "mode": "NULLABLE",
>   "name": "split_config",
>   "type": "RECORD"
> },
> {
```

compact_schema_331c174c-3ecbb307.diff

```diff
diff --new-file --exclude '*.bq' /app/integration/331c174c/metadata.metaschema.1.txt /app/integration/3ecbb307/metadata.metaschema.1.txt
11a12,19
> root.moz_pipeline_metadata.split_config.preserve_original BOOL
> root.moz_pipeline_metadata.split_config.remainder.document_namespace STRING
> root.moz_pipeline_metadata.split_config.remainder.document_type STRING
> root.moz_pipeline_metadata.split_config.remainder.document_version STRING
> root.moz_pipeline_metadata.split_config.subsets.[].document_namespace STRING
> root.moz_pipeline_metadata.split_config.subsets.[].document_type STRING
> root.moz_pipeline_metadata.split_config.subsets.[].document_version STRING
> root.moz_pipeline_metadata.split_config.subsets.[].pattern STRING
```

relud commented 1 year ago

> It's not clear to me though how this addresses https://mozilla-hub.atlassian.net/browse/DENG-476

it addresses goal 2:

> goal 2: reduce retention of main pings in favor of derived table with a schema that performs better in bigquery, especially for shredder

Splitting use counters out of the main ping produces new tables with fewer columns, which should perform much better in BigQuery. In particular, shredding use counters and everything else separately reduced total compute time for the tables by 25-30%.
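
As a rough illustration of what the `split_config` metadata describes, the sketch below assigns `.`-delimited property names to subset pings by regular expression and collects everything else into the remainder ping. It is not the code from the linked gcp-ingestion or mozilla-schema-generator PRs. The document types, the pattern, the property names, and the `partition_properties` helper are all hypothetical, and it assumes a pattern must match the whole property name and that the first matching subset wins.

```python
import re

# Hypothetical split_config; field names follow the metaschema diff above,
# but the document types and the pattern are made up for illustration.
split_config = {
    "preserve_original": False,
    "remainder": {
        "document_namespace": "telemetry",
        "document_type": "main-remainder",
        "document_version": "4",
    },
    "subsets": [
        {
            "document_namespace": "telemetry",
            "document_type": "main-use-counter",
            "document_version": "4",
            "pattern": r".*\.use_counter2_.*",
        }
    ],
}


def partition_properties(property_names, config):
    """Group .-delimited property names by the document type that receives them.

    Assumptions (not confirmed by the PR): a pattern must match the whole
    name, and a property goes to the first subset whose pattern matches.
    """
    subsets = config.get("subsets", [])
    assigned = {subset["document_type"]: [] for subset in subsets}
    unmatched = []
    for name in property_names:
        for subset in subsets:
            if re.fullmatch(subset["pattern"], name):
                assigned[subset["document_type"]].append(name)
                break
        else:
            # No subset claimed this property.
            unmatched.append(name)
    # The remainder ping is only generated when "remainder" is present.
    remainder = config.get("remainder")
    if remainder is not None:
        assigned[remainder["document_type"]] = unmatched
    return assigned


# Hypothetical property names, as they might appear in an explicit JSON schema.
names = [
    "payload.histograms.use_counter2_css_property_all_page",
    "payload.histograms.gc_ms",
    "environment.build.version",
]
result = partition_properties(names, split_config)
```

Under these assumptions the use counter property lands in the subset ping and the other two in the remainder ping, so each output table carries only its own columns.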