DENG-476 - Configure main ping to split out use counters

relud commented 1 year ago

Checklist for reviewer:

[ ] Commits should reference a bug or GitHub issue, if relevant (if a bug is referenced, the pull request should include the bug number in the title)
[ ] If adding a new field, the field should have a description (see #576 for an example)
[ ] If coming from a fork, run integration tests: ./.github/push-to-trigger-integration <username>:<branchname>

For Glean changes:

[ ] Update templates/include/glean/CHANGELOG.md

For modifications to schemas in restricted namespaces (see CODEOWNERS):

[ ] Follow the change control procedure

dataops-ci-bot commented 1 year ago

Integration report for "Configure main ping to split out use counters"

`bq_schema_331c174c-ad19adc8.diff`

```diff
diff --new-file --exclude '*.txt' /app/integration/331c174c/metadata.metaschema.1.bq /app/integration/ad19adc8/metadata.metaschema.1.bq
88a89,153
>   "description": "Configuration for splitting a ping into multiple pings by field",
>   "fields": [
>     {
>       "description": "Whether or not to output the unmodified original ping in addition to any generated pings.",
>       "mode": "NULLABLE",
>       "name": "preserve_original",
>       "type": "BOOL"
>     },
>     {
>       "description": "If present, generate a ping containing all fields not included in any subset ping.",
>       "fields": [
>         {
>           "mode": "NULLABLE",
>           "name": "document_namespace",
>           "type": "STRING"
>         },
>         {
>           "mode": "NULLABLE",
>           "name": "document_type",
>           "type": "STRING"
>         },
>         {
>           "mode": "NULLABLE",
>           "name": "document_version",
>           "type": "STRING"
>         }
>       ],
>       "mode": "NULLABLE",
>       "name": "remainder",
>       "type": "RECORD"
>     },
>     {
>       "fields": [
>         {
>           "mode": "NULLABLE",
>           "name": "document_namespace",
>           "type": "STRING"
>         },
>         {
>           "mode": "NULLABLE",
>           "name": "document_type",
>           "type": "STRING"
>         },
>         {
>           "mode": "NULLABLE",
>           "name": "document_version",
>           "type": "STRING"
>         },
>         {
>           "description": "Regular expression matching .-delimited property names that should be moved to this subset ping. Only properties explictly defined in the non-generic json schema of the original ping are supported, because property names are matched during schema generation.",
>           "mode": "NULLABLE",
>           "name": "pattern",
>           "type": "STRING"
>         }
>       ],
>       "mode": "REPEATED",
>       "name": "subsets",
>       "type": "RECORD"
>     }
>   ],
>   "mode": "NULLABLE",
>   "name": "split_config",
>   "type": "RECORD"
> },
> {
```

`compact_schema_331c174c-ad19adc8.diff`

```diff
diff --new-file --exclude '*.bq' /app/integration/331c174c/metadata.metaschema.1.txt /app/integration/ad19adc8/metadata.metaschema.1.txt
11a12,19
> root.moz_pipeline_metadata.split_config.preserve_original BOOL
> root.moz_pipeline_metadata.split_config.remainder.document_namespace STRING
> root.moz_pipeline_metadata.split_config.remainder.document_type STRING
> root.moz_pipeline_metadata.split_config.remainder.document_version STRING
> root.moz_pipeline_metadata.split_config.subsets.[].document_namespace STRING
> root.moz_pipeline_metadata.split_config.subsets.[].document_type STRING
> root.moz_pipeline_metadata.split_config.subsets.[].document_version STRING
> root.moz_pipeline_metadata.split_config.subsets.[].pattern STRING
```

dataops-ci-bot commented 1 year ago

Integration report for "Configure main ping to split out use counters"

`bq_schema_331c174c-3ecbb307.diff`

```diff
diff --new-file --exclude '*.txt' /app/integration/331c174c/metadata.metaschema.1.bq /app/integration/3ecbb307/metadata.metaschema.1.bq
88a89,153
>   "description": "Configuration for splitting a ping into multiple pings by field",
>   "fields": [
>     {
>       "description": "Whether or not to output the unmodified original ping in addition to any generated pings.",
>       "mode": "NULLABLE",
>       "name": "preserve_original",
>       "type": "BOOL"
>     },
>     {
>       "description": "If present, generate a ping containing all fields not included in any subset ping.",
>       "fields": [
>         {
>           "mode": "NULLABLE",
>           "name": "document_namespace",
>           "type": "STRING"
>         },
>         {
>           "mode": "NULLABLE",
>           "name": "document_type",
>           "type": "STRING"
>         },
>         {
>           "mode": "NULLABLE",
>           "name": "document_version",
>           "type": "STRING"
>         }
>       ],
>       "mode": "NULLABLE",
>       "name": "remainder",
>       "type": "RECORD"
>     },
>     {
>       "fields": [
>         {
>           "mode": "NULLABLE",
>           "name": "document_namespace",
>           "type": "STRING"
>         },
>         {
>           "mode": "NULLABLE",
>           "name": "document_type",
>           "type": "STRING"
>         },
>         {
>           "mode": "NULLABLE",
>           "name": "document_version",
>           "type": "STRING"
>         },
>         {
>           "description": "Regular expression matching .-delimited property names that should be moved to this subset ping. Only properties explictly defined in the non-generic json schema of the original ping are supported, because property names are matched during schema generation.",
>           "mode": "NULLABLE",
>           "name": "pattern",
>           "type": "STRING"
>         }
>       ],
>       "mode": "REPEATED",
>       "name": "subsets",
>       "type": "RECORD"
>     }
>   ],
>   "mode": "NULLABLE",
>   "name": "split_config",
>   "type": "RECORD"
> },
> {
```

`compact_schema_331c174c-3ecbb307.diff`

```diff
diff --new-file --exclude '*.bq' /app/integration/331c174c/metadata.metaschema.1.txt /app/integration/3ecbb307/metadata.metaschema.1.txt
11a12,19
> root.moz_pipeline_metadata.split_config.preserve_original BOOL
> root.moz_pipeline_metadata.split_config.remainder.document_namespace STRING
> root.moz_pipeline_metadata.split_config.remainder.document_type STRING
> root.moz_pipeline_metadata.split_config.remainder.document_version STRING
> root.moz_pipeline_metadata.split_config.subsets.[].document_namespace STRING
> root.moz_pipeline_metadata.split_config.subsets.[].document_type STRING
> root.moz_pipeline_metadata.split_config.subsets.[].document_version STRING
> root.moz_pipeline_metadata.split_config.subsets.[].pattern STRING
```

relud commented 1 year ago

> It's not clear to me though how this addresses https://mozilla-hub.atlassian.net/browse/DENG-476

it addresses goal 2:

> goal 2: reduce retention of main pings in favor of derived table with a schema that performs better in bigquery, especially for shredder

the new tables have fewer columns, so they should perform much better in bigquery. in particular, shredding use counters and everything else separately reduced total compute time for the tables by 25-30%.
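for illustration only, here is a minimal Python sketch of how the `split_config` fields shown in the integration reports above (`preserve_original`, `remainder`, `subsets[].pattern`) could partition a ping. It is not the pipeline's implementation: per the field description, the real matching happens against schema property names during schema generation, whereas this sketch works on an example payload. The document type names, the example ping, the use of full-string regex matching, and the helpers `flatten`, `unflatten`, and `split_ping` are all assumptions.

```python
import re
from typing import Any


def flatten(doc: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into {'a.b.c': value} with .-delimited names."""
    flat: dict[str, Any] = {}
    for key, value in doc.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict) and value:
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def unflatten(flat: dict[str, Any]) -> dict[str, Any]:
    """Rebuild a nested dict from .-delimited names."""
    doc: dict[str, Any] = {}
    for name, value in flat.items():
        *parents, leaf = name.split(".")
        node = doc
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return doc


def split_ping(ping: dict[str, Any], split_config: dict[str, Any]) -> list[tuple[str, dict[str, Any]]]:
    """Return (document_type, payload) pairs for each generated ping.

    Assumption: a property belongs to a subset when the subset's pattern
    matches its whole .-delimited name.
    """
    flat = flatten(ping)
    outputs: list[tuple[str, dict[str, Any]]] = []
    claimed: set[str] = set()

    for subset in split_config.get("subsets", []):
        pattern = re.compile(subset["pattern"])
        matched = {k: v for k, v in flat.items() if pattern.fullmatch(k)}
        claimed.update(matched)
        outputs.append((subset["document_type"], unflatten(matched)))

    # The remainder ping holds every field not included in any subset ping.
    remainder = split_config.get("remainder")
    if remainder is not None:
        rest = {k: v for k, v in flat.items() if k not in claimed}
        outputs.append((remainder["document_type"], unflatten(rest)))

    # Optionally emit the unmodified original alongside the generated pings.
    if split_config.get("preserve_original"):
        outputs.append(("original", ping))

    return outputs


# Hypothetical config and ping; these names are not taken from the PR.
EXAMPLE_CONFIG = {
    "preserve_original": False,
    "remainder": {"document_type": "main-remainder"},
    "subsets": [
        {"document_type": "main-use-counter", "pattern": r"payload\.use_counter\..*"},
    ],
}
EXAMPLE_PING = {
    "client_id": "c",
    "payload": {"use_counter": {"a": 1}, "histograms": {"h": 2}},
}

if __name__ == "__main__":
    for doc_type, payload in split_ping(EXAMPLE_PING, EXAMPLE_CONFIG):
        print(doc_type, payload)
```

with this layout, each generated ping carries only its own columns, which is the property the narrower derived tables rely on.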

mozilla-services / mozilla-pipeline-schemas