Closed jklukas closed 5 years ago
This is another backwards-incompatible change; we should be prepared to drop data at the end of it. Here is where we currently modify properties to satisfy BigQuery column naming:
> This is another backwards-incompatible change; we should be prepared to drop data at the end of it.
Yes, I want to roll together any remaining disruptive changes in the next few weeks so that we can deploy reliable prod stacks. In particular, I'd love to roll this together with deploying a final main ping schema representation.
In direct2parquet we used snake case partly because of the issues with camelCase described here: https://docs.aws.amazon.com/athena/latest/ug/tables-databases-columns-names.html
We discussed this in the data platform team meeting today, and there was a question as to where this should happen in the pipeline. The choices are to coerce names in the pipeline itself or to defer the renaming to BigQuery views.
I worry that trying to handle this in a BigQuery view won't be tenable. It's potentially a very large amount of query text for the view, which could introduce performance issues.
> I worry that trying to handle this in a BigQuery view won't be tenable. It's potentially a very large amount of query text for the view, which could introduce performance issues.
That could also be an issue with the BigQuery query size quota:

- Maximum unresolved standard SQL query length — 1 MB
- Maximum resolved legacy and standard SQL query length — 12 MB

The limit on resolved query length includes the length of all views and wildcard tables referenced by the query.
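To make the quota concern concrete, here is a rough sketch (table and field names are made up) of why a renaming view's text grows with the schema: the view needs one aliased select item per field, so for a deeply nested schema with thousands of fields the resolved view text scales linearly with the field count.

```python
import re


def to_snake_case(name: str) -> str:
    """Naive camelCase -> snake_case conversion, for illustration only."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


def renaming_view_sql(table: str, columns: list[str]) -> str:
    """Build the SELECT text of a view that aliases each column to snake case."""
    items = [f"  {c} AS {to_snake_case(c)}" for c in columns]
    return "SELECT\n" + ",\n".join(items) + f"\nFROM `{table}`"


# One line of query text per field; a ping schema can have thousands of fields.
sql = renaming_view_sql("project.dataset.main_v1", ["clientId", "sessionLength"])
print(sql)
```

This only counts top-level fields; renaming nested STRUCT fields would require rebuilding each struct in the select list, inflating the view text further.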
See proposal document: https://docs.google.com/document/d/1lY0yGiC8Okx0eJAzI_Pik3-kW4CgNBN2OhMLHlM_uLQ/edit
The snake casing proposal document closes today. After some refactoring of where we apply snake casing, we are still planning to use snake case in BQ, so this change is needed.
@acmiyaguchi is going to work on this early next week.
See https://github.com/mozilla/gcp-ingestion/issues/671
The direct-to-parquet datasets coerce camelCase keys to snake_case, but right now our pipeline of pings into BigQuery does not.
I think this consistent naming would be desirable, and it would be best/simplest to handle it in the pipeline rather than deferring to views. This would require a coordinated change in the schema transpiler and in the BigQuery sink Dataflow jobs.
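For reference, a minimal sketch of the camelCase-to-snake_case coercion under discussion (this is an illustrative helper, not the actual transpiler or direct2parquet code, and edge-case handling would need to match whatever convention those components agree on):

```python
import re


def to_snake_case(name: str) -> str:
    """Coerce a camelCase or PascalCase field name to snake_case."""
    # Insert an underscore before an upper-case letter that follows a
    # lower-case letter or digit: "clientId" -> "client_Id".
    s = re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name)
    # Split trailing words out of acronym runs: "HTTPStatus" -> "HTTP_Status".
    s = re.sub(r"(?<=[A-Z])([A-Z][a-z])", r"_\1", s)
    return s.lower()


print(to_snake_case("clientId"))    # client_id
print(to_snake_case("HTTPStatus"))  # http_status
```

The tricky part, as noted above, is that the pipeline and the schema transpiler must apply the exact same rules, or the sink's column names won't line up with the transpiled table schemas.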