mozilla / jsonschema-transpiler

Compile JSON Schema into Avro and BigQuery schemas
Mozilla Public License 2.0

Convert camel case bigquery column names to snake case #77

Closed jklukas closed 5 years ago

jklukas commented 5 years ago

See https://github.com/mozilla/gcp-ingestion/issues/671

The direct-to-parquet datasets coerce camelCase keys to snake_case, but right now our pipeline of pings into BigQuery does not.

I think this consistent naming would be desirable, and it would be best/simplest to handle it in the pipeline rather than deferring to views. This would require a coordinated change in the schema transpiler and in the BigQuery sink Dataflow jobs.
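The coercion itself is straightforward in the simple case. A minimal sketch of camelCase-to-snake_case conversion (a hypothetical helper, not the transpiler's actual implementation, and ignoring trickier cases like acronym runs):

```rust
// Insert an underscore before each uppercase letter that follows a
// lowercase one, then lowercase everything. Hypothetical helper; the
// real pipeline's rules may differ (e.g. for acronyms or digits).
fn to_snake_case(name: &str) -> String {
    let mut out = String::with_capacity(name.len() + 4);
    let mut prev_lower = false;
    for c in name.chars() {
        if c.is_ascii_uppercase() {
            if prev_lower {
                out.push('_');
            }
            out.push(c.to_ascii_lowercase());
            prev_lower = false;
        } else {
            prev_lower = c.is_ascii_lowercase();
            out.push(c);
        }
    }
    out
}
```

With this rule, `activeAddons` becomes `active_addons`, and names already in snake_case pass through unchanged.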

acmiyaguchi commented 5 years ago

This is another backwards-incompatible change; we should be prepared to drop data at the end of it. Here is where we currently modify properties to satisfy BigQuery column naming:

https://github.com/mozilla/jsonschema-transpiler/blob/a532e90cb7734ce5aa94fa3aa3e7eeebbd1fb935/src/ast.rs#L450-L464
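For context, the kind of normalization that code performs is constrained by BigQuery's column-name rules (columns must match `[a-zA-Z_][a-zA-Z0-9_]*`). A sketch of that style of normalization, assuming those rules (this is illustrative, not the transpiler's exact code at the lines linked above):

```rust
// Replace characters BigQuery disallows in column names with
// underscores, and prefix an underscore if the name would start
// with a digit. Illustrative sketch of the normalization step.
fn normalize_column(name: &str) -> String {
    let mut out: String = name
        .chars()
        .map(|c| if c.is_ascii_alphanumeric() || c == '_' { c } else { '_' })
        .collect();
    if out.chars().next().map_or(true, |c| c.is_ascii_digit()) {
        out.insert(0, '_');
    }
    out
}
```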

jklukas commented 5 years ago

This is another backwards-incompatible change; we should be prepared to drop data at the end of it.

Yes, I want to roll together any remaining disruptive changes in the next few weeks so that we can deploy reliable prod stacks. In particular, I'd love to roll this together with deploying a final main ping schema representation.

relud commented 5 years ago

In direct2parquet we used snake_case partly because of the issues with camelCase described here: https://docs.aws.amazon.com/athena/latest/ug/tables-databases-columns-names.html

jklukas commented 5 years ago

We discussed this in the data platform team meeting today, and there was a question as to where this should happen in the pipeline. The choices are:

  1. Transform in the Decoder immediately after validating the payload against the JSON schema, meaning that the decoded streams in PubSub also carry snake_case-normalized payloads
  2. Transform in the BigQuery sink
  3. Transform in BigQuery via a view generated from the schema
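Whichever stage applies it, options 1 and 2 amount to a recursive key rename over the JSON payload. A toy sketch, where `Value` is a stand-in for a real JSON type such as `serde_json::Value` and `snake` is a naive, hypothetical casing helper:

```rust
// Minimal JSON-like value type for illustration only.
#[derive(Debug, PartialEq)]
enum Value {
    String(String),
    Object(Vec<(String, Value)>),
    Array(Vec<Value>),
}

// Naive casing helper: underscore before every uppercase letter.
fn snake(name: &str) -> String {
    let mut out = String::new();
    for c in name.chars() {
        if c.is_ascii_uppercase() {
            out.push('_');
            out.push(c.to_ascii_lowercase());
        } else {
            out.push(c);
        }
    }
    out
}

// Recursively rename all object keys; values and arrays are
// traversed, leaves pass through unchanged.
fn rename_keys(v: Value) -> Value {
    match v {
        Value::Object(fields) => Value::Object(
            fields
                .into_iter()
                .map(|(k, v)| (snake(&k), rename_keys(v)))
                .collect(),
        ),
        Value::Array(items) => {
            Value::Array(items.into_iter().map(rename_keys).collect())
        }
        other => other,
    }
}
```

Option 3 instead leaves the stored payload alone and does the renaming in SQL, which is where the view-size concern below comes from.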

I worry that trying to handle this in a BigQuery view won't be tenable. It's potentially a very large amount of query text for the view, which could introduce performance issues.

relud commented 5 years ago

I worry that trying to handle this in a BigQuery view won't be tenable. It's potentially a very large amount of query text for the view, which could introduce performance issues.

That could also run into BigQuery's query size quotas:

  • Maximum unresolved standard SQL query length — 1 MB
  • Maximum resolved legacy and standard SQL query length — 12 MB

    The limit on resolved query length includes the length of all views and wildcard tables referenced by the query.

jklukas commented 5 years ago

See proposal document: https://docs.google.com/document/d/1lY0yGiC8Okx0eJAzI_Pik3-kW4CgNBN2OhMLHlM_uLQ/edit

jklukas commented 5 years ago

The snake-casing proposal document closes today. After some refactoring of where we apply snake casing, we are still planning to use snake_case in BQ, so this change is needed.

jklukas commented 5 years ago

@acmiyaguchi is going to work on this early next week.