mozilla / jsonschema-transpiler

Compile JSON Schema into Avro and BigQuery schemas

Validate BigQuery schemas by round-trip via Avro #63

Open · acmiyaguchi opened this issue 5 years ago

acmiyaguchi commented 5 years ago

With support for data in integration tests added in #62, we can start to validate BigQuery schemas against BigQuery itself. It should be done in the following way (a sketch of steps 2 through 4 follows the list):

  1. Generate an Avro file
  2. Load it into a table with bq load
  3. Dump the resulting table schema with bq show --format=prettyjson
  4. Generate a diff between the transpiled BigQuery schema and the one BigQuery reports
  5. File issues as appropriate
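
A minimal sketch of steps 2 through 4, assuming a scratch GCS bucket and BigQuery dataset already exist and that the transpiler's expected BigQuery schemas are checked in next to the Avro files. All bucket, dataset, and directory names below are placeholders, not part of the repository:

```bash
#!/bin/bash
# Sketch only: bucket, dataset, and directory names are placeholders.
set -euo pipefail

BUCKET="gs://my-validation-bucket"
DATASET="test_bq_schema"

mkdir -p bq-schema

for avro in avro-data/*.avro; do
  # BigQuery table names cannot contain hyphens, so normalize them.
  name="$(basename "${avro%.avro}" | tr '-' '_')"

  # 2. Stage the Avro file and load it, letting BigQuery derive the schema.
  gsutil cp "$avro" "$BUCKET/"
  bq load --source_format=AVRO --replace \
    "${DATASET}.${name}" "$BUCKET/$(basename "$avro")"

  # 3. Dump the schema BigQuery inferred for the table.
  bq show --format=prettyjson "${DATASET}.${name}" \
    | jq '.schema.fields' > "bq-schema/${name}.json"

  # 4. Compare against the schema the transpiler produced
  #    (expected-bq-schema/ is a hypothetical directory).
  diff -u "expected-bq-schema/${name}.json" "bq-schema/${name}.json" || true
done
```

Any non-empty diff from step 4 is what would feed step 5.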
acmiyaguchi commented 5 years ago

This branch implements the basic logic for this workflow: https://github.com/acmiyaguchi/jsonschema-transpiler/tree/validation-roundtrip

In particular, this script will generate Avro files, upload them to GCS, load them into BigQuery, and then fetch the resulting table schemas:

https://github.com/acmiyaguchi/jsonschema-transpiler/blob/74028f53d01f2f26e7838d57620318788d5dfc40/scripts/generate-bq-schema.sh#L1-L42

```
$ ./scripts/generate-bq-schema.sh
4/4 succeeded
Copying file://avro-data/test-array.avro [Content-Type=application/octet-stream]...
Copying file://avro-data/test-long.avro [Content-Type=application/octet-stream]...
Copying file://avro-data/test-map.avro [Content-Type=application/octet-stream]...
Copying file://avro-data/test-string.avro [Content-Type=application/octet-stream]...
\ [4 files][  420.0 B/  420.0 B]                                                
Operation completed over 4 objects/420.0 B.                                      
Dataset 'amiyaguchi-unmap-json-bq-sink:test_bq_schema' successfully created.
Waiting on bqjob_r1874d3f3ac121bff_0000016a75410c76_1 ... (3s) Current status: DONE   
original avro schema:
{
  "items": {
    "type": "long"
  },
  "type": "array"
}
bigquery schema:
[
  {
    "type": "INTEGER",
    "name": "root",
    "mode": "REPEATED"
  }
]
Waiting on bqjob_r7aea09933351cacd_0000016a75412b0a_1 ... (2s) Current status: DONE   
original avro schema:
{
  "type": "string"
}
bigquery schema:
[
  {
    "type": "STRING",
    "name": "root",
    "mode": "REQUIRED"
  }
]
Waiting on bqjob_r2626f2a8102a662b_0000016a75414269_1 ... (2s) Current status: DONE   
original avro schema:
{
  "type": "long"
}
bigquery schema:
[
  {
    "type": "INTEGER",
    "name": "root",
    "mode": "REQUIRED"
  }
]
Waiting on bqjob_r740102a4f0834588_0000016a754158bd_1 ... (3s) Current status: DONE   
original avro schema:
{
  "type": "map",
  "values": {
    "type": "long"
  }
}
bigquery schema:
[
  {
    "fields": [
      {
        "type": "STRING",
        "name": "key",
        "mode": "REQUIRED"
      },
      {
        "type": "INTEGER",
        "name": "value",
        "mode": "REQUIRED"
      }
    ],
    "type": "RECORD",
    "name": "root",
    "mode": "REPEATED"
  }
]
Removing gs://amiyaguchi-unmap-json-bq-sink/test_bq_schema/test-array.avro#1556745355194021...
Removing gs://amiyaguchi-unmap-json-bq-sink/test_bq_schema/test-long.avro#1556745355610474...
Removing gs://amiyaguchi-unmap-json-bq-sink/test_bq_schema/test-map.avro#1556745355937243...
Removing gs://amiyaguchi-unmap-json-bq-sink/test_bq_schema/test-string.avro#1556745356199443...
/ [4 objects]                                                                   
Operation completed over 4 objects.
```

It would be nice to generate data-sets for the cases in tests/resources so that the resulting BigQuery schemas are validated as well; a rough sketch of one way to do that follows.
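
A rough sketch, assuming avro-tools is on the PATH and that each test case gains a small line-delimited JSON data file next to its schema. The tests/resources file layout and the transpiler flags shown here are assumptions, not the crate's actual interface:

```bash
#!/bin/bash
# Sketch only: the *.schema.json / *.data.json layout and the --type flag are assumptions.
set -euo pipefail
mkdir -p avro-data

for schema in tests/resources/*.schema.json; do
  case="$(basename "${schema%.schema.json}")"
  data="tests/resources/${case}.data.json"   # hypothetical sample records, one per line

  # Transpile the JSON Schema into an Avro schema.
  cargo run --quiet -- --type avro < "$schema" > "avro-data/${case}.avsc"

  # Encode the sample records against that schema so they can be bq-loaded.
  avro-tools fromjson --schema-file "avro-data/${case}.avsc" \
    "$data" > "avro-data/${case}.avro"
done
```

The resulting avro-data/ directory could then be fed straight into the round-trip script above.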