mozilla / jsonschema-transpiler

Compile JSON Schema into Avro and BigQuery schemas
Mozilla Public License 2.0
42 stars 10 forks source link

Provide an error handling mechanism to drop, cast, or panic on under-specified fields #75

Closed acmiyaguchi closed 5 years ago

acmiyaguchi commented 5 years ago

As per https://github.com/mozilla/mozilla-schema-generator/pull/25#issuecomment-499213896:

{
    "metrics": {
        "type": "object",
        "additionalProperties": False
    }
}

is casted into a JSON blob type in BigQuery.

[
  {
    "mode": "NULLABLE",
    "name": "metrics",
    "type": "STRING"
  }
]

Instead of this, the output could be an empty struct. In general, an empty object could be represented as an empty struct. This option would work in BigQuery since maps are an extension of the struct type, but this could cause ambiguity in the avro/parquet representation.

fbertsch commented 5 years ago

@acmiyaguchi doesn't having "additionalProperties": false eliminate the ambiguity? For a map type, the corresponding JSON Schema representation would simply be "type": "object".

acmiyaguchi commented 5 years ago

@fbertsch You're right. There are probably a few cases that should be considered here:

{}
{"additionalProperties": false}
{"additionalProperties": true}
{"additionalProperties": {}}
{"additionalProperties": {"type": "integer"}

where the resulting types are

struct<>
struct<>
struct<>
map<string, struct<>>
map<string, integer>

Currently, struct<> is being handled as string. We should double check whether BigQuery will accept the following schema:

{
  "fields": [],
  "mode": "REQUIRED",
  "type": "RECORD"
}
fbertsch commented 5 years ago

Upon investigation, BQ does not allow for empty structs. The following fails with BigQuery error in mk operation: Field a.b is type RECORD but has no schema.

tee schema_init.json <<EOF
[
    {
        "description": "first field added",
        "name": "a",
        "type": "RECORD",
        "mode": "REQUIRED",
        "fields": [
            {
                "name": "b",
                "type": "RECORD",
                "mode": "NULLABLE",
                "fields": []
            }
        ]
    }
]
EOF

tee schema_update.json <<EOF
[
    {
        "description": "first field added",
        "name": "a",
        "type": "RECORD",
        "mode": "REQUIRED",
        "fields": [
            {
                "name": "b",
                "type": "RECORD",
                "mode": "NULLABLE",
                "fields": [
                    {
                        "name": "c",
                        "type": "INT64",
                        "mode": "NULLABLE"
                    }
                ]
            }
        ]
    }
]
EOF

bq rm -f -r schema_test
bq mk schema_test
bq mk --schema schema_init.json -t schema_test.test_table

bq show schema_test.test_table

bq update --schema schema_update.json -t schema_test.test_table

bq show schema_test.test_table

My recommendation from here is to hard fail in the transpiler on empty structs, since if they try and be schema-evolved it will not be compatible with string.

acmiyaguchi commented 5 years ago

I've brought up this point before in https://github.com/mozilla/gcp-ingestion/pull/536#issuecomment-486367384, I think it's equally valid to drop or cast. It also looks like there's relevant discussion in https://github.com/mozilla/jsonschema-transpiler/pull/61#issue-274932502 on whether the program should panic in this particular case. I think #55 is the right solution, providing an error handling mechanism to drop, cast, or panic on under-specified fields.

Currently, an empty object will be cast into a string so it can be access via JSON_EXTRACT. If the object is changed in the future, it'll require a version bump and/or a table migration. It sounds like we might want to omit incompatible/placeholder types all together so that the field is moved into the additionalProperties section of the metadata field. This seems to be better for forward compatibility. Taking a stronger stance by hard failing would mean going back into mozilla-pipeline-schemas and fixing all of the ambiguous schemas.