Closed acmiyaguchi closed 5 years ago
@acmiyaguchi doesn't having "additionalProperties": false
eliminate the ambiguity? For a map
type, the corresponding JSON Schema representation would simply be "type": "object"
.
@fbertsch You're right. There are probably a few cases that should be considered here:
{}
{"additionalProperties": false}
{"additionalProperties": true}
{"additionalProperties": {}}
{"additionalProperties": {"type": "integer"}
where the resulting types are
struct<>
struct<>
struct<>
map<string, struct<>>
map<string, integer>
Currently, struct<>
is being handled as string
. We should double check whether BigQuery will accept the following schema:
{
"fields": [],
"mode": "REQUIRED",
"type": "RECORD"
}
Upon investigation, BQ does not allow for empty structs. The following fails with BigQuery error in mk operation: Field a.b is type RECORD but has no schema
.
tee schema_init.json <<EOF
[
{
"description": "first field added",
"name": "a",
"type": "RECORD",
"mode": "REQUIRED",
"fields": [
{
"name": "b",
"type": "RECORD",
"mode": "NULLABLE",
"fields": []
}
]
}
]
EOF
tee schema_update.json <<EOF
[
{
"description": "first field added",
"name": "a",
"type": "RECORD",
"mode": "REQUIRED",
"fields": [
{
"name": "b",
"type": "RECORD",
"mode": "NULLABLE",
"fields": [
{
"name": "c",
"type": "INT64",
"mode": "NULLABLE"
}
]
}
]
}
]
EOF
bq rm -f -r schema_test
bq mk schema_test
bq mk --schema schema_init.json -t schema_test.test_table
bq show schema_test.test_table
bq update --schema schema_update.json -t schema_test.test_table
bq show schema_test.test_table
My recommendation from here is to hard fail in the transpiler on empty structs, since if they try and be schema-evolved it will not be compatible with string.
I've brought up this point before in https://github.com/mozilla/gcp-ingestion/pull/536#issuecomment-486367384, I think it's equally valid to drop or cast. It also looks like there's relevant discussion in https://github.com/mozilla/jsonschema-transpiler/pull/61#issue-274932502 on whether the program should panic in this particular case. I think #55 is the right solution, providing an error handling mechanism to drop, cast, or panic on under-specified fields.
Currently, an empty object will be cast into a string so it can be access via JSON_EXTRACT
. If the object is changed in the future, it'll require a version bump and/or a table migration. It sounds like we might want to omit incompatible/placeholder types all together so that the field is moved into the additionalProperties section of the metadata field. This seems to be better for forward compatibility. Taking a stronger stance by hard failing would mean going back into mozilla-pipeline-schemas and fixing all of the ambiguous schemas.
As per https://github.com/mozilla/mozilla-schema-generator/pull/25#issuecomment-499213896:
is casted into a JSON blob type in BigQuery.
Instead of this, the output could be an empty struct. In general, an empty object could be represented as an empty struct. This option would work in BigQuery since maps are an extension of the struct type, but this could cause ambiguity in the avro/parquet representation.