Fail mozilla-pipeline-schemas CI if schemas contain nullable array elements

relud commented 4 years ago

jsonschema can allow array elements to be null, but BigQuery can only make fields REPEATED or NULLABLE.

BigQuery parquet imports solve this by wrapping both the array and its elements in structs: an array is transformed into a struct with one repeated field called list, which contains structs with a single field called element. Both the outer struct and the element field can be NULLABLE, while list is REPEATED.
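A minimal sketch of that wrapping, for illustration only. The helper names `wrap_array_schema` and `wrap_array_value` are hypothetical and are not part of the transpiler or of BigQuery; the field layout simply follows the list/element description above.

```python
def wrap_array_schema(name, element_type):
    """Return a BigQuery-style field describing a nullable list of nullable elements.

    Hypothetical helper: the outer struct is NULLABLE, the inner `list`
    field is REPEATED, and each `element` is NULLABLE.
    """
    return {
        "name": name,
        "type": "RECORD",
        "mode": "NULLABLE",
        "fields": [
            {
                "name": "list",
                "type": "RECORD",
                "mode": "REPEATED",
                "fields": [
                    {"name": "element", "type": element_type, "mode": "NULLABLE"}
                ],
            }
        ],
    }


def wrap_array_value(values):
    """Wrap a Python list (or None) so that null elements survive the load."""
    if values is None:
        return None
    return {"list": [{"element": value} for value in values]}


schema = wrap_array_schema("mylist", "INT64")
row = {"mylist": wrap_array_value([1, None, 3])}
```

Here the null in the middle of the array is kept as a struct whose element field is NULL, so positions are preserved, at the cost of a different table schema.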

The transpiler currently converts an array of nullable elements in a jsonschema to a REPEATED field in a BigQuery schema, which cannot contain NULL. For example, {"properties":{"mylist":{"items":{"type":["integer","null"]},"type":"array"}},"type":"object"} is transpiled to [{"mode":"REPEATED","name":"mylist","type":"INT64"}]. This causes issues because if a jsonschema allows a message that BigQuery rejects during a file load operation, the whole file is rejected.

This was discussed in the GCP Technical check-in on 2019-09-30, where it was determined that, for now, due to backwards compatibility constraints, the transpiler should error if schemas allow nullable array elements, and mozilla-pipeline-schemas CI should fail if the transpiler can't transform schemas.
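As a rough illustration of the kind of check this implies, the sketch below walks a jsonschema and reports every array whose items allow null. It is not the transpiler's implementation; the function names are hypothetical, and it only inspects the type keyword under properties and items (it does not handle anyOf, oneOf, or $ref).

```python
def allows_null(schema):
    """True if the schema's `type` keyword permits null."""
    declared = schema.get("type")
    if isinstance(declared, list):
        return "null" in declared
    return declared == "null"


def find_nullable_array_items(schema, path="$"):
    """Return the paths of arrays whose elements may be null (hypothetical helper)."""
    found = []
    if not isinstance(schema, dict):
        return found
    items = schema.get("items")
    if isinstance(items, dict):
        if allows_null(items):
            found.append(path)
        # Keep walking in case the elements are objects or nested arrays.
        found.extend(find_nullable_array_items(items, path + "[]"))
    for name, subschema in schema.get("properties", {}).items():
        found.extend(find_nullable_array_items(subschema, f"{path}.{name}"))
    return found


example = {
    "properties": {"mylist": {"items": {"type": ["integer", "null"]}, "type": "array"}},
    "type": "object",
}
offending = find_nullable_array_items(example)
```

A CI step could fail the build whenever the returned list is non-empty, printing the offending paths.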

acmiyaguchi commented 4 years ago

If we do receive null elements during ingestion and we are not treating them as validation errors, then we would have to strip the null elements from the list as they are being transformed into a BigQuery row.
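A sketch of what that stripping could look like, assuming the message has already been parsed into Python dicts and lists. The function name `strip_null_elements` is hypothetical and does not refer to existing ingestion code.

```python
def strip_null_elements(value):
    """Recursively drop None entries from lists, leaving null object fields alone."""
    if isinstance(value, list):
        return [strip_null_elements(item) for item in value if item is not None]
    if isinstance(value, dict):
        return {key: strip_null_elements(item) for key, item in value.items()}
    return value


message = {"mylist": [1, None, 3], "other": None}
row = strip_null_elements(message)
```

Only array elements are removed; a null field such as other is untouched, since NULLABLE fields are already representable in BigQuery.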

relud commented 4 years ago

> If we do receive null elements during ingestion and we are not treating them as validation errors

which we don't have to worry about, because we decided we should treat them as validation errors for now.

> then we would have to strip the null elements from the list as they are being transformed into a BigQuery row

or we would have to nest elements in a struct and null the field, like parquet imports.

removing nulls would cause issues if element indexes matter, because we would only be preserving array order, not positions. nesting elements would cause issues because it would change the schema of all our tables. nesting elements only when they are nullable and not already structs would cause issues if a schema were modified to make its elements nullable, because that is normally a backwards compatible change, but here it would change the table schema.

those are the issues that led us to decide not to support nullable array elements for now.

mozilla/jsonschema-transpiler issue #91