jklukas opened 5 years ago
I would prefer using a JSON (string) type to encode variant types. This functionality is part of the design and is currently being used for histograms in the main ping. The lack of type information is fine, because all incoming pings are JSON anyway. Casting also handles nested objects: if `"type": ["int", "string"]` can be represented as `"oneOf": [{"type": "int"}, {"type": "string"}]`, then `"oneOf": [{"type": "object", "properties": {"foo": {"type": "int"}}}, {"type": "string"}]` is also a valid union type to handle. Casting the property into a string leaves the specifics to the user via `JSON_EXTRACT`.
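To make the casting idea concrete, here is a minimal JavaScript sketch (the function names are hypothetical, not part of the transpiler): both scalar and nested variant values survive a round trip through a JSON string, so no type information is lost.

```javascript
// Sketch (not transpiler code): a variant-typed value is encoded as a JSON
// string, and the consumer recovers the concrete value by parsing it,
// analogous to using JSON_EXTRACT on the column in BigQuery.
function encodeVariant(value) {
  // Works for scalars, strings, and nested objects alike.
  return JSON.stringify(value);
}

function decodeVariant(encoded) {
  return JSON.parse(encoded);
}

console.log(encodeVariant(4));        // the string '4'
console.log(encodeVariant({foo: 4})); // the string '{"foo":4}'
console.log(typeof decodeVariant(encodeVariant(4))); // 'number'
```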
If appropriate, I think we should use mozilla-schema-generator to fill in sections where variant map types occur, like scalars or user preferences. This skirts around the problem, but it's generally the right choice for our use-case because of the data dictionary.
The use of a struct to encode the variant types is an interesting solution, but I don't think there are that many usability gains over casting into a STRING other than the strong typing. The user still has to know what the type is ahead of time in order to use the struct properly (trying all combinations until the result is not null, perhaps), and it has the possibility of column bloat.
I was worried that `JSON_EXTRACT` would not be able to handle a string not wrapped in `{}` or `[]`, but it appears the following works:

```sql
SELECT JSON_EXTRACT('4', '$')
```

It returns the string `4`, so casting to JSON strings still works for simple types, which is good.
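The same behavior can be checked outside BigQuery: `JSON.parse`, like `JSON_EXTRACT` with path `'$'`, accepts bare scalars as well as wrapped values (an analogy for illustration, not the same implementation).

```javascript
// JSON.parse also accepts values not wrapped in {} or [].
console.log(JSON.parse('4'));      // 4 (a number)
console.log(JSON.parse('"abc"')); // abc (a string)
console.log(JSON.parse('[1, 2]')); // [ 1, 2 ]
```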
So I guess we're at the point of figuring out how we reach a decision about committing to this representation such that we're comfortable deploying it and not being able to back it out.
I think the decision should be left to the developers of the schema, though the option to explicitly cast with `--resolve drop` is currently not supported. One idea is to have a `json` format in the JSON Schema that is special-cased during `ast::Union::collapse` to prevent dropping. An example usage may look like this:
```json
{
  "type": "object",
  "properties": {
    "event": {
      "oneOf": [
        {
          "type": "string",
          "format": "json"
        },
        {
          "type": "array",
          "items": {
            "type": "array",
            "items": {
              "type": "string"
            }
          }
        }
      ]
    }
  }
}
```
Unfortunately, this relaxes the schema and allows for potentially invalid data. The following examples would all be allowed:
```json
{"events": [["1", "a"], ["2", "b"]]}
{"events": "[[\"1\", \"a\"], [\"2\", \"b\"]]"}
{"events": "asdf"}
```
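A hand-rolled check (a sketch, not a real JSON Schema validator) shows why: once one `oneOf` branch is an unconstrained string, any string passes, including ones that are not valid JSON.

```javascript
// Approximates the relaxed "oneOf" above: a value matches if it is any
// string (validators are not required to enforce the "json" format) or
// an array of arrays of strings.
function matchesRelaxedSchema(event) {
  if (typeof event === 'string') return true; // "asdf" sneaks through here
  return Array.isArray(event) &&
    event.every(row => Array.isArray(row) &&
      row.every(cell => typeof cell === 'string'));
}

console.log(matchesRelaxedSchema([["1", "a"], ["2", "b"]]));   // true
console.log(matchesRelaxedSchema('[["1", "a"], ["2", "b"]]')); // true
console.log(matchesRelaxedSchema('asdf'));                     // true
console.log(matchesRelaxedSchema(4));                          // false
```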
The one case where this would help (and that we currently do not support) is maps whose values have varying types. For example:
```json
{
  "type": "object",
  "additionalProperties": {
    "type": ["null", "boolean", "integer", "string"]
  }
}
```

will be empty. This is a similarly structured schema that is well defined:
```console
$ echo '{"type": "object", "additionalProperties": { "type": "string" } }' | jsonschema-transpiler --type bigquery
[
  {
    "fields": [
      {
        "mode": "REQUIRED",
        "name": "key",
        "type": "STRING"
      },
      {
        "mode": "REQUIRED",
        "name": "value",
        "type": "STRING"
      }
    ],
    "mode": "REPEATED",
    "name": "root",
    "type": "RECORD"
  }
]
```
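The repeated key/value RECORD corresponds to a simple transformation of the JSON object; a sketch (hypothetical helper, not transpiler code):

```javascript
// Convert a JSON object into the repeated {key, value} records that the
// transpiled BigQuery schema above expects.
function toKeyValueRecords(obj) {
  return Object.entries(obj).map(([key, value]) => ({ key, value }));
}

console.log(toKeyValueRecords({ a: "1", b: "2" }));
// [ { key: 'a', value: '1' }, { key: 'b', value: '2' } ]
```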
It looks like BigQuery does have a method for handling union types coming from Avro:
> When union only has one non-null type, it converts to a nullable field. Otherwise it converts to a RECORD with a list of nullable fields. Only one of these fields will be set at read time.
This sounds similar to one of the suggestions above (`STRUCT<int INT64, string STRING>`), but I'm not sure if it's the same thing.
But, we have hewed towards BQ's decisions on how to convert Avro types before (turning maps into repeated key/value structs) and I hesitate to deviate in this case.
That's great; the whole "collapse" algorithm was built specifically to get around the lack of support for variant types. The choices in this project have been mostly conservative (aside from object casting and case normalization) with respect to the BigQuery constraints.
The Avro into BigQuery pathway is the de-facto specification, since Avro is more expressive than BigQuery (maps and unions come to mind) and the import pathway is officially supported by BigQuery. I think it would be useful to implement #63 as part of handling unions.
I tested BQ's Avro behavior by creating an Avro schema with a union type:
```json
{
  "type": "record",
  "name": "union_schema",
  "namespace": "com.example",
  "fields": [
    {
      "name": "field1",
      "type": ["null", "boolean", "int", "long", "float", "double", "bytes", "string"]
    }
  ]
}
```
I created a binary Avro file containing a few records matching this schema:

```json
{"field1": {"boolean": false}}
{"field1": {"int": 15}}
{"field1": {"long": 13}}
{"field1": {"float": 14.1}}
{"field1": {"double": 12.3}}
{"field1": {"string": "foo"}}
```
And loaded that file to a new table in BQ:

```shell
bq load --source_format=AVRO --project_id moz-fx-data-shar-nonprod-efed tmp.klukas_avro_test records.avro
```
This is what I saw in the BQ schema browser:

| Field name | Type | Mode |
| --- | --- | --- |
| field1 | RECORD | NULLABLE |
| field1.boolean_value | BOOLEAN | NULLABLE |
| field1.int_value | INTEGER | NULLABLE |
| field1.long_value | INTEGER | NULLABLE |
| field1.float_value | FLOAT | NULLABLE |
| field1.double_value | FLOAT | NULLABLE |
| field1.bytes_value | BYTES | NULLABLE |
| field1.string_value | STRING | NULLABLE |
So it appends `_value` to the Avro type name. Both `int` and `long` map to INTEGER in BQ; both `float` and `double` map to FLOAT.
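The naming scheme observed above can be summarized in a small sketch (the mapping table is reconstructed from the schema-browser output, not taken from an official API):

```javascript
// Avro union branch -> BigQuery field, as observed in the loaded table:
// each non-null branch becomes "<avroType>_value" with NULLABLE mode.
const avroToBq = {
  boolean: 'BOOLEAN', int: 'INTEGER', long: 'INTEGER',
  float: 'FLOAT', double: 'FLOAT', bytes: 'BYTES', string: 'STRING',
};

function unionToBqFields(branches) {
  return branches
    .filter(t => t !== 'null')
    .map(t => ({ name: `${t}_value`, type: avroToBq[t], mode: 'NULLABLE' }));
}

console.log(unionToBqFields(['null', 'boolean', 'int']));
// [ { name: 'boolean_value', type: 'BOOLEAN', mode: 'NULLABLE' },
//   { name: 'int_value', type: 'INTEGER', mode: 'NULLABLE' } ]
```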
Here is a query which pulls out the `[boolean, integer]` union field `foreignInstall` into the above Avro-like form via SQL (and also extracts event content per https://github.com/mozilla/jsonschema-transpiler/issues/38):
```sql
CREATE TEMP FUNCTION
  udf_js_json_extract_events (input STRING)
  RETURNS ARRAY<STRUCT<f0_ INT64,
    f1_ STRING,
    f2_ STRING,
    f3_ STRING,
    f4_ STRING,
    f5_ ARRAY<STRUCT<key STRING, value STRING>>>>
  LANGUAGE js AS """
  if (input == null) {
    return null;
  }
  var parsed = JSON.parse(input);
  var result = [];
  for (var event of parsed) {
    var map_values = [];
    for (var key in event[5]) {
      map_values.push({"key": key, "value": event[5][key]});
    }
    var structured = {
      "f0_": event[0],
      "f1_": event[1],
      "f2_": event[2],
      "f3_": event[3],
      "f4_": event[4],
      "f5_": map_values
    };
    result.push(structured);
  }
  return result;
""";
--
CREATE OR REPLACE TABLE
  `moz-fx-data-shared-prod.analysis.klukas_event_raw`
PARTITION BY
  DATE(submission_timestamp)
CLUSTER BY
  sample_id AS
SELECT
  * REPLACE((
    SELECT
      AS STRUCT payload.*,
      STRUCT( udf_js_json_extract_events(JSON_EXTRACT(additional_properties,
            '$.payload.events.parent')) AS parent,
        udf_js_json_extract_events(JSON_EXTRACT(additional_properties,
            '$.payload.events.content')) AS content,
        udf_js_json_extract_events(JSON_EXTRACT(additional_properties,
            '$.payload.events.extension')) AS extension,
        udf_js_json_extract_events(JSON_EXTRACT(additional_properties,
            '$.payload.events.gpu')) AS gpu,
        udf_js_json_extract_events(JSON_EXTRACT(additional_properties,
            '$.payload.events.dynamic')) AS dynamic ) AS events) AS payload,
  (
    SELECT
      AS STRUCT environment.* REPLACE ((
        SELECT
          AS STRUCT environment.addons.* REPLACE ((
            SELECT
              AS STRUCT environment.addons.theme.*,
              STRUCT(SAFE_CAST(JSON_EXTRACT(additional_properties,
                    '$.environment.addons.theme.foreignInstall') AS INT64) AS long_value,
                SAFE_CAST(JSON_EXTRACT(additional_properties,
                    '$.environment.addons.theme.foreignInstall') AS BOOLEAN) AS boolean_value) AS foreign_install) AS theme)) AS addons)) AS environment)
FROM
  `moz-fx-data-shared-prod.telemetry_stable.event_v4`
WHERE
  DATE(submission_timestamp) = "2019-08-26"
```
And here's a query that filters using this `foreign_install` union type:
```sql
SELECT
  document_id,
  environment.addons.theme.foreign_install.boolean_value AS theme_foreign_install,
  parent_events.*
FROM
  `moz-fx-data-shared-prod.analysis.klukas_event`
CROSS JOIN
  UNNEST(payload.events.parent) AS parent_events
WHERE
  environment.addons.theme.foreign_install.boolean_value = TRUE
LIMIT
  10
```
Seeking feedback on this from data users in #88
Currently, we let fields go to `additional_properties` if they are a union like `[string, int]`. We would like a way to include such fields in the table structure.

**Options for how to express in BQ**

One option is to default to string in this case, so `4` and `"4"` in the JSON both become the string `4`. In this case, we lose information about what the original type was, but that doesn't seem terribly important.

A variant on coercing to strings is to make it a "JSON-formatted string field", such that `4` in JSON becomes the string `4` and `"4"` in JSON becomes the string `"4"` with the quotes retained. That would allow us to maintain the original type information from the JSON, and maybe we'd be able to use the JSON_EXTRACT functions on the field. The extra effort here doesn't seem worth it for the original type information.

Another option is to turn this into a `STRUCT<int INT64, string STRING>` where only one of the values will be non-null.

**Options for naming the fields**

I'm not sure if we've previously discussed the idea of modifying the field name as a way to give ourselves flexibility to change the encoding in the future without having to change the type of a field. For example, if we decided to use a struct, we could change field `field_name` to `field_name_struct` in the output BQ schema, so that if some other representation works better later, we could add it under a different name rather than changing the type of `field_name`, which would necessitate recreating tables.

cc @acmiyaguchi
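The `STRUCT<int INT64, string STRING>` option can be sketched as follows (a hypothetical encoder, assuming a `[string, int]` union where exactly one branch is populated):

```javascript
// Populate exactly one branch of the struct depending on the JSON type.
function toVariantStruct(value) {
  if (typeof value === 'number' && Number.isInteger(value)) {
    return { int: value, string: null };
  }
  return { int: null, string: String(value) };
}

console.log(toVariantStruct(4));   // { int: 4, string: null }
console.log(toVariantStruct("4")); // { int: null, string: '4' }
```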