Decide how to handle union types (like [string, int]) in BQ

jklukas commented 5 years ago

Currently, we let fields go to additional_properties if they are a union like [string, int]. We would like a way to include such fields in the table structure.

Options for how to express in BQ

One option is to default to string in this case, so 4 and "4" in the JSON both become string 4. In this case, we lose information about what the original type was, but that doesn't seem terribly important.

A variant on coercing to strings is that we could have it be a "JSON-formatted string field" such that 4 in JSON becomes string 4 and "4" in JSON becomes string"4" with the quotes retained. That would allow us to maintain original type information from the JSON, and maybe we'd be able to use JSON_EXTRACT functions on the field. The extra effort here doesn't seem worth it for the original type information.

Another option is to turn this into a STRUCT<int INT64, string STRING> where only one of the values will be non-null.

Options for naming the fields

I'm not sure if we've previously discussed the idea of potentially modifying the field name as a way to give ourselves flexibility to change how we want to encode in the future without having to change the type of a field.

For example, if we decided to use a struct, we could change field field_name to field_name_struct in the output BQ schema so that if we decide that some other representation works better, we could add it with a different name rather than having to change the type of field_name, necessitating recreating tables.

cc @acmiyaguchi

acmiyaguchi commented 5 years ago

I would prefer using a JSON (string) type to encode a variant types. This functionality is part of the design and is currently being used for histograms in the main ping. The lack of type information is fine, because all incoming pings are JSON anyways. Casting is also a behavior that handles nested objects e.g. if "type": ["int", "string"] can be represented as "oneOf": [{"type": "int"}, {"type": "string"}], then "oneOf": [ {"type": "object", "properties": {"foo": {"type": "int"}}}, {"type": "string}] is also a valid union type to handle. Casting the property into the string leaves the specifics to the user via JSON_EXTRACT.

If appropriate, I think we should use mozilla-schema-generator to fill in sections where variant map types occur, like scalars or user preferences. This skirts around the problem, but it's generally the right choice for our use-case because of the data dictionary.

The use of a struct to encode the variant types is an interesting solution, but I don't think there are that many usability gains over casting into a STRING other than the strong typing. The user still has to know what the type is ahead of time in order to use the struct properly (trying all combinations until the result is not null, perhaps), and it has the possibility of column bloat.

jklukas commented 5 years ago

I was worried that JSON_EXTRACT would not be able to handle a string not wrapped in {} or [], but it appears the following works:

SELECT JSON_EXTRACT('4', '$')

And it returns string 4. So casting to JSON strings still works for simple types, which is good.

So I guess we're at the point of figuring out how we reach a decision about committing to this representation such that we're comfortable deploying it and not being able to back it out.

acmiyaguchi commented 5 years ago

I think the decision should be left to the developers of the schema, though the option to explicitly cast with the --resolve drop is currently not supported. One idea is to have a json format in the JSON schema that is special cased during ast::Union::collapse to prevent dropping. An example usage may look like this:

{
    "type": "object",
    "properties": {
        "event": {
            "oneOf": [
                {
                    "type": "string", 
                    "format": "json"
                },
                {
                    "type": "array", 
                    "items": {
                        "type": "array", 
                        "items": {
                            "type": "string"
                        }
                    }
                }
            ]
        }
    }
}

Unfortunately, this relaxes the schema and allows for potentially invalid data. The following examples would all be allowed:

{"events": [["1", "a"], ["2", "b"]]}
{"events": "[[\"1\", \"a\"], [\"2\", \"b\"]]"}
{"events": "asdf"}

acmiyaguchi commented 5 years ago

The one case where this would help (and that we currently do not support) are maps of variable types. For example:

{
  "type": "object",
  "additionalProperties": {
    "type": ["null", "boolean", "integer", "string"]
   }
}

will be empty. This is a similarly structured schema that is well defined:

$ echo '{"type": "object", "additionalProperties": { "type": "string" } }' | jsonschema-transpiler --type bigquery
[
  {
    "fields": [
      {
        "mode": "REQUIRED",
        "name": "key",
        "type": "STRING"
      },
      {
        "mode": "REQUIRED",
        "name": "value",
        "type": "STRING"
      }
    ],
    "mode": "REPEATED",
    "name": "root",
    "type": "RECORD"
  }
]

jklukas commented 5 years ago

It looks like BigQuery does have a method for handling union types coming from Avro:

When union only has one non-null type, it converts to a nullable field. Otherwise it converts to a RECORD with a list of nullable fields. Only one of these fields will be set at read time.

This sounds similar to one of the suggestions above (STRUCT<int INT64, string STRING>), but I'm not sure if it's the same thing.

But, we have hewed towards BQ's decisions on how to convert Avro types before (turning maps into repeated key/value structs) and I hesitate to deviate in this case.

acmiyaguchi commented 5 years ago

That's great, the whole "collapse" algorithm was built to specifically get around the lack of support for variant types. The choices in this project have been mostly conservative (aside from object casting and case normalization) with respect to the BigQuery constraints.

The Avro into BigQuery pathway is the de-facto specification, since Avro is more expressive than BigQuery (maps and unions come to mind) and the import pathway is officially supported by BigQuery. I think it would be useful to implement #63 as part of handling unions.

jklukas commented 5 years ago

Tested BQ's avro behavior. Created an Avro schema with a union type:

{
  "type" : "record",
  "name" : "union_schema",
  "namespace" : "com.example",
  "fields" : [ {
    "name" : "field1",
      "type" : ["null", "boolean", "int", "long", "float", "double", "bytes", "string" ]
  } ]
}

I created a binary avro file containing a few records matching this schema:

{"field1": {"boolean": false}}
{"field1": {"int": 15}}
{"field1": {"long": 13}}
{"field1": {"float": 14.1}}
{"field1": {"double": 12.3}}
{"field1": {"string": "foo"}}

And loaded that file to a new table in BQ:

bq load --source_format=AVRO --project_id moz-fx-data-shar-nonprod-efed tmp.klukas_avro_test records.avro

This is what I saw in the BQ schema browser:

Field name	Type	Mode
field1	RECORD	NULLABLE
field1. boolean_value	BOOLEAN	NULLABLE
field1. int_value	INTEGER	NULLABLE
field1. long_value	INTEGER	NULLABLE
field1. float_value	FLOAT	NULLABLE
field1. double_value	FLOAT	NULLABLE
field1. bytes_value	BYTES	NULLABLE
field1. string_value	STRING	NULLABLE

So it appends _value to the avro type name. Both int and long map to INTEGER in BQ, both float and double map to FLOAT.

jklukas commented 5 years ago

Here is a query which pulls out the [boolean, integer] union field foreignInstall into the above avro-like form via a SQL query (and also extracts event content per https://github.com/mozilla/jsonschema-transpiler/issues/38):

CREATE TEMP FUNCTION
  udf_js_json_extract_events (input STRING)
  RETURNS ARRAY<STRUCT< f0_ INT64,
  f1_ STRING,
  f2_ STRING,
  f3_ STRING,
  f4_ STRING,
  f5_ ARRAY<STRUCT<key STRING,
  value STRING>> >>
  LANGUAGE js AS """
    if (input == null) {
      return null;
    }
    var parsed = JSON.parse(input);
    var result = [];
      for (var event of parsed) {
        map_values = []
        for (var key in event[5]) {
          map_values.push({"key": key, "value": event[5][key]})
        }
        var structured = {
          "f0_": event[0],
          "f1_": event[1],
          "f2_": event[2],
          "f3_": event[3],
          "f4_": event[4],
          "f5_": map_values
        }
        result.push(structured)
    }
    return result;
""";
  --
CREATE OR REPLACE TABLE
  `moz-fx-data-shared-prod.analysis.klukas_event_raw`
PARTITION BY
  DATE(submission_timestamp)
CLUSTER BY
  sample_id AS
SELECT
  * REPLACE((
    SELECT
      AS STRUCT payload.*,
      STRUCT( udf_js_json_extract_events(json_EXTRACT(additional_properties,
            '$.payload.events.parent')) AS parent,
        udf_js_json_extract_events(json_EXTRACT(additional_properties,
            '$.payload.events.content')) AS content,
        udf_js_json_extract_events(json_EXTRACT(additional_properties,
            '$.payload.events.extension')) AS extension,
        udf_js_json_extract_events(json_EXTRACT(additional_properties,
            '$.payload.events.gpu')) AS gpu,
        udf_js_json_extract_events(json_EXTRACT(additional_properties,
            '$.payload.events.dynamic')) AS dynamic ) AS events) AS payload, (
    SELECT
      AS STRUCT environment.* REPLACE ((
        SELECT
          AS STRUCT environment.addons.* REPLACE ((
            SELECT
              AS STRUCT environment.addons.theme.*,
              STRUCT(SAFE_CAST(JSON_EXTRACT(additional_properties,
                    '$.environment.addons.theme.foreignInstall') AS INT64) AS long_value,
                SAFE_CAST(JSON_EXTRACT(additional_properties,
                    '$.environment.addons.theme.foreignInstall') AS BOOLEAN) AS boolean_value) AS foreign_install) AS theme)) AS addons)) AS environment)
FROM
  `moz-fx-data-shared-prod.telemetry_stable.event_v4`
WHERE
  DATE(submission_timestamp) = "2019-08-26"

jklukas commented 5 years ago

And here's a query that filters using this foreign_install union type:

SELECT
  document_id,
  environment.addons.theme.foreign_install.boolean_value AS theme_foreign_install,
  parent_events.*
FROM
  `moz-fx-data-shared-prod.analysis.klukas_event`
CROSS JOIN
  UNNEST(payload.events.parent) AS parent_events
WHERE
  environment.addons.theme.foreign_install.boolean_value = TRUE
LIMIT
  10

jklukas commented 5 years ago

Seeking feedback on this from data users in #88

mozilla / jsonschema-transpiler

Decide how to handle union types (like [string, int]) in BQ #87

Options for how to express in BQ

Options for naming the fields