mozilla / jsonschema-transpiler

Compile JSON Schema into Avro and BigQuery schemas
Mozilla Public License 2.0

Support tuple validation in jsonschema arrays #38

Closed acmiyaguchi closed 5 years ago

acmiyaguchi commented 5 years ago

In mps: https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/63028f589c9819f03c0c71d642517f0f22953cb5/validation/pocket/fire-tv-events.1.sample.pass.json#L12-L17

is validated by this:

https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/63028f589c9819f03c0c71d642517f0f22953cb5/schemas/pocket/fire-tv-events/fire-tv-events.1.schema.json#L10-L50

This currently generates the following error:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error("invalid type: sequence, expected struct Tag", line: 0, column: 0)', src/libcore/result.rs:1009:5

See: https://json-schema.org/understanding-json-schema/reference/array.html#tuple-validation
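
For reference, tuple validation uses an array of schemas under `items`, one per position. A minimal illustrative example (not the actual fire-tv-events schema):

```json
{
  "type": "array",
  "items": [
    {"type": "integer"},
    {"type": "string"},
    {"type": "string"}
  ]
}
```

Each position in the array is validated against the schema at the same index, which is why a single `Tag` struct can't represent it.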

acmiyaguchi commented 5 years ago

Fixed in #37; tuples will almost always be cast to JSON.

acmiyaguchi commented 5 years ago

While this is no longer immediately broken, it should be supported properly, either as an anonymous struct (_0, _1, _2, etc., based on the min/max number of fields) or as a regular struct with a single list item. The latter will require coordination with the decoding piece in mozilla/gcp-ingestion.

relud commented 5 years ago

an anonymous struct (_0, _1, _2, etc based on the min/max number of fields)

If we do this, we should follow the field naming that BigQuery uses for anonymous structs (_f0, _f1, _f2, etc.).

or a regular struct with a single list item

As discussed in the team meeting today, we could use a oneOf condition to specify a struct alongside tuples, providing both an upgrade path and the necessary info to coerce tuples to structs.
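
As a rough sketch of that idea (a hypothetical schema; field names are invented for illustration), the tuple form could be declared alongside an equivalent object form, so consumers have the information needed to coerce tuples into structs:

```json
{
  "oneOf": [
    {
      "type": "array",
      "items": [
        {"type": "integer"},
        {"type": "string"}
      ]
    },
    {
      "type": "object",
      "properties": {
        "timestamp": {"type": "integer"},
        "category": {"type": "string"}
      }
    }
  ]
}
```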

acmiyaguchi commented 5 years ago

As discussed in the team meeting today, we could use a oneOf condition to specify a struct alongside tuples, providing both an upgrade path and the necessary info to coerce tuples to structs.

This is a nice solution. It seems like there should be some notion of precedence among the different types, such as Union[Tuple, Object] or Union[Tuple, String]. This should be handled by ast::Union::collapse:

https://github.com/mozilla/jsonschema-transpiler/blob/536c6fd9fb9095b91bb92fca17bd80927eceb873/src/ast.rs#L91-L98

I'm a bit wary of adding too much custom logic into the encoder, but this seems like it strikes a good balance.
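
A toy sketch of what such a precedence might look like (a Python stand-in for the Rust `ast::Union::collapse` logic; the names and the ordering are assumptions for illustration, not the transpiler's actual behavior):

```python
# Toy model of collapsing a union of JSON Schema types. The idea: when a
# tuple appears alongside a more general type, the general type wins, so
# Union[Tuple, Object] coerces tuples into the object representation and
# Union[Tuple, String] falls back to a JSON-encoded string.
# This precedence order is an assumption, chosen only to illustrate.
PRECEDENCE = {"tuple": 0, "object": 1, "string": 2}


def collapse(types):
    """Collapse a list of union members to the highest-precedence type."""
    return max(types, key=lambda t: PRECEDENCE[t])


print(collapse(["tuple", "object"]))  # object
print(collapse(["tuple", "string"]))  # string
```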

jklukas commented 5 years ago

if we do this we should follow the field naming that BigQuery uses for anonymous structs (_f0, _f1, _f2, etc)

If we do this, we could still support nice naming for event-type fields at the view layer. So for the event ping, the telemetry.events view would replace the _f0, etc. names with more descriptive variants.

jklukas commented 5 years ago

I'm working on a POC of what one day of event data would look like under this scheme. I'm going to write up a short proposal to share the interface with data scientists and gather feedback.

Here's the query to create what the _stable table will look like:

CREATE TEMP FUNCTION
  udf_js_json_extract_events (input STRING)
  RETURNS ARRAY<STRUCT< f0_ INT64,
  f1_ STRING,
  f2_ STRING,
  f3_ STRING,
  f4_ STRING,
  f5_ ARRAY<STRUCT<key STRING,
  value STRING>> >>
  LANGUAGE js AS """
    if (input == null) {
      return null;
    }
    var parsed = JSON.parse(input);
    var result = [];
    for (var event of parsed) {
      var map_values = [];
      for (var key in event[5]) {
        map_values.push({"key": key, "value": event[5][key]});
      }
      var structured = {
        "f0_": event[0],
        "f1_": event[1],
        "f2_": event[2],
        "f3_": event[3],
        "f4_": event[4],
        "f5_": map_values
      };
      result.push(structured);
    }
    return result;
""";
CREATE OR REPLACE TABLE
  `moz-fx-data-shared-prod.analysis.klukas_event_raw`
PARTITION BY
  DATE(submission_timestamp)
CLUSTER BY
  sample_id AS
SELECT
  * REPLACE((
    SELECT
      AS STRUCT payload.*,
      STRUCT( udf_js_json_extract_events(JSON_EXTRACT(additional_properties,
            '$.payload.events.parent')) AS parent,
        udf_js_json_extract_events(JSON_EXTRACT(additional_properties,
            '$.payload.events.content')) AS content,
        udf_js_json_extract_events(JSON_EXTRACT(additional_properties,
            '$.payload.events.extension')) AS extension,
        udf_js_json_extract_events(JSON_EXTRACT(additional_properties,
            '$.payload.events.gpu')) AS gpu,
        udf_js_json_extract_events(JSON_EXTRACT(additional_properties,
            '$.payload.events.dynamic')) AS dynamic ) AS events) AS payload)
FROM
  `moz-fx-data-shared-prod.telemetry_stable.event_v4`
WHERE
  DATE(submission_timestamp) = "2019-08-26"

jklukas commented 5 years ago

On top of the above table (which gives names f0_, etc. to fields), here is a view that gives logical names to the fields and additionally parses the timestamp value into a BQ TIMESTAMP type:

SELECT
  * REPLACE((
    SELECT
      AS STRUCT payload.* REPLACE ( STRUCT(ARRAY(
          SELECT
            AS STRUCT f0_ AS event_timestamp,
            TIMESTAMP_MILLIS(payload.process_start_timestamp + f0_) AS event_timestamp_parsed,
            f1_ AS event_category,
            f2_ AS event_method,
            f3_ AS event_object,
            f4_ AS event_string_value,
            f5_ AS event_map_values
          FROM
            UNNEST(payload.events.parent)) AS parent) AS events)) AS payload)
FROM
  `moz-fx-data-shared-prod.analysis.klukas_event_raw`

Users would then likely use this view with an UNNEST like:

SELECT
  document_id,
  parent_events.*
FROM
  `moz-fx-data-shared-prod.analysis.klukas_event`
CROSS JOIN
  UNNEST(payload.events.parent) AS parent_events
LIMIT
  2
Results (first 2 rows, flattened here for readability):

Row 1: document_id 5f134697-3831-44b3-9416-9c4910b84fbb, event_timestamp 266415,
  event_timestamp_parsed 2019-08-25 20:21:26.415 UTC,
  event_category uptake.remotecontent.result, event_method uptake,
  event_object remotesettings, event_string_value network_error,
  event_map_values: source=settings-changes-monitoring, trigger=timer

Row 2: document_id b311fd60-bdc9-402d-861a-51f1bed7be0a, event_timestamp 180960,
  event_timestamp_parsed 2019-08-26 01:26:00.960 UTC,
  event_category uptake.remotecontent.result, event_method uptake,
  event_object remotesettings, event_string_value success,
  event_map_values: source=settings-changes-monitoring, trigger=timer, age=260558
jklukas commented 5 years ago

Seeking feedback on this from data users in #88