Closed acmiyaguchi closed 5 years ago
Fixed in #37, it will almost always be casted into JSON.
While this is no longer immediately broken, it should be supported properly as either an anonymous struct (_0
, _1
, _2
, etc based on the min/max number of fields), or a regular struct with a single list item. The latter will require coordination from the decoding piece in mozilla/gcp-ingestion.
an anonymous struct (_0, _1, _2, etc based on the min/max number of fields)
if we do this we should follow the field naming that BigQuery uses for anonymous structs (_f0
, _f1
, _f2
, etc)
or a regular struct with a single list item
As discussed in the team meeting today, we could use a oneOf
condition to specify a struct alongside tuples, providing both an upgrade path and the necessary info to coerce tuples to structs.
As discussed in the team meeting today, we could use a
oneOf
condition to specify a struct alongside tuples, providing both an upgrade path and the necessary info to coerce tuples to structs.
This is a nice solution. It seems like there should be some idea of precedence among the different types, such as the Union[Tuple, Object]
or Union[Tuple, String]
. This should be handled by ast::Union::collapse
:
I'm a bit wary of adding too much custom logic into the encoder, but this seems like it strikes a good balance.
if we do this we should follow the field naming that BigQuery uses for anonymous structs (_f0, _f1, _f2, etc)
If we do this, we could still support nice naming for event-type fields at the view layer. So for the event ping, the telemetry.events
view would replace the _f0
, etc. names with more descriptive variants.
I'm working on a POC of what one day of event
would look like with this scheme. I'm going to write up a short proposal to share the interface with data scientists and gather any feedback.
Here's the query to create what the _stable table will look like:
CREATE TEMP FUNCTION
udf_js_json_extract_events (input STRING)
RETURNS ARRAY<STRUCT< f0_ INT64,
f1_ STRING,
f2_ STRING,
f3_ STRING,
f4_ STRING,
f5_ ARRAY<STRUCT<key STRING,
value STRING>> >>
LANGUAGE js AS """
if (input == null) {
return null;
}
var parsed = JSON.parse(input);
var result = [];
for (var event of parsed) {
map_values = []
for (var key in event[5]) {
map_values.push({"key": key, "value": event[5][key]})
}
var structured = {
"f0_": event[0],
"f1_": event[1],
"f2_": event[2],
"f3_": event[3],
"f4_": event[4],
"f5_": map_values
}
result.push(structured)
}
return result;
""";
CREATE or replace TABLE
`moz-fx-data-shared-prod.analysis.klukas_event_raw`
PARTITION BY
DATE(submission_timestamp)
CLUSTER BY
sample_id AS
SELECT
* REPLACE((
SELECT
AS STRUCT payload.*,
STRUCT( udf_js_json_extract_events(json_EXTRACT(additional_properties,
'$.payload.events.parent')) AS parent,
udf_js_json_extract_events(json_EXTRACT(additional_properties,
'$.payload.events.content')) AS content,
udf_js_json_extract_events(json_EXTRACT(additional_properties,
'$.payload.events.extension')) AS extension,
udf_js_json_extract_events(json_EXTRACT(additional_properties,
'$.payload.events.gpu')) AS gpu,
udf_js_json_extract_events(json_EXTRACT(additional_properties,
'$.payload.events.dynamic')) AS dynamic ) AS events) AS payload)
FROM
`moz-fx-data-shared-prod.telemetry_stable.event_v4`
WHERE
DATE(submission_timestamp) = "2019-08-26"
On top of the above table (which gives names f0_
, etc. to fields), here is a view that gives logical names to the fields and additionally parses the timestamp
value into a BQ TIMESTAMP type:
SELECT
* REPLACE((
SELECT
AS STRUCT payload.* REPLACE ( STRUCT(ARRAY(
SELECT
AS STRUCT f0_ AS event_timestamp,
TIMESTAMP_MILLIS(payload.process_start_timestamp + f0_) AS event_timestamp_parsed,
f1_ AS event_category,
f2_ AS event_method,
f3_ AS event_object,
f4_ AS event_string_value,
f5_ event_map_values
FROM
UNNEST(payload.events.parent)) AS parent) AS events)) AS payload)
FROM
`moz-fx-data-shared-prod.analysis.klukas_event_raw`
Users would then likely use this view with an UNNEST like:
SELECT
document_id,
parent_events.*
FROM
`moz-fx-data-shared-prod.analysis.klukas_event`
CROSS JOIN
UNNEST(payload.events.parent) AS parent_events
LIMIT
2
Row | document_id | event_timestamp | event_timestamp_parsed | event_category | event_method | event_object | event_string_value | event_map_values.key | event_map_values.value | |
---|---|---|---|---|---|---|---|---|---|---|
1 | 5f134697-3831-44b3-9416-9c4910b84fbb | 266415 | 2019-08-25 20:21:26.415 UTC | uptake.remotecontent.result | uptake | remotesettings | network_error | source | settings-changes-monitoring | |
trigger | timer | |||||||||
2 | b311fd60-bdc9-402d-861a-51f1bed7be0a | 180960 | 2019-08-26 01:26:00.960 UTC | uptake.remotecontent.result | uptake | remotesettings | success | source | settings-changes-monitoring | |
trigger | timer | |||||||||
age | 260558 |
Seeking feedback on this from data users in #88
In mps: https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/63028f589c9819f03c0c71d642517f0f22953cb5/validation/pocket/fire-tv-events.1.sample.pass.json#L12-L17
is validated by this:
https://github.com/mozilla-services/mozilla-pipeline-schemas/blob/63028f589c9819f03c0c71d642517f0f22953cb5/schemas/pocket/fire-tv-events/fire-tv-events.1.schema.json#L10-L50
This currently generates the following error:
thread 'main' panicked at 'called
Result::unwrap()on an
Errvalue: Error("invalid type: sequence, expected struct Tag", line: 0, column: 0)', src/libcore/result.rs:1009:5
See: https://json-schema.org/understanding-json-schema/reference/array.html#tuple-validation