mozilla / gcp-ingestion

Documentation and implementation of telemetry ingestion on Google Cloud Platform
https://mozilla.github.io/gcp-ingestion/
Mozilla Public License 2.0

Consider developing a schema for the metadata added by the decoder. #499

mreid-moz closed this issue 5 years ago

mreid-moz commented 5 years ago

Breaking out part of the discussion in #477:

Consider developing a JSON Schema for the metadata added by the pipeline. We could version it based on the ingestion code and incorporate it into the schema transpiler.

acmiyaguchi commented 5 years ago

One idea to start is to programmatically insert the metadata section into the Avro schema as a `Map<String, String>` here:

https://github.com/mozilla/gcp-ingestion/blob/dc7429954b8f2bd5d0a87b2867105b0e2c2001a6/ingestion-beam/src/main/java/com/mozilla/telemetry/avro/PubsubMessageRecordFormatter.java#L27-L32

A downside of this approach is that the BigQuery schema generated in mps (mozilla-pipeline-schemas) will differ by this new section. Additionally, once the map is converted into a repeated key-value struct in BigQuery, it will likely be tricky to convert it into a struct during a table migration without a UDF.
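To make the map idea concrete, here is a minimal sketch in Python rather than the Java used by `PubsubMessageRecordFormatter`. It treats the Avro schema as a plain dict, appends a string-to-string map field, and shows the repeated key-value shape that such a map takes in BigQuery. The helper names, the `payload` record, the `client_id` field, and the column modes are all hypothetical and are not taken from the ingestion code.

```python
import copy

# An Avro map always has string keys; "values" declares the value type.
METADATA_FIELD = {
    "name": "metadata",
    "type": {"type": "map", "values": "string"},
    "default": {},
}


def with_metadata_map(record_schema):
    """Return a copy of an Avro record schema with a metadata map field appended."""
    if record_schema.get("type") != "record":
        raise ValueError("expected an Avro record schema")
    result = copy.deepcopy(record_schema)
    existing = {field["name"] for field in result["fields"]}
    if "metadata" not in existing:
        result["fields"].append(copy.deepcopy(METADATA_FIELD))
    return result


def map_as_bigquery_field(name):
    """Describe a string map as a repeated key-value struct (modes are assumed)."""
    return {
        "name": name,
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [
            {"name": "key", "type": "STRING", "mode": "REQUIRED"},
            {"name": "value", "type": "STRING", "mode": "NULLABLE"},
        ],
    }


# Hypothetical payload schema, used only for illustration.
payload = {
    "type": "record",
    "name": "payload",
    "fields": [{"name": "client_id", "type": "string"}],
}
extended = with_metadata_map(payload)
bq_metadata = map_as_bigquery_field("metadata")
```

The repeated key-value column is what makes a later migration awkward: turning it into a struct with one named field per key needs a lookup over the array for every key, which is where a UDF would come in.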

jklukas commented 5 years ago

> One idea to start is to programmatically insert the metadata section into the Avro schema as a `Map<String, String>`

Given how the conversation has evolved over the past few days, I'm pretty confident that we have a way forward: define the nested metadata structure in a JSON schema and merge it in when generating the Avro and BigQuery schemas. So I think we can safely skip the intermediate step of adding a metadata map to the Avro schema.
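As a rough illustration of that merge step, the sketch below attaches a nested metadata JSON schema to a payload schema before any Avro or BigQuery schema is generated from it. This is an assumption about the general shape, not the code that was eventually added; `merge_metadata`, the `geo` and `header` sections, and their fields are hypothetical.

```python
import copy

# Hypothetical nested metadata schema; the real sections and fields may differ.
METADATA_SCHEMA = {
    "type": "object",
    "properties": {
        "geo": {
            "type": "object",
            "properties": {"country": {"type": "string"}},
        },
        "header": {
            "type": "object",
            "properties": {"user_agent": {"type": "string"}},
        },
    },
}


def merge_metadata(payload_schema, metadata_schema=METADATA_SCHEMA):
    """Return a copy of payload_schema with a nested "metadata" property added."""
    merged = copy.deepcopy(payload_schema)
    properties = merged.setdefault("properties", {})
    if "metadata" in properties:
        raise ValueError("payload schema already defines 'metadata'")
    properties["metadata"] = copy.deepcopy(metadata_schema)
    return merged


# Hypothetical payload schema, used only for illustration.
payload_schema = {
    "type": "object",
    "properties": {"client_id": {"type": "string"}},
}
merged_schema = merge_metadata(payload_schema)
```

Because the metadata stays a nested object with named properties, a schema generator can map it to a struct with named fields instead of a repeated key-value list.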

jklukas commented 5 years ago

Added in https://github.com/mozilla-services/mozilla-pipeline-schemas/pull/312