snowplow / snowplow-python-analytics-sdk

Python SDK for working with Snowplow enriched events in Spark, AWS Lambda et al.

RFC: Should the shredded JSON contain additional schema information? #31


miike commented 7 years ago

The current JSON contexts shredding produces a simplified payload in which the context name carries only the schema model alongside the data, e.g.:

    [
      ("context_com_acme_duplicated_1", [{"value": 1}, {"value": 2}]),
      ("context_com_acme_unduplicated_1", [{"unique": true}])
    ]

https://github.com/snowplow/snowplow-python-analytics-sdk/blob/master/snowplow_analytics_sdk/json_shredder.py#L102
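
For reference, the lossy step looks roughly like the sketch below. This is a simplified illustration, not the SDK source (the real implementation is at the link above); the point is that only the MODEL component of the SchemaVer survives into the column name:

    import re

    # Iglu URI shape: iglu:vendor/name/format/MODEL-REVISION-ADDITION
    SCHEMA_URI = re.compile(
        r"^iglu:([a-zA-Z0-9\-_.]+)/([a-zA-Z0-9\-_]+)/([a-zA-Z0-9\-_]+)"
        r"/([1-9][0-9]*)-(\d+)-(\d+)$"
    )

    def fix_schema(prefix, schema_uri):
        # Turn an Iglu schema URI into a shredded column name.
        # REVISION and ADDITION are dropped, which is the information
        # loss discussed in this issue.
        match = SCHEMA_URI.match(schema_uri)
        if match is None:
            raise ValueError("Invalid Iglu URI: %s" % schema_uri)
        vendor, name, _format, model, _revision, _addition = match.groups()
        snake_vendor = vendor.replace(".", "_").lower()
        snake_name = re.sub(r"([^A-Z_])([A-Z])", r"\1_\2", name).lower()
        return "%s_%s_%s_%s" % (prefix, snake_vendor, snake_name, model)

    # fix_schema("context", "iglu:com.acme/duplicated/jsonschema/1-0-1")
    # -> "context_com_acme_duplicated_1"  ("1-0-1" collapses to "1")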

This process is lossy, and there are circumstances where the REVISION and ADDITION components of the SchemaVer are important, e.g. determining whether data is backwards compatible when running an aggregation, or filtering/dropping on specific schema versions. Should the payload be restructured to include the schema version information (or, more broadly, the schema information available to Redshift)? Thoughts @chuwy?
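
To make the compatibility use case concrete: SchemaVer is MODEL-REVISION-ADDITION, where a MODEL bump is breaking, a REVISION bump may break some consumers, and an ADDITION bump is backwards-compatible. A check like the hypothetical one below (names are illustrative, not SDK API) needs the full version, which the shredded column name no longer carries:

    def is_backwards_compatible(producer_version, consumer_version):
        # Hypothetical SchemaVer check: can data written against
        # producer_version be read by code expecting consumer_version?
        p_model, p_revision, _p_add = (int(x) for x in producer_version.split("-"))
        c_model, c_revision, _c_add = (int(x) for x in consumer_version.split("-"))
        return p_model == c_model and p_revision == c_revision

    # With only "context_com_acme_duplicated_1" to go on, "1-0-1" and
    # "1-1-0" are indistinguishable, so a check like this is impossible today.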

chuwy commented 7 years ago

Hey @miike,

I think the argument about information loss is quite strong. I'd like to preserve REVISION and ADDITION as long as possible.

There's a function in the Scala SDK (not in Python yet) called transformWithInventory, which extracts the set of Iglu keys along with the transformed JSON result. Column names are still the same (version-lossy), but there's a good chance you can use the information about shred types in something like Spark to identify schema-compatibility issues. Do you think that could be a solution for the use cases you mentioned?
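
For readers who haven't seen the Scala version: transformWithInventory returns the transformed event together with the Iglu URIs of the shredded types. A rough Python port might look like the sketch below (the signature and the extract_self_describing_jsons helper are guesses for illustration, not the Scala SDK's exact shape):

    def transform_with_inventory(event):
        # Sketch of a hypothetical Python port of transformWithInventory.
        # Column names stay version-lossy, but the "inventory" of full
        # Iglu URIs travels alongside, so a Spark job can still inspect
        # REVISION and ADDITION.
        flattened = {}     # lossy column name -> list of data, as today
        inventory = set()  # e.g. "iglu:com.acme/duplicated/jsonschema/1-0-1"
        for schema_uri, data in extract_self_describing_jsons(event):  # hypothetical helper
            flattened.setdefault(fix_schema("context", schema_uri), []).append(data)
            inventory.add(schema_uri)
        return flattened, inventory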

miike commented 7 years ago

I think that makes sense, though what I have in mind is a subset of that: the transformation of the input line would yield the Iglu version in the output, something like:

    {
      "app_id": "test",
      "context_com_acme_duplicated_1": {
        "schema_version": "1-0-1",
        "data": {
          "value": 1
        }
      },
      ...
    }
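
A shredder emitting that shape could keep the version it already parses. A minimal sketch of the proposal (not an agreed design), reusing SCHEMA_URI and fix_schema from the sketch earlier in the thread:

    def shred_with_version(prefix, schema_uri, data):
        # Keep the full SchemaVer alongside the data instead of
        # collapsing it into the column name.
        match = SCHEMA_URI.match(schema_uri)
        if match is None:
            raise ValueError("Invalid Iglu URI: %s" % schema_uri)
        _vendor, _name, _format, model, revision, addition = match.groups()
        return fix_schema(prefix, schema_uri), {
            "schema_version": "%s-%s-%s" % (model, revision, addition),
            "data": data,
        }

    # shred_with_version("context", "iglu:com.acme/duplicated/jsonschema/1-0-1",
    #                    {"value": 1})
    # -> ("context_com_acme_duplicated_1",
    #     {"schema_version": "1-0-1", "data": {"value": 1}})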