snowplow / snowplow-python-analytics-sdk

Python SDK for working with Snowplow enriched events in Spark, AWS Lambda et al.

Adds support for 'Redshift' shredding format #32

Closed: miike closed this 8 months ago

miike commented 7 years ago

This is a WIP, but I'd like to get some comments/thoughts on it based on #31.

This introduces the concept of a 'redshift' (or a more appropriate name) shredding format, which more closely follows how data is shredded into Redshift in terms of table design. This differs from the current implementation, which is more consistent with how data is sunk into Elasticsearch.

The shredding format introduced below makes the following changes:

As a sidenote, the code likely needs some refactoring/reformatting, and the existing docstrings for the methods have not been updated to reflect the new behaviour.
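For concreteness, here is a rough sketch of how the two shapes differ for a single shredded type. All field names are hypothetical, not the exact output of either transform:

# Illustrative only: hypothetical shapes for one enriched event.

# Current (Elasticsearch-style) output: each shredded type becomes a
# prefixed top-level key, so the original schema string is only partially
# recoverable from the key name.
elasticsearch_style = {
    "app_id": "foo",
    "unstruct_event_com_acme_event_1": {"key": "value"},
    "contexts_com_acme_context_1": [{"key": "value"}],
}

# Proposed (Redshift-style) output: schema metadata travels with the data,
# mirroring how shredded tables are keyed in Redshift.
redshift_style = {
    "app_id": "foo",
    "unstruct_event": {
        "schema": {"vendor": "com.acme", "name": "event",
                   "format": "jsonschema", "version": "1-0-0"},
        "data": {"key": "value"},
    },
}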

snowplowcla commented 7 years ago

@miike has signed the Software Grant and Corporate Contributor License Agreement

alexanderdean commented 7 years ago

Thanks @miike!

The 'contexts' and 'unstruct_event' prefixes for the JSON objects probably seem a bit odd in hindsight. If I remember correctly, the original design was partly because contexts could be an array of a given type, while unstruct event is always a singleton. In pseudocode:

+ enriched event
-> contexts(array[A], B, C, array[D])
-> unstruct_event(E)

So prefixing them with contexts or unstruct_event served a couple of purposes:

  1. Cheap (but clunky) way of indicating the source of this entity
  2. Tells you whether you can just access it as event.E or whether it has to be event.A[0] (i.e. an array)

I'm not particularly attached to the 'contexts' and 'unstruct_event' prefixes - I just wanted to share their origins...
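To make point 2 above concrete, a minimal sketch (field names hypothetical) of how the prefix signals the access pattern in the flattened format:

# unstruct_event_* is always a singleton object, while contexts_* is
# always an array, even when only one context of that type is attached.
event = {
    "unstruct_event_com_acme_event_1": {"key": "value"},
    "contexts_com_acme_context_1": [{"key": "value"}],
}

value = event["unstruct_event_com_acme_event_1"]["key"]  # direct access
first = event["contexts_com_acme_context_1"][0]["key"]   # index required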

alexanderdean commented 7 years ago

@chuwy - this looks pretty exciting! Can you do a first pass review on this please?

miike commented 7 years ago

@chuwy @alexanderdean What do you think of adding similar functionality (including data + schema) for the Scala Analytics SDK? Is there a bit of additional information we should include to reference the source (contexts/unstruct)?

alexanderdean commented 7 years ago

Sure - worth creating placeholder tickets to track the same idea in the Scala and .NET Analytics SDKs...

chuwy commented 7 years ago

Hey @miike, I'm very sorry you had to wait so long. The idea looks really great, though I agree with @alexanderdean that the contexts_ prefix serves almost no purpose here. Therefore I'm thinking about something similar to the following structure:

{
  "event": {
    "app_id": "foo",
    "platform": "mob",
    ...
  },
  "unstruct_event": {
      "schema": "iglu:com.acme/event/jsonschema/1-0-0",
      "data": {"key": "value"}
  },
  "contexts": [
    {
      "schema": "iglu:com.acme/context/jsonschema/1-0-0",
      "data": {"key": "value"}
    },
    {
      "schema": "iglu:com.acme/context/jsonschema/1-0-0",
      "data": {"key": "value"}
    }
  ]
}

Here are some highlights:

  1. In your implementation the shredded JSONs carry vendor, name, etc. schema metadata. While it's quite convenient to access fields this way, that's more common for schemas than for the data envelope. People too often confuse our data/schema envelopes, and we don't want to introduce even more confusion. Instead, we should have (and already have in Iglu Scala Core: parseSchemaKey) a function that safely converts an Iglu URI string into a vendor/name/format/version record (a minimal Python analogue is sketched after this list).
  2. I see this format as a generalization of what we have now. In other words, with the following functions, parseSchemaKey and an imaginary schemaKeyToElasticsearchKey, we can derive the current format via EnrichedTsv -> Redshift -> Elasticsearch, as "Redshift" doesn't lose any data, unlike "Elasticsearch". I like it very much.
  3. Instead of Elasticsearch-style keys, I propose to use two separate keys for the shredded types array. We just need a schema, and we'll be able to query events with something like SchemaCriterion (iglu:com.acme/event/jsonschema/2-?-?).
  4. We really, really need to come up with better names than "Redshift" and "Elasticsearch". Better now than later.
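As referenced in item 1, a minimal Python analogue of parseSchemaKey might look like the sketch below. The SchemaKey record, regex, and matches_criterion helper are illustrative assumptions, not the SDK's actual API:

import re
from typing import NamedTuple, Optional

class SchemaKey(NamedTuple):
    vendor: str
    name: str
    format: str
    version: str

SCHEMA_URI = re.compile(
    r"^iglu:"
    r"([a-zA-Z0-9\-_.]+)/"  # vendor, e.g. com.acme
    r"([a-zA-Z0-9\-_]+)/"   # name, e.g. event
    r"([a-zA-Z0-9\-_]+)/"   # format, e.g. jsonschema
    r"(\d+-\d+-\d+)$"       # version, e.g. 1-0-0
)

def parse_schema_key(uri: str) -> Optional[SchemaKey]:
    """Safely split an Iglu URI into its parts; None if it doesn't parse."""
    match = SCHEMA_URI.match(uri)
    return SchemaKey(*match.groups()) if match else None

# Rough analogue of a SchemaCriterion check such as
# iglu:com.acme/event/jsonschema/2-?-? (model 2, any revision/addition).
def matches_criterion(key: Optional[SchemaKey], vendor: str, name: str, model: int) -> bool:
    return (key is not None
            and (key.vendor, key.name) == (vendor, name)
            and key.version.startswith("%d-" % model))

key = parse_schema_key("iglu:com.acme/event/jsonschema/1-0-0")
# SchemaKey(vendor='com.acme', name='event', format='jsonschema', version='1-0-0')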

Overall, I like this idea very much.

alexanderdean commented 7 years ago

Thanks @chuwy for the detailed review.

I agree with pretty much everything.

We really, really need to come up with better names than "Redshift" and "Elasticsearch". Better now than later.

&&

we can derive the current format via EnrichedTsv -> Redshift -> Elasticsearch

The second one sounds odd because, as you say, the Redshift format is not really "Redshift" - it's just a pure-JSON intermediate form. As well as renaming, we should make this intermediate form self-describing and register its schema in Iglu Central.
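For example, making it self-describing would just mean the usual schema/data envelope. The schema name and version below are placeholders, not a registered schema:

self_describing = {
    # Hypothetical Iglu Central registration; real name/version TBD.
    "schema": "iglu:com.snowplowanalytics.snowplow/shredded_event/jsonschema/1-0-0",
    "data": {
        "event": {"app_id": "foo", "platform": "mob"},
        "unstruct_event": {
            "schema": "iglu:com.acme/event/jsonschema/1-0-0",
            "data": {"key": "value"},
        },
        "contexts": [],
    },
}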

better names than ... "Elasticsearch"

I think there are some Elasticsearch-isms in this format (the geopoint?), so I'm not so bothered by that name.

miike commented 7 years ago

Thanks guys - definitely agreed on coming up with some clearer names for this stuff.

I'm not sure about the semi-nested format for contexts. I think it makes sense from a data-structure point of view, but it may make it more difficult/intensive for applications using the analytics SDK to read off what they are interested in (e.g., if only one context is of interest but you need to iterate through 5 additional contexts in the contexts array, that seems like more work). The other reason I'm leaning towards the context-as-a-column model is that it gives a predictable structured format (or is this the responsibility of the downstream consumer?) that can be used to sink into databases like BigQuery.
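To make the trade-off concrete, a rough sketch of the two access patterns (event shapes and key names hypothetical):

# Semi-nested proposal: consumers scan the whole contexts array for the
# schema they care about.
nested_event = {
    "contexts": [
        {"schema": "iglu:com.acme/context/jsonschema/1-0-0",
         "data": {"key": "value"}},
        {"schema": "iglu:com.other/context/jsonschema/1-0-0",
         "data": {"key": "other"}},
    ]
}
wanted = [c["data"] for c in nested_event["contexts"]
          if c["schema"].startswith("iglu:com.acme/context/jsonschema/1-")]

# Context-as-a-column: one predictable key, no iteration, and a stable
# column layout for sinks like BigQuery.
columnar_event = {"contexts_com_acme_context_1": [{"key": "value"}]}
wanted = columnar_event.get("contexts_com_acme_context_1", [])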

alexanderdean commented 7 years ago

I see what you are saying @miike... It's nice to be able to "dot operator all the things".

Anton is off for the next fortnight, so I suspect we will return to this then.

miike commented 7 years ago

@alexanderdean Definitely nice - but I wonder if contexts-as-a-column is too strictly opinionated? That's possibly grounds for having multiple intermediate (self-describing) formats, depending on the use case. See you in a fortnight, Anton!