@miike has signed the Software Grant and Corporate Contributor License Agreement
Thanks @miike!
The 'contexts' and 'unstruct_event' prefixes for the JSON objects probably seem a bit odd in hindsight. If I remember correctly, the original design was partly because contexts could be an array of a given type, while unstruct event is always a singleton. In pseudocode:
```
enriched event
-> contexts(array[A], B, C, array[D])
-> unstruct_event(E)
```
So prefixing them with `contexts` or `unstruct_event` served a couple of purposes - in particular, it signalled to the downstream consumer whether a given entity can be accessed as `event.E` or whether it has to be `event.A[0]` (i.e. an array).
I'm not particularly attached to the 'contexts' and 'unstruct_event' prefixes - I just wanted to share their origins...
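To make that access pattern concrete, here's a toy sketch of the flattened shape the current output takes - the vendor and entity names in the keys are invented for illustration:

```python
# Rough sketch of the current flattened output; key names are illustrative.
event = {
    "app_id": "foo",
    # unstruct_event_* always holds a single object...
    "unstruct_event_com_acme_event_1": {"key": "value"},
    # ...while contexts_* is always an array, even with one element.
    "contexts_com_acme_context_1": [{"key": "value"}],
}

assert event["unstruct_event_com_acme_event_1"]["key"] == "value"
assert event["contexts_com_acme_context_1"][0]["key"] == "value"
```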
@chuwy - this looks pretty exciting! Can you do a first pass review on this please?
@chuwy @alexanderdean What do you think of adding similar functionality (including data + schema) for the Scala Analytics SDK? Is there a bit of additional information we should include to reference the source (contexts/unstruct)?
Sure - worth creating placeholder tickets to track the same idea in the Scala and .NET Analytics SDKs...
Hey @miike, I'm very sorry you had to wait so long. The idea looks really great, though I agree with @alexanderdean that the `contexts_` prefix serves almost no purpose here. Therefore I'm thinking about something similar to the following structure:
```json
{
  "event": {
    "app_id": "foo",
    "platform": "mob",
    ...
  },
  "unstruct_event": {
    "schema": "iglu:com.acme/event/jsonschema/1-0-0",
    "data": {"key": "value"}
  },
  "contexts": [
    {
      "schema": "iglu:com.acme/context/jsonschema/1-0-0",
      "data": {"key": "value"}
    },
    {
      "schema": "iglu:com.acme/context/jsonschema/1-0-0",
      "data": {"key": "value"}
    }
  ]
}
```
Here are some highlights:

- The proposed structure doesn't include `vendor`, `name`, etc. schema metadata as separate fields. While it's quite convenient to access fields this way, that's more common for schemas, not for the data envelope. People too often confuse our data/schema envelopes, and we don't want to introduce even more confusion. Instead we should have (and already have in Iglu Scala Core - `parseSchemaKey`) a function that safely converts an Iglu URI string into a vendor/name/format/version record (see the sketch after this list).
- With `parseSchemaKey` and an imaginary `schemaKeyToElasticsearchKey` we can derive the current format via `EnrichedTsv` -> `Redshift` -> `Elasticsearch`, as "Redshift" doesn't lose any data, unlike "Elasticsearch". I like it very much.
- Overall, I like this idea very much.
Thanks @chuwy for the detailed review.
I agree with pretty much everything.
We really really need to come up with better names than "Redshift" and "Elasticsearch". Better now than later.
&&
we can derive current format via
EnrichedTsv
->Redshift
->Elasticsearch
The second one sounds odd because, as you say, the Redshift format is not really "Redshift" - it's just a pure-JSON intermediate form. As well as renaming, we should make this intermediate form self-describing and register its schema in Iglu Central.
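To sketch what "self-describing" would mean for the intermediate form - it's the usual schema/data envelope, where the schema URI below is invented for illustration (nothing is registered in Iglu Central yet):

```python
# Hypothetical self-describing wrapper around the intermediate form;
# the schema URI is illustrative only, not a registered Iglu schema.
self_describing_event = {
    "schema": "iglu:com.snowplowanalytics.snowplow/enriched_event_json/jsonschema/1-0-0",
    "data": {
        "event": {"app_id": "foo", "platform": "mob"},
        "unstruct_event": {
            "schema": "iglu:com.acme/event/jsonschema/1-0-0",
            "data": {"key": "value"},
        },
        "contexts": [],
    },
}
```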
> better names than ... "Elasticsearch"
I think there are some Elasticsearch-isms in this format (the geopoint?), so I'm not so bothered by that name.
Thanks guys - definitely agreed on coming up with some clearer names for this stuff.
I'm not sure about the semi-nested format for `contexts` - I think it definitely makes sense from a data-structure point of view, but it may make it more difficult/intensive for applications using the Analytics SDK to read off what they are interested in (e.g. if only one context is of interest, you still need to iterate through five additional contexts in the `contexts` array, which seems like more work; see the comparison sketch below). The other reason I'm leaning towards the context-as-a-column model is that it gives a predictable, structured format (or is this the responsibility of the downstream consumer?) that can be used to sink into databases like BigQuery.
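To illustrate the trade-off, here's a toy comparison of the two access patterns (all schema URIs and key names below are invented):

```python
# Toy event in the proposed semi-nested format.
event = {
    "contexts": [
        {"schema": "iglu:com.acme/context/jsonschema/1-0-0", "data": {"key": "value"}},
        {"schema": "iglu:com.other/banner/jsonschema/1-0-0", "data": {"id": 1}},
    ]
}

# Semi-nested: consumers scan the whole array for the entity they care about.
wanted = [c["data"] for c in event["contexts"]
          if c["schema"].startswith("iglu:com.acme/context/")]

# Context-as-a-column: the same lookup is a single, predictable key access,
# which also maps naturally onto a columnar store like BigQuery.
flat_event = {"contexts_com_acme_context_1": [{"key": "value"}]}
wanted = flat_event["contexts_com_acme_context_1"]
```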
I see what you are saying @miike... It's nice to be able to "dot operator all the things".
Anton is off for the next fortnight, so I suspect we will return to this then.
@alexanderdean Definitely nice - but I wonder if the contexts-as-a-column approach is too strictly opinionated? Possibly grounds for having multiple intermediate (self-describing) formats depending on the use case. See you in a fortnight, Anton!
This is a WIP but I'd like to get some comments/thoughts on it based on #31.
This introduces the concept of 'redshift' (or a more appropriate name) shredding, which more closely follows what is shredded into Redshift in terms of table design. This differs from the current implementation, which is more consistent with how data is sunk into Elasticsearch.
The shredding format introduced below makes the following changes:

- adds a `shred_format` argument, with `shred_format='elasticsearch'` as the default (preserving the current behaviour)

As a sidenote, the code likely needs some refactoring/reformatting, and the existing docstrings for the methods have not been updated to reflect the new behaviour.
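For context, here's a hypothetical usage sketch of the change from the caller's side - the import path matches the Python Analytics SDK's existing `transform` entry point, while the `shred_format` keyword is what this PR proposes to add:

```python
# Sketch only: shred_format is the proposed addition, not a released API.
from snowplow_analytics_sdk.event_transformer import transform

with open("enriched_events.tsv") as f:  # assumed enriched-event TSV input
    for line in f:
        flat = transform(line)  # current, Elasticsearch-style flattened output
        nested = transform(line, shred_format='redshift')  # proposed table-like format
```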