snowplow / snowplow-badrows

Apache License 2.0
7 stars 2 forks source link

Dependency on analytics SDK? #53

Open alexanderdean opened 3 years ago

alexanderdean commented 3 years ago

The idea of a dependency on the analytics SDK seems odd to me.

I think of bad rows as a core part of the pipeline, versus the analytics SDK as a downstream consumer of the upstream core pipeline outputs.

What am I missing?

dilyand commented 3 years ago

I think the only reason why we depend on the analytics sdk here is that the sdk is where we define a class for the canonical Snowplow event.

This is quite a chunky class and it makes sense that we should only have it defined in one place for maintainability. I believe it was first introduced in the sdk because the need for it there is quite obvious; and then we kept reusing it elsewhere.

Maybe it would make sense to have it as mini-library that can be depended on instead of the full sdk.

benjben commented 3 years ago

Some bad rows (e.g. loader bad rows) contain the enriched event (e.g. here and soon here).

This enriched event is defined in the analytics SDK here.

We plan to emit this case class also in enrich, so that there is a single source of truth to define an enriched event.

Analytics SDK can be seen as the definition of an enriched events + JSON/TSV encoders/decoders, and we need JSON ones in bad rows library.

benjben commented 3 years ago

Hehe Dilyan was quicker

alexanderdean commented 3 years ago

I agree with @dilyand - the enriched event should be its own mini-library, and that should be pulled into bad-rows and the analytics-sdk.

The analytics sdk should be a downstream consumer of entities produced by upstream.

To put it another way: the fact that the enriched event was first defined in the analytics SDK, not in an upstream library, is a bug not a feature.

dilyand commented 3 years ago

Thinking about it more, in light of what Ben said, the analytics sdk is indeed just that mini library: it has the definition of the canonical event, of unstruct event and other bits and pieces to do with them, and then just codecs to derive those classes from data and to produce data from the class instances.

In fact, there's no analytics at all in the sdk -- no aggregation, no statistics, no pivoting... Maybe we named it that way because we wanted to add those features in the future? But right now it just looks like a misnomer.

alexanderdean commented 3 years ago

The analytics SDKs are small libraries designed for downstream consumers of the Snowplow technology:

https://docs.snowplowanalytics.com/docs/modeling-your-data/analytics-sdk/

It makes sense for the Scala analytics SDK to be a very lightweight shim over core libraries written in Scala that the upstream pipeline uses.

To me it doesn't make sense at all for an 'analytics SDK' to be embedded in the core of Snowplow, any more than I'd expect a tracker to be embedded in the core.

My proposal is to move the core event technology up into an upstream library and out of this analytics SDK, then add it in as a dependency.

If we don't do this, we will end up with a weird circular reference when we implement https://github.com/snowplow/snowplow-scala-analytics-sdk/issues/114

dilyand commented 3 years ago

move the core event technology up into an upstream library and out of this analytics SDK

Absolutely. I was just pointing out that when we move the core event tech out, there would be nothing left in this SDK. Unless it be the new functionality that would make it easy to work with bad rows. But then maybe that functionality should be part of the bad rows library and we don't need this SDK at all.

alexanderdean commented 3 years ago

This is what I would do:

The thing I would keep in the Scala Analytics SDK are the lossy/opionated formats for shredding/toJsoning our event format. At the moment we only have one flavour of that transformation code (e.g. in this file) but we will likely add more as we add more downstream targets (e.g. loaders, relays), and enrich does not need to know about any of those formats (because they are consumer-specific), so they don't need to live in snowplow-enriched-event