snowplow / snowplow-scala-analytics-sdk

Scala SDK for working with Snowplow enriched events in Spark, AWS Lambda, Flink et al.
https://snowplow.github.io/snowplow-scala-analytics-sdk/

Add next version of run manifest #45

Closed chuwy closed 5 years ago

chuwy commented 6 years ago

Right now I have a fairly clear vision of how the run manifest should look and what information it should contain. Instead of improving the current DynamoDB implementation, with the risk that someone ends up with a corrupted manifest, I propose to put this new implementation in its own namespace and deprecate the previous one.

Current implementation's issues

  1. Tied to DynamoDB
  2. Doesn't have state (https://github.com/snowplow/snowplow-scala-analytics-sdk/issues/37) or any other additional information (strawberry uses a "compatible" implementation, but still its own)
  3. Very slow, as we always need to download the whole table and fold it
  4. All of the above means it cannot be transactional, which is very important for our use case.

Proposal

There's nothing DB-specific in the functions below; they can simply be expressed as an interface with different implementations: JDBC, DynamoDB, a JSON file (e.g. for backup or migration), an in-memory data structure, etc. I think this is an excellent compromise between a general and a DB-specific implementation.

  1. Each change is a new record/row. Information must not be modified in-place (we currently do this for strawberry)
  2. Each record has: id, app, state, timestamp, analytics-sdk-version, data (optional)
  3. app looks like strawberry-transformer-0.1.0: name and version
  4. state can be NEW, PROCESSING, or PROCESSED (#37), indicating whether this is loading or processing
  5. The triple (id, app, state) should have an index, so it can be queried/sliced efficiently
  6. data can be arbitrary, but "expressible" via a JSON object/HashMap/case class. For example, for strawberry: (s3://data/runid-01, transformer-0.1.0, PROCESSED) - {"shredded_types": "LIST-OF-NEW-SHREDDED-TYPES"}
  7. id must uniquely identify a subset of rows (events), whether that is an S3 folder, a window, a glob file pattern, or anything else (right now it is the S3 path without the bucket, which was a terrible mistake)
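The proposal above could be sketched in Scala roughly as follows. All names here (`Record`, `Manifest`, `InMemoryManifest`, field names) are illustrative assumptions, not the actual SDK API; `data` is modeled as a simple `Map` standing in for an arbitrary JSON object:

```scala
import java.time.Instant

// State ADT (proposal item 4); assumed names, matching NEW/PROCESSING/PROCESSED.
sealed trait State
object State {
  case object New        extends State
  case object Processing extends State
  case object Processed  extends State
}

// Append-only manifest record (proposal items 1-2). Fields follow the list above.
final case class Record(
  id: String,                       // uniquely identifies a subset of events (item 7)
  app: String,                      // name and version, e.g. "strawberry-transformer-0.1.0"
  state: State,
  timestamp: Instant,
  sdkVersion: String,
  data: Option[Map[String, String]] // arbitrary JSON-like payload (item 6)
)

// Storage-agnostic interface; implementations could be JDBC, DynamoDB,
// a JSON file, or an in-memory structure.
trait Manifest {
  // Appends a new record; existing rows are never modified in-place (item 1).
  def add(record: Record): Unit
  // Queries by the (id, app, state) index (item 5).
  def query(id: String, app: String, state: State): List[Record]
}

// Trivial in-memory implementation, e.g. for tests or migration tooling.
final class InMemoryManifest extends Manifest {
  private var records: List[Record] = Nil
  def add(record: Record): Unit =
    records = record :: records
  def query(id: String, app: String, state: State): List[Record] =
    records.filter(r => r.id == id && r.app == app && r.state == state)
}
```

A real DynamoDB or JDBC implementation would back `query` with the (id, app, state) index rather than a full scan, which is what makes this design fast where the current one is slow.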

Once this is implemented in the Analytics SDK, we can use it in strawberry to replace its homegrown (but still compatible) implementation.

chuwy commented 5 years ago

We have a separate project for it now https://github.com/snowplow-incubator/snowplow-processing-manifest/