I now have a fairly clear vision of what the run manifest should look like and what information it should contain. Instead of improving the current DynamoDB implementation and risking that someone ends up with a corrupted manifest, I propose to add this new implementation in its own namespace and deprecate the previous one.
Current implementation's issues
Very slow, as we always need to download the whole table and fold it
All of the above problems mean that it cannot be transactional, which is very important for our use case.
Proposal
There is nothing DB-specific in the functions below, so they can be expressed as a single interface with different implementations: JDBC, DynamoDB, a JSON file (for backup or migration, for example), an in-memory data structure, etc. I think this would be an excellent compromise between a general and a DB-specific implementation.
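As a rough illustration of the "one interface, several backends" idea, here is a minimal Python sketch. Everything in it is an assumption made for the example: the class names (`RunManifest`, `InMemoryManifest`, `JsonFileManifest`), the method names (`add`, `records`) and the `migrate` helper are hypothetical and not part of the Analytics SDK. Records are plain dicts here to keep the focus on the abstraction.

```python
import json
import os
from abc import ABC, abstractmethod
from typing import Dict, List


class RunManifest(ABC):
    """Hypothetical storage-agnostic interface; method names are assumptions."""

    @abstractmethod
    def add(self, record: Dict) -> None:
        """Append one record; existing records are never modified."""

    @abstractmethod
    def records(self, run_id: str) -> List[Dict]:
        """Return all records for a run id, in insertion order."""


class InMemoryManifest(RunManifest):
    """Keeps records in a list, e.g. for tests."""

    def __init__(self) -> None:
        self._rows: List[Dict] = []

    def add(self, record: Dict) -> None:
        # Store a copy so callers cannot mutate stored rows afterwards.
        self._rows.append(dict(record))

    def records(self, run_id: str) -> List[Dict]:
        return [dict(r) for r in self._rows if r["id"] == run_id]


class JsonFileManifest(RunManifest):
    """Stores one JSON object per line, e.g. for backup or migration."""

    def __init__(self, path: str) -> None:
        self._path = path

    def add(self, record: Dict) -> None:
        # Append mode: earlier lines are left untouched.
        with open(self._path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")

    def records(self, run_id: str) -> List[Dict]:
        if not os.path.exists(self._path):
            return []
        with open(self._path, encoding="utf-8") as f:
            rows = [json.loads(line) for line in f if line.strip()]
        return [r for r in rows if r["id"] == run_id]


def migrate(source: RunManifest, target: RunManifest, run_id: str) -> int:
    """Copy one run's records between any two implementations."""
    rows = source.records(run_id)
    for row in rows:
        target.add(row)
    return len(rows)
```

Because `migrate` only depends on the abstract interface, the same function could in principle copy records from a DynamoDB-backed implementation to a JSON file or a JDBC store, provided those implementations exist.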
Each change is a new record (row). Information must not be modified in place (we currently do this for strawberry)
Each change/row has: id, app, state, timestamp, analytics-sdk-version, data (optional)
app looks like strawberry-transformer-0.1.0: name and version
state can be: NEW, PROCESSING, PROCESSED (#37), regardless of whether the app is loading or processing.
The triple (id, app, state) should have an index, so that it is easy to query/slice efficiently
data can be arbitrary, but must be expressible as a JSON object/HashMap/case class. For example, for strawberry, the record (s3://data/runid-01, transformer-0.1.0, PROCESSED) could carry {"shredded_types": "LIST-OF-NEW-SHREDDED-TYPES"}
id must uniquely identify a subset of rows (events), whether this is an S3 folder, a window, a glob file pattern or anything else (right now it is the S3 path without the bucket, which was a terrible mistake)
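The record shape and the append-only rule above could be sketched roughly as follows, in Python. This is an illustration under stated assumptions, not the proposed implementation: `Record`, `AppendOnlyManifest`, `find` and `latest_state` are hypothetical names, the index is a plain in-memory dict keyed by the (id, app, state) triple, and `latest_state` assumes that states only ever progress in the order NEW, PROCESSING, PROCESSED.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Dict, List, Optional, Tuple

# Assumed ordering of states, from least to most advanced.
STATES = ("NEW", "PROCESSING", "PROCESSED")


@dataclass(frozen=True)  # frozen: a record cannot be modified after creation
class Record:
    id: str                               # e.g. "s3://data/runid-01"
    app: str                              # name and version, e.g. "transformer-0.1.0"
    state: str                            # one of STATES
    timestamp: datetime
    analytics_sdk_version: str
    data: Optional[Dict[str, Any]] = None  # optional JSON-like payload

    def __post_init__(self) -> None:
        if self.state not in STATES:
            raise ValueError(f"unknown state: {self.state}")


class AppendOnlyManifest:
    """Every change is a new record; nothing is updated in place."""

    def __init__(self) -> None:
        self._log: List[Record] = []
        # Index on the (id, app, state) triple for cheap lookups.
        self._index: Dict[Tuple[str, str, str], List[Record]] = defaultdict(list)

    def append(self, record: Record) -> None:
        self._log.append(record)
        self._index[(record.id, record.app, record.state)].append(record)

    def find(self, id: str, app: str, state: str) -> List[Record]:
        """Return all records matching the exact triple."""
        return list(self._index.get((id, app, state), []))

    def latest_state(self, id: str, app: str) -> Optional[str]:
        """Most advanced state recorded for (id, app), or None if unseen."""
        for state in reversed(STATES):
            if (id, app, state) in self._index:
                return state
        return None
```

With this layout, checking whether a run has already been processed by a given app is a single index lookup rather than a fold over the whole table.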
Once this is implemented in the Analytics SDK, we can use it in strawberry to replace its homegrown (but still compatible) implementation