kreuzwerker / kreuzlaker


Integrate RDBMS-via-cdc data into the data lake #7

Open fabdy opened 1 year ago

fabdy commented 1 year ago

A very common pattern is probably a PG/MySQL database that holds the transactional data and should be ingested into the data lake. For that, we would like to have a PG DB that regularly receives some data changes (via a Lambda) and streams these changes into a raw-raw S3 location in the data lake (via AWS DMS). From there, we want to transform these changes into an event table in Parquet. It would also be nice to have a way to get the latest state per table (e.g. a query that uses the primary key and returns the latest row for each key).
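To make the "latest row per primary key" idea concrete, here is a minimal sketch in plain Python. It is not the project's implementation. The field names `op`, `id`, and `changed_at` are assumptions for illustration (DMS marks change records with I/U/D operation codes, but the actual column names and the presence of a timestamp column depend on how the endpoint is configured), and `latest_rows` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable


def latest_rows(
    events: Iterable[Dict[str, Any]],
    key: str = "id",                # assumed primary key column
    order_by: str = "changed_at",   # assumed ordering column (e.g. commit timestamp)
    op_field: str = "op",           # assumed operation column: "I", "U" or "D"
) -> Dict[Any, Dict[str, Any]]:
    """Return the most recent event per primary key, dropping deleted keys."""
    latest: Dict[Any, Dict[str, Any]] = {}
    for event in events:
        pk = event[key]
        current = latest.get(pk)
        # Keep the event with the highest ordering value; on ties, the later
        # event in the input wins.
        if current is None or event[order_by] >= current[order_by]:
            latest[pk] = event
    # A key whose most recent change is a delete no longer exists in the source.
    return {pk: row for pk, row in latest.items() if row[op_field] != "D"}


# Hypothetical change events, as they might look in the event table.
events = [
    {"op": "I", "id": 1, "changed_at": 1, "name": "a"},
    {"op": "U", "id": 1, "changed_at": 2, "name": "b"},
    {"op": "I", "id": 2, "changed_at": 1, "name": "c"},
    {"op": "D", "id": 2, "changed_at": 3, "name": "c"},
    {"op": "I", "id": 3, "changed_at": 2, "name": "d"},
]

current_state = latest_rows(events)
```

In a SQL engine on top of the Parquet event table, the same idea would typically be a `ROW_NUMBER()` window partitioned by the primary key and ordered by the change timestamp descending, keeping only row 1 and filtering out deletes.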

DoD