njfritter / poc-data-pipelines

Proof-of-Concept (POC) Data Pipelines for various use cases such as data streaming/ingestion, batch data processing, orchestration and storage. Includes technologies such as Apache Airflow, Apache Spark, Apache Kafka, AWS, Python and more

Add Batch Layer Piece of Data Pipeline #4

Open njfritter opened 9 months ago

njfritter commented 9 months ago

Create the batch layer by persisting data written to Kafka in a separate data store.

Given that I plan to use Snowflake for batch data pipelines down the line, I may just use Snowflake and write to a table separate from any later tables, placing it in its own schema to distinguish it from the batch tables.

Edit: For the sake of a local implementation, I will use a Postgres DB, which can be set up locally (rather than a cloud-based solution like Snowflake/Redshift/etc.)
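A minimal sketch of what the batch-write step could look like, assuming consumed Kafka records arrive as dicts and land in a hypothetical `raw_events` table. The table name, column names, and batch size are illustrative, not decided. It uses the DB-API 2.0 interface, demonstrated here with stdlib `sqlite3` so it runs anywhere; against a real local Postgres you would connect with `psycopg2` instead and use `%s` placeholders rather than `?`.

```python
import json
import sqlite3


def chunked(records, batch_size):
    """Yield lists of at most batch_size records from an iterable."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def write_batches(conn, records, batch_size=500):
    """Insert records into a raw_events table in commits of batch_size.

    conn is any DB-API 2.0 connection; swap sqlite3 for psycopg2
    (and '?' for '%s') to target a local Postgres instance.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_events (event_id TEXT, payload TEXT)"
    )
    written = 0
    for batch in chunked(records, batch_size):
        conn.executemany(
            "INSERT INTO raw_events (event_id, payload) VALUES (?, ?)",
            [(r["event_id"], json.dumps(r)) for r in batch],
        )
        conn.commit()  # one commit per batch keeps transactions bounded
        written += len(batch)
    return written


# Usage with an in-memory DB standing in for Postgres:
conn = sqlite3.connect(":memory:")
events = [{"event_id": str(i), "value": i} for i in range(1200)]
n = write_batches(conn, events, batch_size=500)
# n == 1200, inserted in batches of 500, 500, and 200
```

Batching the inserts (rather than one commit per Kafka message) is the main point here: it keeps the consumer's write amplification down, and the same loop shape carries over whether the sink is Postgres, Snowflake, or files on S3.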

njfritter commented 9 months ago

S3 could also be an option here, depending on cost.

njfritter commented 8 months ago

This ticket will include the following: