njfritter / poc-data-pipelines

Proof-of-Concept (POC) Data Pipelines for various use cases such as data streaming/ingestion, batch data processing, orchestration and storage. Includes technologies such as Apache Airflow, Apache Spark, Apache Kafka, AWS, Python and more

Add Speed Layer Piece of Data Pipeline #6

Closed njfritter closed 7 months ago

njfritter commented 7 months ago

Take the data that Spark aggregates and writes back to Kafka, and write it to a temporary caching (speed) layer that serves real-time data; this layer will be combined with the batch data to serve as the backend for the Analytics UI.

Redis or Memcached seem like straightforward choices for now; AWS has an ElastiCache offering that supports both engines.
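
To make the idea concrete, here is a minimal sketch of how a speed layer could be combined with a batch view at query time. It is not the implementation in this repo: `SpeedLayer` is a hypothetical in-memory stand-in for Redis or Memcached, `serve` is a hypothetical helper, and the sketch assumes the aggregates are additive counts and that the speed layer only holds events not yet covered by the batch view.

```python
import time
from typing import Dict, Optional, Tuple


class SpeedLayer:
    """Hypothetical in-memory stand-in for a cache such as Redis or Memcached."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        # key -> (aggregated value, expiry timestamp)
        self._store: Dict[str, Tuple[int, float]] = {}

    def put(self, key: str, value: int, now: Optional[float] = None) -> None:
        """Store the latest aggregate for a key; it expires after ttl_seconds."""
        now = time.time() if now is None else now
        self._store[key] = (value, now + self.ttl_seconds)

    def snapshot(self, now: Optional[float] = None) -> Dict[str, int]:
        """Return only the entries that have not expired yet."""
        now = time.time() if now is None else now
        return {k: v for k, (v, expiry) in self._store.items() if expiry > now}


def serve(batch_view: Dict[str, int], speed_view: Dict[str, int]) -> Dict[str, int]:
    """Combine both views, assuming additive counts with no overlap in coverage."""
    merged = dict(batch_view)
    for key, value in speed_view.items():
        merged[key] = merged.get(key, 0) + value
    return merged
```

In a real setup the `put` calls would come from a Kafka consumer reading the Spark output topic, and the expiry would be handled by the cache itself (for example a per-key TTL) rather than in application code.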

njfritter commented 7 months ago

Depending on cost, S3 might also be a fine option, and it would also make it possible to set a retention policy.

Edit: For the sake of a local implementation, I will use a locally installed Postgres DB for this.
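
A rough sketch of what a table-backed speed layer could look like, with an upsert per aggregate and a simple retention purge. To keep it self-contained it uses Python's built-in `sqlite3` as a stand-in for the local Postgres DB; the table name `speed_layer`, its columns, and the helper functions are all hypothetical, not taken from the merged PR. The `INSERT ... ON CONFLICT ... DO UPDATE` form is also valid in Postgres, though a Postgres driver such as psycopg2 would use `%s` placeholders instead of `?`.

```python
import sqlite3

# Hypothetical schema: one row per metric key, overwritten on each update.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS speed_layer (
    metric_key   TEXT PRIMARY KEY,
    metric_value INTEGER NOT NULL,
    updated_at   REAL NOT NULL
)
"""

UPSERT_SQL = """
INSERT INTO speed_layer (metric_key, metric_value, updated_at)
VALUES (?, ?, ?)
ON CONFLICT (metric_key) DO UPDATE SET
    metric_value = excluded.metric_value,
    updated_at   = excluded.updated_at
"""


def create_table(conn: sqlite3.Connection) -> None:
    conn.execute(CREATE_SQL)
    conn.commit()


def upsert_aggregate(conn: sqlite3.Connection, key: str, value: int, updated_at: float) -> None:
    """Insert a new aggregate or overwrite the existing row for the same key."""
    conn.execute(UPSERT_SQL, (key, value, updated_at))
    conn.commit()


def purge_older_than(conn: sqlite3.Connection, cutoff: float) -> int:
    """Retention sketch: delete rows last updated before cutoff, return the count."""
    cursor = conn.execute("DELETE FROM speed_layer WHERE updated_at < ?", (cutoff,))
    conn.commit()
    return cursor.rowcount
```

The upsert keeps the table small (one row per key), which fits the "temporary" nature of the speed layer; the purge plays the role that a retention policy would play on S3.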

njfritter commented 7 months ago

According to https://en.wikipedia.org/wiki/Lambda_architecture (regarding the speed layer):

I have just gotten the Postgres DB setup working, but may look into alternatives to Postgres for a proper implementation.

Edit: Potential local options are

Edit 2: Cassandra should actually work just fine for this as it is open source and can be easily installed and used locally.

njfritter commented 7 months ago

PR merged, closing.