Add Speed Layer Piece of Data Pipeline

njfritter commented 7 months ago

Write the data that Spark aggregates and writes back to Kafka into a temporary caching (speed) layer that serves real-time data; this will be combined with batch data to serve as the backend for the Analytics UI.
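As a rough sketch of what "combine with batch data" could look like at query time: the snippet below assumes the aggregates are additive counts keyed by name, which this issue does not specify. The function name `merge_views` and the sample keys are hypothetical.

```python
from collections import Counter


def merge_views(batch_view, speed_view):
    """Combine batch totals with real-time increments, key by key."""
    merged = Counter(batch_view)
    merged.update(speed_view)  # adds counts for keys present in both views
    return dict(merged)


# Hypothetical per-page event counts, for illustration only.
batch_view = {"home": 100, "checkout": 40}  # recomputed by the batch job
speed_view = {"home": 3, "search": 7}       # events seen since the last batch run

serving_view = merge_views(batch_view, speed_view)
```

If the aggregates turn out not to be additive (averages, distinct counts), the merge step would need a different rule per metric.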

Redis or Memcached seem like straightforward choices for now; AWS ElastiCache supports both engines.
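Both Redis and Memcached support per-key expiry, which fits a temporary layer. To show the behavior without a running server, here is a minimal in-memory stand-in written with only the standard library; the class name `TTLCache` and its methods are hypothetical and are not the API of either store.

```python
import time


class TTLCache:
    """In-memory key-value store whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock    # injectable so expiry can be exercised in tests
        self._entries = {}    # key -> (value, expires_at)

    def set(self, key, value):
        self._entries[key] = (value, self.clock() + self.ttl_seconds)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # lazily evict the expired entry
            return default
        return value
```

A real deployment would delegate expiry to the cache server rather than track it in the application.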

njfritter commented 7 months ago

Depending on cost, S3 might also be a fine option, and it would also make it possible to set a retention policy.

Edit: For a local implementation, I will use a locally installed Postgres DB.
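A sketch of how the latest aggregate per key could be kept in a relational table. To keep it runnable without a database server, it uses Python's built-in `sqlite3` as a stand-in rather than Postgres; Postgres accepts the same `INSERT ... ON CONFLICT ... DO UPDATE` form, though a Postgres driver would use its own placeholder style. The table name, column names, and sample values are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the local Postgres instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE speed_layer ("
    " metric TEXT PRIMARY KEY,"
    " value REAL NOT NULL,"
    " updated_at TEXT NOT NULL)"
)


def upsert(conn, metric, value, updated_at):
    """Insert a new aggregate, or overwrite the existing row for the metric."""
    conn.execute(
        "INSERT INTO speed_layer (metric, value, updated_at) VALUES (?, ?, ?) "
        "ON CONFLICT (metric) DO UPDATE SET "
        "value = excluded.value, updated_at = excluded.updated_at",
        (metric, value, updated_at),
    )
    conn.commit()


# Hypothetical sample rows: the second call replaces the first.
upsert(conn, "page_views", 10, "2024-01-01T00:00:00")
upsert(conn, "page_views", 12, "2024-01-01T00:01:00")
rows = conn.execute("SELECT metric, value FROM speed_layer").fetchall()
```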

njfritter commented 7 months ago

According to https://en.wikipedia.org/wiki/Lambda_architecture (regarding the speed layer):

Output is typically stored on fast NoSQL databases,[6][7] or as a commit log.[8]
Dedicated stores used in the serving layer include Apache Cassandra, Apache HBase, Azure Cosmos DB, MongoDB, VoltDB or Elasticsearch for speed-layer output

I have just gotten the Postgres DB setup working, but may look into alternatives to Postgres for a proper implementation.

Edit: Potential local options are:

TinyDB, a lightweight document oriented database written in pure Python
lowDB, based on the JavaScript library at https://github.com/typicode/lowdb
FileXdb-Python, a lightweight local NoSQL database, optimized for best coding experience

Edit 2: Cassandra should actually work just fine for this as it is open source and can be easily installed and used locally.

njfritter commented 7 months ago

PR merged, closing.

njfritter / poc-data-pipelines

Add Speed Layer Piece of Data Pipeline #6