njfritter / poc-data-pipelines

Proof-of-Concept (POC) Data Pipelines for various use cases such as data streaming/ingestion, batch data processing, orchestration, and storage. Includes technologies such as Apache Airflow, Apache Spark, Apache Kafka, AWS, Python, and more.

Proof of Concept (POC) Data Pipelines

This repo is dedicated to testing out data technologies and to showcasing my proficiency in building various types of data pipelines.

To start, I will explore simpler use cases that combine technologies I have varying amounts of experience with. This will allow me to learn the nuances and functionality of data technologies I have less experience with (e.g. streaming data use cases) while also learning how to piece them together with technologies I know better (e.g. batch data processing).

I will leverage the power of the cloud to simulate "production" conditions for these pipelines as much as possible.

I plan on using the information gained to tackle more complex use cases (including domains I am generally interested in), which will be placed in separate repos.

All of these pipelines will be guided by simulated "business use cases" that might be posed to a data engineer by a product organization, team of analysts, etc.

Option for generating pseudo-real data (realistic-looking data that is generated synthetically): EventSim
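
To make the idea of pseudo-real data concrete, here is a minimal Python sketch of a synthetic event generator. It is not EventSim and does not use its API or its output schema; the function name `generate_events`, the field names (`ts`, `user_id`, `page`), and the page values are all hypothetical and chosen only for illustration.

```python
import json
import random
from datetime import datetime, timedelta, timezone

# Hypothetical page names; these do not come from EventSim or this repo.
PAGES = ["Home", "NextSong", "Settings", "Logout"]


def generate_events(n_events, n_users=5, seed=42):
    """Yield n_events fake page-view events in timestamp order."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same data
    ts = datetime(2024, 1, 1, tzinfo=timezone.utc)  # arbitrary start time
    for _ in range(n_events):
        # Advance the clock by a random gap so timestamps never go backwards.
        ts += timedelta(seconds=rng.randint(1, 30))
        yield {
            "ts": ts.isoformat(),
            "user_id": rng.randint(1, n_users),
            "page": rng.choice(PAGES),
        }


if __name__ == "__main__":
    # Print a few events as JSON lines, a common format for feeding a pipeline.
    for event in generate_events(3):
        print(json.dumps(event))
```

Seeding the generator keeps test runs reproducible, which helps when comparing pipeline output across changes.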

The Pipelines

This section will be updated as I build out each of the pipelines:

  1. Kafka Spark Streaming Pipeline with data from the Coinbase API