smohler / lernspark

Have you ever wondered what the f#*k Apache Spark, Docker, CI/CD, and modern big data architectures are? Me too!

MIT License



lernspark

Have you ever wondered what the fuck Apache Spark, Docker, CI/CD, and modern big data architectures are? Me too! To understand them better, I made this application, which walks through a few steps to help make it all clear:

Quickly generate some random Parquet data, defined by you.
Upload that data to an S3 bucket.
Define a pipeline on the data.
Execute that pipeline on a subset of the S3 data for testing.
Dockerize your pipeline to make an image!
Upload the image to ECR.
Configure a Step Function to run the container for new uploads to the S3 bucket, or to rerun if the container version changes.
Made changes to the pipeline? Run a GitHub Action when you merge to main.
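
To make the first few steps concrete, here is a minimal sketch of the "generate data, define a pipeline, test on a subset" idea. It is not lernspark's actual code: the schema format, `make_rows`, and `pipeline` are hypothetical names invented for illustration, and it uses only the Python standard library as a stand-in for Parquet, S3, and Spark.

```python
import random

# Hypothetical schema (an assumption, not lernspark's real format):
# each column name maps to a function that draws one random value.
SCHEMA = {
    "user_id": lambda rng: rng.randint(1, 1000),
    "amount": lambda rng: round(rng.uniform(0, 100), 2),
    "country": lambda rng: rng.choice(["US", "DE", "JP"]),
}


def make_rows(schema, n, seed=0):
    """Generate n random rows (dicts) with one value per schema column."""
    rng = random.Random(seed)  # seeded so test runs are repeatable
    return [{name: gen(rng) for name, gen in schema.items()} for _ in range(n)]


def pipeline(rows):
    """Toy pipeline: drop rows with amount < 10, then sum amount per country."""
    totals = {}
    for row in rows:
        if row["amount"] >= 10:
            totals[row["country"]] = totals.get(row["country"], 0.0) + row["amount"]
    return totals


rows = make_rows(SCHEMA, 1000)  # stand-in for the random Parquet data
sample = rows[:50]              # stand-in for "a subset of the S3 data"
result = pipeline(sample)       # run the pipeline on the subset first
print(result)
```

In the real project the rows would presumably live in Parquet files in S3 and the pipeline would be a Spark job, but the shape is the same: a schema drives the random data, and the pipeline is checked against a small sample before it is containerized and deployed.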

Installation

For now, to install lernspark, clone the repo and run the bootstrap script:

cd ~
git clone https://github.com/smohler/lernspark.git
cd lernspark
chmod +x ~/lernspark/scripts/macOS/bootstrap.sh
~/lernspark/scripts/macOS/bootstrap.sh

After bootstrap.sh runs, you will have some commands loaded in your environment to help you explore and interact with lernspark.