josephmachado / data_engineering_best_practices

Sample project to demonstrate data engineering best practices
https://www.startdataengineering.com/post/de_best_practices/

Data Engineering Best Practices

Code for the blog post Data Engineering Best Practices - #1. Data flow & Code.

Project

Assume we are extracting customer and order information from upstream sources and creating a daily report of the number of orders.
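
To make the goal concrete, here is a minimal plain-Python sketch of the report logic: count orders per day, keeping only orders that belong to a known customer. The sample records, the field names (`customer_id`, `order_date`), and the function `daily_order_counts` are illustrative assumptions. They are not taken from the repository, whose pipeline runs on Spark rather than in-memory Python.

```python
from collections import Counter
from datetime import date

# Hypothetical sample data; field names are assumptions, not the repo's schema.
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "order_date": date(2023, 1, 1)},
    {"order_id": 11, "customer_id": 2, "order_date": date(2023, 1, 1)},
    {"order_id": 12, "customer_id": 1, "order_date": date(2023, 1, 2)},
    {"order_id": 13, "customer_id": 99, "order_date": date(2023, 1, 2)},  # unknown customer
]


def daily_order_counts(customers, orders):
    """Count orders per day, keeping only orders placed by known customers."""
    known_ids = {c["customer_id"] for c in customers}
    counts = Counter(
        o["order_date"] for o in orders if o["customer_id"] in known_ids
    )
    # Sort by date so the report reads chronologically.
    return dict(sorted(counts.items()))


report = daily_order_counts(customers, orders)
for day, num_orders in report.items():
    print(day.isoformat(), num_orders)
```

Dropping orders with an unknown customer is one possible design choice for the sketch; a real pipeline might instead route such rows to a quarantine table or fail a data quality check.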

[Figure: Data architecture]

Setup

To code along, you'll need the following prerequisites:

  1. git version >= 2.37.1
  2. Docker version >= 20.10.17 and Docker Compose v2 version >= v2.10.2. Confirm that Docker is running by executing docker ps.
  3. pgcli
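
If you want to check the version requirements programmatically, the following is a rough sketch that runs `git --version` and `docker --version` and compares the result against the minimums listed above. The helper names `parse_version` and `check_tool` are hypothetical, and the sketch assumes each tool prints a three-part dotted version number. It does not cover Docker Compose or pgcli.

```python
import re
import shutil
import subprocess

# Minimum versions taken from the prerequisites list.
MINIMUMS = {"git": (2, 37, 1), "docker": (20, 10, 17)}


def parse_version(text):
    """Return the first dotted x.y.z version in text as a tuple of ints, or None."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    return tuple(int(part) for part in match.groups()) if match else None


def check_tool(name, minimum):
    """Run `<name> --version` and describe whether it meets the minimum."""
    if shutil.which(name) is None:
        return f"{name}: not found on PATH"
    result = subprocess.run([name, "--version"], capture_output=True, text=True)
    found = parse_version(result.stdout)
    if found is None:
        return f"{name}: could not parse version"
    # Tuples compare element by element, so (20, 10, 17) >= (20, 10, 17) holds.
    status = "ok" if found >= minimum else "too old"
    return f"{name}: {'.'.join(map(str, found))} ({status})"


if __name__ == "__main__":
    for tool, minimum in MINIMUMS.items():
        print(check_tool(tool, minimum))
```

Comparing versions as integer tuples avoids the pitfall of string comparison, where "2.9.0" would sort after "2.37.1".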

Run the following commands via the terminal. If you are using Windows, use WSL to set up Ubuntu and run the following commands via that terminal.

git clone https://github.com/josephmachado/data_engineering_best_practices.git
cd data_engineering_best_practices
make up # Spin up containers
make ddl # Create tables & views
make ci # Run checks & tests
make etl # Run etl
make spark-sh # Open a Spark shell to check the created tables; run the remaining lines inside it
spark.sql("select partition from adventureworks.sales_mart group by 1").show() // number of rows should equal the number of times you ran `make etl`
spark.sql("select count(*) from businessintelligence.sales_mart").show() // 59
spark.sql("select count(*) from adventureworks.dim_customer").show() // 1000 * number of etl runs
spark.sql("select count(*) from adventureworks.fct_orders").show() // 10000 * number of etl runs
:q // Quit the Scala shell
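
The expected values in the comments above scale with the number of ETL runs, which can be expressed as a small sanity check. This is only a sketch: `expected_counts`, `find_mismatches`, and the dictionary keys are hypothetical names, and the observed numbers are made up to demonstrate a mismatch. The fixed value of 59 for `businessintelligence.sales_mart` is left out because the source does not say whether it scales with the number of runs.

```python
def expected_counts(num_etl_runs):
    """Expected values from the Spark shell comments, per number of ETL runs."""
    return {
        "sales_mart_partitions": num_etl_runs,
        "dim_customer_rows": 1000 * num_etl_runs,
        "fct_orders_rows": 10000 * num_etl_runs,
    }


def find_mismatches(observed, num_etl_runs):
    """Return {name: (expected, observed)} for every value that differs."""
    expected = expected_counts(num_etl_runs)
    return {
        name: (want, observed.get(name))
        for name, want in expected.items()
        if observed.get(name) != want
    }


# Hypothetical numbers read off the Spark shell after two runs of `make etl`.
observed = {
    "sales_mart_partitions": 2,
    "dim_customer_rows": 2000,
    "fct_orders_rows": 19000,  # deliberately wrong to show a mismatch
}
print(find_mismatches(observed, num_etl_runs=2))
# prints {'fct_orders_rows': (20000, 19000)}
```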

You can see the results of the data quality (DQ) checks by running make meta and then entering the following at the prompt it opens:

select * from ge_validations_store limit 1;
exit

Use make down to spin down the containers.

Architecture

[Figure: Data architecture]