Code for the blog post Data Engineering Best Practices - #1. Data flow & Code
Assume we are extracting customer and order information from upstream sources and creating a daily report of the number of orders.
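To make the goal concrete, here is a minimal sketch of the report logic in plain Python: join orders to known customers and count orders per day. The sample records, field names (`customer_id`, `order_date`), and the `daily_order_counts` helper are hypothetical and only for illustration; they are not the repo's schema or its Spark implementation.

```python
from collections import Counter
from datetime import date

# Hypothetical sample records; field names are illustrative, not the repo's schema.
customers = [
    {"customer_id": 1, "name": "Ada"},
    {"customer_id": 2, "name": "Grace"},
]
orders = [
    {"order_id": 101, "customer_id": 1, "order_date": date(2023, 1, 1)},
    {"order_id": 102, "customer_id": 2, "order_date": date(2023, 1, 1)},
    {"order_id": 103, "customer_id": 1, "order_date": date(2023, 1, 2)},
    {"order_id": 104, "customer_id": 99, "order_date": date(2023, 1, 2)},  # unknown customer
]


def daily_order_counts(customers, orders):
    """Count orders per day, keeping only orders placed by known customers."""
    known_ids = {c["customer_id"] for c in customers}
    counts = Counter(
        o["order_date"] for o in orders if o["customer_id"] in known_ids
    )
    # Sort by date so the report reads chronologically.
    return dict(sorted(counts.items()))


report = daily_order_counts(customers, orders)
for day, num_orders in report.items():
    print(day.isoformat(), num_orders)
```

Dropping orders with an unknown customer is one possible design choice; a real pipeline might instead route such rows to a quarantine table for inspection.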
If you'd like to code along, you'll need the following.
Prerequisite: git, make, and Docker. Confirm that Docker is running with:
docker ps
Run the following commands in a terminal. On Windows, set up Ubuntu with WSL and run the commands in that terminal.
git clone https://github.com/josephmachado/data_engineering_best_practices.git
cd data_engineering_best_practices
make up # Spin up containers
make ddl # Create tables & views
make ci # Run checks & tests
make etl # Run etl
make spark-sh # Spark shell to check created tables
spark.sql("select partition from adventureworks.sales_mart group by 1").show() // the number of rows shown should equal the number of times you ran `make etl`
spark.sql("select count(*) from businessintelligence.sales_mart").show() // 59
spark.sql("select count(*) from adventureworks.dim_customer").show() // 1000 * num of etl runs
spark.sql("select count(*) from adventureworks.fct_orders").show() // 10000 * num of etl runs
:q // Quit scala shell
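The expected values in the comments above follow a simple pattern, which this small Python helper spells out as a sanity check. It is only a sketch: the `expected_counts` function and its constant names are hypothetical, the per-run numbers are copied from the comments, and the count of 59 is assumed to stay fixed because its comment gives no per-run multiplier.

```python
# Per-run row counts copied from the comments next to the Spark shell checks.
CUSTOMERS_PER_RUN = 1_000
ORDERS_PER_RUN = 10_000
# Assumption: this count does not change with the number of runs.
SALES_MART_VIEW_ROWS = 59


def expected_counts(num_etl_runs: int) -> dict:
    """Return the counts the checks should show after `num_etl_runs` ETL runs."""
    if num_etl_runs < 1:
        raise ValueError("run `make etl` at least once before checking counts")
    return {
        "adventureworks.sales_mart partitions": num_etl_runs,
        "businessintelligence.sales_mart": SALES_MART_VIEW_ROWS,
        "adventureworks.dim_customer": CUSTOMERS_PER_RUN * num_etl_runs,
        "adventureworks.fct_orders": ORDERS_PER_RUN * num_etl_runs,
    }


# Example: expected values after three ETL runs.
for table, count in expected_counts(3).items():
    print(f"{table}: {count}")
```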
You can see the results of the data quality (DQ) checks using make meta, then run the following query:
select * from ge_validations_store limit 1;
exit
Use make down to spin down containers.