Project made by 3 INSA Lyon students for the OT7 Data Engineering course taught by Ricardo TOMMASSINI.
> **Note**: the following setup steps are needed the first time only.
Create the necessary folders:

```bash
mkdir -p ./dags ./logs ./plugins ./config
```
Set the right Airflow user:

```bash
echo -e "AIRFLOW_UID=$(id -u)" > .env
```
Initialize the database:

```bash
docker compose up airflow-init
```

Then start all services:

```bash
docker compose up
```
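Once the containers are running, you can check that every service reports a healthy state with the standard Compose status command:

```bash
# List all containers of the project and their health status
docker compose ps
```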
The webserver is available at http://localhost:8080. The default account has the login `airflow` and the password `airflow`. To connect to the Postgres database using pgAdmin, the username is `airflow`.
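If you prefer the command line over pgAdmin, a minimal sketch for opening a `psql` shell, assuming the Postgres service is named `postgres` in the compose file (as in the stock Airflow compose setup):

```bash
# Open a psql shell as the airflow user
# (service name "postgres" is an assumption; check docker-compose.yaml)
docker compose exec postgres psql -U airflow
```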
If you want to run the pipeline offline, make sure the `analysispc2021` and `stationpc.csv` files are present. Then simply skip the ingestion pipeline, as it is not made to run offline, and start directly from the wrangling pipeline (see the sketch below).
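As a sketch, the wrangling pipeline can be triggered directly from the CLI; the DAG id below is hypothetical, assuming it matches the `wrangle.py` file name:

```bash
# Trigger the wrangling DAG directly, skipping ingestion
# (DAG id assumed to match the file name; verify with `airflow dags list`)
docker compose run airflow-worker airflow dags trigger wrangle
```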
Within the Airflow webserver, create a new connection to Spark. To do so, fill in:

- Connection Id: `spark-conn`
- Connection Type: `Spark`
- Host: `spark://spark-master`
- Port: `7077`
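The same connection can also be created without the web UI. This is a sketch using the standard `airflow connections add` command, run inside one of the `airflow-*` services:

```bash
# Create the Spark connection from the CLI instead of the web UI
docker compose run airflow-worker airflow connections add 'spark-conn' \
    --conn-type 'spark' \
    --conn-host 'spark://spark-master' \
    --conn-port '7077'
```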
You can run CLI commands, but you have to do it in one of the defined `airflow-*` services, e.g.:

```bash
docker compose run airflow-worker airflow info
```
Python dependencies can be added through the `requirements.txt` file. Keep in mind that when a new dependency is added, you have to rebuild the image. See the Airflow documentation for more information.
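For example, a minimal rebuild cycle after editing `requirements.txt` (standard Compose commands; exact flags may vary with your setup):

```bash
# Rebuild the Airflow image so the new dependency is installed,
# then recreate the running services
docker compose build
docker compose up -d
```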
The goal of the project is to implement a full-stack data pipeline to answer 2-3 questions formulated in natural language. We chose the following questions, focusing on France in 2021:
To answer them, we use 3 datasets: Géorisques, Hub'eau, and Géod'air. For more information about a dataset, you can look at its README in the `/dags/<dataset>` folder.
The full project report is available here.
The retrieved data are stored in the `/data` folder, organized in 3 "zones": landing, staging, and production.
The pipelines are defined in the `/dags` folder:

- `ingest.py`: responsible for bringing raw data to the landing zone. It takes approximately 15 min to run.
- `wrangle.py`: responsible for migrating raw data from the landing zone into the staging area (cleaning, wrangling, transformation, etc.). Again, it takes approximately 15 min to run.
- `production.py`: responsible for moving the data from the staging zone into the production zone and triggering the update of the data marts (views). The data marts consist of a SQL table located in the Postgres `production` db and a toy graph database stored in Neo4J. Building the whole graph requires hours, so we have restricted it to 1000 relations for the sake of illustration.
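To run the full chain end to end, the DAGs can be triggered in order from the CLI. The DAG ids below are hypothetical, assuming they match the file names; check `airflow dags list` for the real ones:

```bash
# Trigger the three pipelines in order (hypothetical DAG ids).
# Note: `dags trigger` only queues a run; wait for each stage to
# finish in the web UI before triggering the next one.
docker compose run airflow-worker airflow dags trigger ingest
docker compose run airflow-worker airflow dags trigger wrangle
docker compose run airflow-worker airflow dags trigger production
```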
The services are exposed at the following URLs:

| Service | URL |
|---|---|
| airflow | localhost:8080 |
| spark-master | localhost:9090 |
| Neo4J Browser | localhost:7474 |
| Neo4J DB | localhost:7687 |