
PureSphere

Project made by 3 INSA Lyon students for the OT7 Data Engineering course of Ricardo TOMMASSINI:

Initialization

Note: first time only.

Create the necessary folders:

mkdir -p ./dags ./logs ./plugins ./config

Set the right Airflow user:

echo -e "AIRFLOW_UID=$(id -u)" > .env

Initialize the database:

docker compose up airflow-init
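
To check that the initialization finished without errors, you can inspect the logs of the init container:

docker compose logs airflow-init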

Run Airflow

docker compose up

The webserver is available at http://localhost:8080. The default account has the login airflow and the password airflow. To connect to the Postgres database using pgAdmin, the username is airflow.
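
If you prefer a command-line client to pgAdmin, a minimal sketch is to open psql inside the Postgres container; the service, database and password names below assume the standard Airflow docker-compose setup (all airflow):

# Open a psql shell in the Postgres container (service/database names are assumptions)
docker compose exec postgres psql -U airflow -d airflow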

If you want to run the pipeline offline, make sure:

Then simply skip the ingestion pipeline, as it is not designed to run offline, and start directly from the wrangling pipeline (see the sketch below).
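
As a sketch, the wrangling pipeline can be triggered directly from the CLI; the DAG id wrangle is an assumption here and should be replaced by the id actually declared in wrangle.py:

# Trigger the wrangling DAG only (DAG id "wrangle" is assumed)
docker compose run airflow-worker airflow dags trigger wrangle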

Set up Spark connection

Within the Airflow webserver, create a new connection to Spark (Admin → Connections → +). A CLI alternative is sketched below.
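
If you prefer the CLI to the web UI, the connection can also be created with airflow connections add. This is only a sketch: the connection id spark_default and the Spark master host/port are assumptions and must match your docker-compose setup:

# Create a Spark connection from the CLI (connection id, host and port are assumptions)
docker compose run airflow-worker airflow connections add spark_default \
    --conn-type spark \
    --conn-host spark://spark-master \
    --conn-port 7077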

Run commands

You can run CLI commands, but you have to do it in one of the defined airflow-* services, e.g.:

docker compose run airflow-worker airflow info
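
For example, to list the DAGs Airflow has picked up from the /dags folder:

docker compose run airflow-worker airflow dags list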

Dependencies

Python dependencies can be added through the requirements.txt file. Keep in mind that whenever a new dependency is added, you have to rebuild the image. See the Airflow documentation for more information.
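
For instance, after adding a package to requirements.txt, rebuild the image and restart the services (this assumes the compose file builds the Airflow image from a local Dockerfile, as described in the Airflow docs):

# Rebuild the Airflow image after editing requirements.txt, then restart
docker compose build
docker compose up -d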

Quick overview of the project

The goal of the project is to implement a full-stack data pipeline to answer 2-3 questions formulated in natural language.

We chose the following questions, focusing on France in 2021:

To answer them, we use 3 datasets: Géorisques, Hub'eau and Géod'air.

For more information about a dataset, you can look at its README in the /dags/<dataset> folder.

The full project report is available here.

Data

The retrieved data are stored in the /data folder, organized into 3 "zones": the landing zone (raw data as ingested), the staging zone (cleaned and wrangled data), and the production zone (data feeding the data marts).
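
As an illustration only, the layout under /data could look like the following; the exact folder names are assumptions and may differ per dataset:

# Hypothetical layout of the /data folder (folder names are assumptions)
data/
├── landing/      # raw data as ingested from the sources
├── staging/      # cleaned and transformed data
└── production/   # data feeding the data marts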

Pipelines

The pipelines are defined in the /dags folder:

  1. ingest.py: responsible for bringing raw data into the landing zone. It takes approximately 15 minutes to run.
  2. wrangle.py: responsible for moving raw data from the landing zone into the staging area (cleaning, wrangling, transformation, etc.). Again, it takes approximately 15 minutes to run.
  3. production.py: responsible for moving the data from the staging zone into the production zone and triggering the update of the data marts (views). The data mart consists of a SQL table located in the Postgres production database and a toy graph database stored in Neo4J. Building the whole graph takes hours, so we have restricted it to 1000 relations for the sake of illustration.
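
Once production.py has run, you can check that the data mart table exists. A minimal sketch, assuming the production database lives in the same Postgres service and is named production (adjust the database name to your setup):

# List the tables of the production database (database name is an assumption)
docker compose exec postgres psql -U airflow -d production -c '\dt'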

Useful links

Service URL
airflow localhost:8080
spark-master localhost:9090
Neo4J Browser localhost:7474
Neo4J DB localhost:7687
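
A quick sanity check that the HTTP services are reachable (the Neo4J database itself speaks the Bolt protocol on port 7687, so it is not covered by curl):

curl -I http://localhost:8080   # Airflow webserver
curl -I http://localhost:9090   # Spark master UI
curl -I http://localhost:7474   # Neo4J Browser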