
Data Engineering Nanodegree - Capstone Project

Airbnb Dataset (Amsterdam)

Introduction

This is the Capstone project for the Data Engineering Nanodegree Program from Udacity. Udacity gives us the freedom either to use one of the datasets it suggests or to pick a dataset that matches our interests and define the scope ourselves. In this project, I chose to build the project from scratch, with the restriction that the chosen dataset must be big and complex enough to give me the opportunity to apply what I have learned throughout the program. The dataset I will use is actual Airbnb data collected from Inside Airbnb. It contains all the information (reviews, calendars, listings) about my favourite city in the whole world, Amsterdam.

Project Scope

The scope of this project is to read data from an Amazon S3 bucket, load it into staging tables on Amazon Redshift, and then process the data to create fact and dimension tables. Finally, data quality checks are applied to ensure that the expected constraints hold on the ingested data.

The project follows these steps: explore and assess the data, upload the cleaned files to S3, stage the data in Redshift, build the fact and dimension tables, and run data quality checks.

Regarding the data model of this project, I will follow the Snowflake Schema, as some of the relationships are many-to-many, which a Star Schema does not support.

Dataset

The dataset was found on the source mentioned above and was compiled by the Inside Airbnb team on 08 August 2019. This dataset can be used for data analytics / data engineering purposes or to apply NLP techniques.

Source Files

There are four CSV files which I will use in this project: listings.csv, reviews.csv, calendar.csv, and neighbourhoods.csv.

Storage

The files (after the Exploring/Assessing phase) will be uploaded to an S3 bucket. The total size of this S3 bucket will be approximately 1 GB.
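
As a rough illustration, the upload can be scripted with boto3. This is only a sketch: the bucket name and key prefix below are placeholders, not the project's actual values.

    import boto3

    # Minimal sketch: upload the cleaned CSV files to S3 (bucket and prefix are placeholders).
    s3 = boto3.client("s3")

    for file_name in ["listings.csv", "reviews.csv", "calendar.csv", "neighbourhoods.csv"]:
        # upload_file(local_path, bucket, key)
        s3.upload_file(file_name, "airbnb-capstone-bucket", f"airbnb/{file_name}")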

Tooling

The tools used in this project are Apache Airflow, Amazon S3, and Amazon Redshift.

I chose these technologies for a few reasons:

1- Apache Airflow is an open-source project with an enormous community, which is very useful when something is unclear or undocumented. Moreover, it gives developers the chance to create new plugins based on their needs (a minimal example of such a plugin is sketched after this list).

2- All of them have been covered in the Nanodegree Program.
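
For illustration, here is a minimal sketch of what such a custom plugin could look like, assuming an Airflow 1.10-style setup as used in the Nanodegree: a data quality operator that fails if a table ends up empty after loading. The class name, arguments, and check are assumptions for the sake of the example, not necessarily the operators shipped in this repo's plugins directory.

    from airflow.hooks.postgres_hook import PostgresHook
    from airflow.models import BaseOperator
    from airflow.utils.decorators import apply_defaults

    class DataQualityOperator(BaseOperator):
        """Example plugin operator: fails the task if any given table is empty."""

        @apply_defaults
        def __init__(self, redshift_conn_id="", tables=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.redshift_conn_id = redshift_conn_id
            self.tables = tables or []

        def execute(self, context):
            redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
            for table in self.tables:
                records = redshift.get_records(f"SELECT COUNT(*) FROM {table}")
                if not records or records[0][0] < 1:
                    raise ValueError(f"Data quality check failed: {table} returned no rows")
                self.log.info("Data quality check on %s passed", table)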

Data Model


Explanation of the data model:

Starting with the Dimension Tables: DIM_REVIEWS, DIM_PROPERTIES, DIM_CALENDARS, and DIM_HOSTS.

The Fact Table: Airbnb_Amst_Facts.

The Reference Table: REF_NEIGHBOURHOODS.

Data Dictionary

DIM_REVIEWS:

FIELD                  TYPE      PK    FK
REVIEW_ID              INT       YES
REVIEWER_ID            INT
REVIEW_DATE            DATE
REVIEW_SCORES_RATING   FLOAT
COMMENT                INT
LISTING_ID             INT             YES

DIM_PROPERTIES:

FIELD                 TYPE      PK    FK
LISTING_ID            LONG      YES
ROOM_TYPE             VARCHAR
PROPERTY_TYPE         VARCHAR
GUSTES_INCLUDED       INT
CANCELLATION_POLICY   INT
HOST_ID               INT             YES
NEIGHBOURHOOD_ID      INT             YES

DIM_CALENDARS:

FIELD            TYPE      PK    FK
CALENDAR_ID      INT       YES
DATE             DATE
AVAILABLE        VARCHAR
ADJUSTED_PRICE   INT
MAXIMUM_NIGHTS   INT
MINIMUM_NIGHTS   INT
LIST_ID          INT             YES

DIM_HOSTS:

FIELD                       TYPE      PK    FK
HOST_ID                     INT       YES
HOST_NAME                   VARCHAR
HOST_URL                    VARCHAR
HOST_SINCE                  DATE
HOST_RESPONCE_TIME          VARCHAR
IS_SUPERHOST                VARCHAR
HOST_TOTAL_LISTINGS_COUNT   INT
LISTING_ID                  INT

REF_NEIGHBOURHOODS:

FIELD                TYPE      PK    FK
NEIGHBOURHOOD_ID     INT       YES
NEIGHBOURHOOD_NAME   VARCHAR

Airbnb_Amst_Facts:

FIELD               TYPE   PK    FK
FACT_ID             INT    YES
HOST_ID             LONG         YES
LIST_ID             LONG         YES
NUMBER_OF_REVIEWS   INT
REVIEWS_RATING      INT

Scenarios

The PROJECT RUBRIC states that the submitted project must address the following scenarios:

1- The data was increased by 100x.

2- The pipelines would be run daily by 7 am.

3- The database needed to be accessed by 100+ people.
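
For the second scenario, Airflow's scheduler handles the daily 7 am requirement directly through the DAG's cron schedule. The DAG id and start date below are placeholders; only the schedule expression is the point of the sketch.

    from datetime import datetime
    from airflow import DAG

    # "0 7 * * *" triggers the DAG once a day at 07:00.
    dag = DAG(
        "airbnb_amsterdam_etl",            # placeholder DAG id
        start_date=datetime(2019, 8, 8),   # placeholder start date
        schedule_interval="0 7 * * *",
        catchup=False,
    )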

Data Pipeline

Exploring and Assessing the data:

My ETL pipeline includes 18 tasks:

(ETL pipeline diagram)
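
The actual graph lives in the dags directory. As a rough, simplified sketch of the kind of dependency structure the pipeline has (the real DAG has 18 tasks, and every task id below is an illustrative stand-in), the chaining could look like this:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    dag = DAG("airbnb_etl_structure_sketch", start_date=datetime(2019, 8, 8), schedule_interval=None)

    # Stand-in tasks; in the real pipeline these are staging, load, and check operators.
    start = DummyOperator(task_id="begin_execution", dag=dag)
    stage_listings = DummyOperator(task_id="stage_listings", dag=dag)
    stage_reviews = DummyOperator(task_id="stage_reviews", dag=dag)
    stage_calendars = DummyOperator(task_id="stage_calendars", dag=dag)
    load_dimensions = DummyOperator(task_id="load_dimension_tables", dag=dag)
    load_fact = DummyOperator(task_id="load_fact_table", dag=dag)
    quality_checks = DummyOperator(task_id="run_data_quality_checks", dag=dag)
    end = DummyOperator(task_id="stop_execution", dag=dag)

    # Staging runs in parallel, then dimensions, the fact table, and quality checks in order.
    start >> [stage_listings, stage_reviews, stage_calendars]
    [stage_listings, stage_reviews, stage_calendars] >> load_dimensions
    load_dimensions >> load_fact >> quality_checks >> end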

Data Processing

I used SQL statements to process the data from the S3 bucket into pre-created tables on Redshift. For each task, there is an equivalent SQL statement that performs the data ingestion smoothly, with performance in mind given the size of the datasets.

Here is an example of how I ingest the data from the staging (STG) tables into one of the dimension tables, DIM_CALENDARS:

INSERT INTO DIM_CALENDARS 
        (CALENDAR_DATE, AVAILABLE, ADJUSTED_PRICE, MINIMUM_NIGHTS, MAXIMUM_NIGHTS, LISTING_ID)
        SELECT DISTINCT CALENDAR_DATE, AVAILABLE, ADJUSTED_PRICE, MINIMUM_NIGHTS, MAXIMUM_NIGHTS, LISTING_ID
        FROM STG_CALENDARS
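
Upstream of these inserts, the staging tables themselves are filled from S3. A minimal sketch of how that load could be issued from an Airflow task is shown below; the bucket path, IAM role, connection id, and table name are placeholders rather than the project's actual values.

    from airflow.hooks.postgres_hook import PostgresHook

    # Placeholder COPY statement: load one CSV from S3 into a Redshift staging table.
    COPY_STG_CALENDARS = """
        COPY STG_CALENDARS
        FROM 's3://airbnb-capstone-bucket/airbnb/calendar.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'
        CSV IGNOREHEADER 1;
    """

    def load_staging_calendars():
        # 'redshift' is an assumed Airflow connection id pointing at the cluster.
        redshift = PostgresHook(postgres_conn_id="redshift")
        redshift.run(COPY_STG_CALENDARS)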

How to run the project

1- Ensure that you have an Airflow instance up and running.

2- Copy all the content of the dags and plugins directories to your Airflow working environment.

3- On the Airflow GUI, under the Admin drop-down menu, select Connections and create the two connections the DAG expects (a programmatic alternative is sketched after these steps).

4- Activate the DAG and run it.
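
If you prefer to script the setup instead of using the GUI, connections can also be created programmatically. The connection ids and values below are assumptions for illustration; use whatever ids the DAG and plugins in this repo actually reference.

    from airflow import settings
    from airflow.models import Connection

    session = settings.Session()

    # Assumed AWS credentials connection (access key as login, secret key as password).
    session.add(Connection(conn_id="aws_credentials", conn_type="aws",
                           login="YOUR_AWS_ACCESS_KEY_ID",
                           password="YOUR_AWS_SECRET_ACCESS_KEY"))

    # Assumed Redshift connection (host, schema, user, and port are placeholders).
    session.add(Connection(conn_id="redshift", conn_type="postgres",
                           host="your-cluster.abc123.us-west-2.redshift.amazonaws.com",
                           schema="dev", login="awsuser",
                           password="YOUR_PASSWORD", port=5439))

    session.commit()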