
Personal data engineering ETL pipeline project

🚀 Data Pipeline Using AWS Services and Apache Airflow

This project leverages fundamental knowledge learned from various data engineering courses to build an end-to-end data pipeline that automates the ETL process for further analysis and reporting.

📑 Table of Contents

• ✨ Overview
• 🏗️ Architecture
• 🗂️ Data Model
• 🌟 Airflow DAG
• ⚙️ Setup
• 📊 Dashboard Summary

✨ Overview

The project automates the daily extraction of bike store transactional data. The data is then transformed by dropping duplicates and null values, extracting date components, and structuring it into fact and dimension tables aligned with a star schema in Amazon Redshift. This approach optimizes performance for analysis in Power BI. The dashboard visualizes sales, product, and customer data over a two-year period, from 2016 to 2018.

Objectives

๐Ÿ—๏ธ Architecture

  1. Data Extraction: Starting with the data source, we use Amazon RDS with the PostgreSQL engine as the source database. Extraction is performed with SQL queries against the database, which has the aws_s3 extension installed for exporting query results to S3 (a minimal sketch of this step follows the list).

  2. Data Storage: Once the desired data is extracted, Amazon S3 buckets serve as a staging area for the raw data before transformation.

  3. Data Transformation: Next, Apache Spark running on Amazon EMR Serverless reads the data from the raw data bucket, cleans and transforms it, and writes the processed data back to an S3 bucket (see the PySpark sketch after this list).

  4. Data Warehousing: Amazon Redshift Serverless serves as the data warehouse, storing processed data in a dimensional model. The COPY command loads data from S3 into Redshift.

  5. Data Visualization: As the business intelligence layer, Power BI is used to build a dashboard and visualize data from the data warehouse.

  6. Pipeline Orchestration: Apache Airflow running on an Amazon EC2 instance automates the data pipeline. The Airflow DAG manages and schedules tasks to run daily.

  7. Access Control: AWS IAM manages access and permissions. Roles and policies are configured to control which AWS services each component can access.
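
As a rough illustration of the extraction step, the sketch below exports the result of an incremental query from the source PostgreSQL database to the raw data bucket through the aws_s3 extension. The connection details, table, and bucket names are placeholders, not the project's actual configuration.

```python
import psycopg2

# Placeholder connection details for the Amazon RDS PostgreSQL source database.
conn = psycopg2.connect(
    host="<rds-endpoint>",
    dbname="bike_store",
    user="<username>",
    password="<password>",
)

# Export only records changed since the previous day (incremental load based on
# the updated_at audit column), writing the result to the raw data bucket.
EXPORT_SQL = """
    SELECT * FROM aws_s3.query_export_to_s3(
        'SELECT * FROM sales.orders WHERE updated_at >= CURRENT_DATE - 1',
        aws_commons.create_s3_uri('<raw_bucket>', 'orders/orders.csv', 'us-east-1'),
        options := 'format csv'
    );
"""

with conn, conn.cursor() as cur:
    cur.execute(EXPORT_SQL)
```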
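
And as a sketch of the transformation step, the PySpark snippet below mirrors the cleaning described in the overview: dropping duplicates and null values and deriving date components before writing back to S3. The paths and column names are illustrative rather than the actual contents of spark_job.py.

```python
from pyspark.sql import SparkSession, functions as F

# Illustrative paths; the real job processes every extracted table.
RAW_PATH = "s3://<raw_bucket>/orders/"
PROCESSED_PATH = "s3://<processed_bucket>/orders/"

spark = SparkSession.builder.appName("bike_store_transform").getOrCreate()

orders = (
    spark.read.option("header", True).csv(RAW_PATH)
    .dropDuplicates()                                   # drop duplicate records
    .dropna()                                           # drop rows with null values
    .withColumn("order_date", F.to_date("order_date"))  # parse the date column
    .withColumn("order_year", F.year("order_date"))     # date components used by
    .withColumn("order_month", F.month("order_date"))   # the date dimension
    .withColumn("order_day", F.dayofmonth("order_date"))
)

orders.write.mode("overwrite").parquet(PROCESSED_PATH)
```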

Figure: Pipeline Architecture

🗂️ Data Model

The dataset used in this project is publicly available from here.

  1. Data Source

    The source database is structured as a relational data model, as shown in the link provided above. The data consists of bike store transactional data in the United States. For this project, we modified some parts of the original dataset, including:

    • Created a lookup table for order status from the orders table.
    • Added audit columns (created_at, updated_at, and is_active) to the customers, staffs, products, stores, and orders tables to support incremental data loading during extraction.
    • Set up database triggers on these tables to automatically update the audit columns when a record is inserted or updated (an illustrative sketch appears at the end of this section).
    • Established a soft-deletion policy using database rules to mark records as inactive rather than deleting them directly.

    Figure: Database Model

  2. Data Destination

    Amazon Redshift Serverless is used as the destination for our data. It consists of two layers separated by database schemas.

    • Staging Layer: This schema temporarily holds data loaded from the S3 bucket with the COPY command. The staging tables allow comparison with existing records in the serving layer to identify new or updated records. After records are moved into the serving layer, this layer is cleared (see the load-and-merge sketch at the end of this section).

    Figure: Data Warehouse Staging Tables

    • Serving Layer: This layer is structured as a star schema, a dimensional data model that takes a denormalized, read-optimized approach. The fact table contains measurable values, while the dimension tables provide context. Once data is loaded into the staging layer, SQL scripts compare the two layers so that only new or modified records are inserted or updated.

    Figure: Data Warehouse Model
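
The audit-column triggers mentioned under Data Source are not shown in this README; the snippet below is a minimal sketch of one way they could be defined, using a hypothetical set_updated_at function applied to the orders table via psycopg2.

```python
import psycopg2

# Hypothetical DDL: a trigger function stamps updated_at whenever a row in
# sales.orders is updated, which is what the incremental extraction relies on.
AUDIT_TRIGGER_DDL = """
    CREATE OR REPLACE FUNCTION set_updated_at() RETURNS trigger AS $$
    BEGIN
        NEW.updated_at := NOW();
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER trg_orders_updated_at
    BEFORE UPDATE ON sales.orders
    FOR EACH ROW
    EXECUTE FUNCTION set_updated_at();
"""

conn = psycopg2.connect(host="<rds-endpoint>", dbname="bike_store",
                        user="<username>", password="<password>")
with conn, conn.cursor() as cur:
    cur.execute(AUDIT_TRIGGER_DDL)
```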
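
Similarly, the staging-to-serving flow can be sketched as a COPY into the staging schema followed by a delete-and-insert merge into the serving schema. The table names, fact table, IAM role ARN, and Parquet file format are assumptions for illustration only.

```python
import psycopg2  # Redshift speaks the PostgreSQL wire protocol

# Assumed schema/table names and file format; adjust to the real warehouse model.
LOAD_AND_MERGE_SQL = """
    COPY staging.orders
    FROM 's3://<processed_bucket>/orders/'
    IAM_ROLE '<redshift-access-role-arn>'
    FORMAT AS PARQUET;

    -- Remove serving-layer rows that have a newer version in staging ...
    DELETE FROM serving.fact_sales
    USING staging.orders
    WHERE serving.fact_sales.order_id = staging.orders.order_id;

    -- ... then insert the new and updated records.
    INSERT INTO serving.fact_sales
    SELECT * FROM staging.orders;

    -- Clear the staging layer once records are moved to the serving layer.
    TRUNCATE TABLE staging.orders;
"""

conn = psycopg2.connect(host="<redshift-workgroup-endpoint>", port=5439,
                        dbname="dev", user="<username>", password="<password>")
with conn, conn.cursor() as cur:
    cur.execute(LOAD_AND_MERGE_SQL)
```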

🌟 Airflow DAG

Two DAGs are included in this project: one DAG to set up the environment and one DAG for the pipeline itself.

To start the pipeline, the Setup DAG must be run manually to prepare the necessary resources. Once setup is complete, the Pipeline DAG runs automatically according to its schedule.
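
A hypothetical skeleton of the Pipeline DAG is shown below. The task names mirror the architecture steps; in the real DAG they would be the operators that run the SQL export, the EMR Serverless job, and the Redshift load.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Hypothetical skeleton only: placeholder tasks stand in for the real operators.
with DAG(
    dag_id="bike_store_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # the Pipeline DAG is scheduled to run daily
    catchup=False,
) as dag:
    extract = EmptyOperator(task_id="extract_from_rds_to_s3")
    transform = EmptyOperator(task_id="transform_with_emr_serverless")
    load = EmptyOperator(task_id="load_to_redshift")

    extract >> transform >> load
```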

⚙️ Setup

  1. Amazon S3

    We need to create 4 buckets.

    • source_bucket to store necessary project files.
    • log_bucket to store logs from Amazon EMR Serverless applications and Apache Airflow.
    • raw_bucket to store raw data from extraction process.
    • processed_bucket to store processed data from transformation process.

    Some buckets need the following folder structures (a boto3 sketch at the end of this section shows one way to script the bucket setup):

    source_bucket
    ├───dataset
    │   ├───brands.csv
    │   ├───categories.csv
    │   ├───customers.csv
    │   ├───order_items.csv
    │   ├───orders.csv
    │   ├───products.csv
    │   ├───staffs.csv
    │   ├───stocks.csv
    │   └───stores.csv
    ├───ec2
    │   └───requirements.txt
    └───script
        └───spark_job.py

    Files in the dataset folder can be downloaded from the link above.

    log_bucket
    ├───airflow-log
    └───emr-log
  2. Amazon RDS

    Use the PostgreSQL engine with public access enabled. The necessary security group and IAM roles are listed below.

    • Security Group

      • Add inbound rule type PostgreSQL with port range 5432.
    • IAM Role

      | Role        | Policy        | Resource         | Action                                |
      |-------------|---------------|------------------|---------------------------------------|
      | Import Role | Import Policy | S3 source_bucket | s3:GetObject, s3:ListBucket           |
      | Export Role | Export Policy | S3 raw_bucket    | s3:PutObject, s3:AbortMultipartUpload |
  3. Amazon EMR Serverless

    The main requirement for setting up EMR Serverless is an IAM role with the necessary permissions for running the Spark job and writing logs.

    • IAM Role

      | Role             | Policy             | Resource                                                   | Action                                    |
      |------------------|--------------------|------------------------------------------------------------|-------------------------------------------|
      | EMR Runtime Role | EMR Runtime Policy | S3 source_bucket, log_bucket, raw_bucket, processed_bucket | s3:PutObject, s3:GetObject, s3:ListBucket |
  4. Amazon Redshift Serverless

    For compute resources, we use 8 RPUs for the serverless workgroup. The necessary security group and IAM role are listed below.

    • Security Group

      • Add inbound rule type Redshift with port range 5439.
    • IAM Role

      | Role                 | Policy                                     | Resource            | Action                                             |
      |----------------------|--------------------------------------------|---------------------|----------------------------------------------------|
      | Redshift Access Role | S3 Access Policy                           | S3 processed_bucket | s3:GetObject, s3:GetBucketLocation, s3:ListBucket  |
      |                      | AmazonRedshiftAllCommandsFullAccess Policy | AWS Managed Policy  | AWS Managed Policy                                 |
  5. Amazon EC2

    EC2 instance specifications:

    • OS: Ubuntu
    • Instance type: t3a.medium (2 vCPU, 4 GiB memory)
    • Storage: 8 GiB

    Also create a key pair to connect to your instance.

    • Security Group

      • Add inbound rule type SSH with port range 22.
      • Add inbound rule type Custom TCP with port range 8080 (for Apache Airflow).
    • IAM Role

      | Role     | Policy           | Resource                                    | Action                                    |
      |----------|------------------|---------------------------------------------|-------------------------------------------|
      | EC2 Role | EMR Policy       | Any EMR Serverless applications and job runs | All EMR Serverless actions               |
      |          | IAM Policy       | Any role                                    | iam:PassRole, iam:CreateServiceLinkedRole |
      |          | S3 Access Policy | S3 source_bucket, log_bucket                | s3:PutObject, s3:GetObject, s3:ListBucket |

    Set up the EC2 instance for Airflow by running the commands in ec2_setup.sh. Once Airflow is running, connect to the web UI by navigating to <EC2-public-IPv4-address>:8080 in a browser.
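
To close out the setup, the snippet below sketches how the four buckets and the folder prefixes from step 1 could be created with boto3. Bucket names must be globally unique, so the names and region here are placeholders.

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")  # assumed region

# Placeholder names; real S3 bucket names must be globally unique.
buckets = {
    "<source_bucket>": ["dataset/", "ec2/", "script/"],
    "<log_bucket>": ["airflow-log/", "emr-log/"],
    "<raw_bucket>": [],
    "<processed_bucket>": [],
}

for bucket, prefixes in buckets.items():
    # Outside us-east-1, add CreateBucketConfiguration={"LocationConstraint": region}.
    s3.create_bucket(Bucket=bucket)
    for prefix in prefixes:
        # Zero-byte objects whose keys end in "/" act as the "folders" shown above.
        s3.put_object(Bucket=bucket, Key=prefix)
```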

📊 Dashboard Summary

The full dashboard file is located in the dashboard folder.