Deliverance Data Engineering Project

This repository contains the code and infrastructure for the Totesys Data Engineering project, which aims to build a reliable and resilient data pipeline to extract, transform, and load data from an operational database into a data lake and data warehouse hosted on AWS.

Presentation

Documentation

Please find the documentation for the project and code here

Project Architecture

Project Overview

The primary objective of this project is to showcase skills and knowledge in Python, SQL, database modeling, AWS, operational practices, and Agile methodologies. The project involves the following key components:

Data Ingestion: A Python application running on AWS Lambda that continually ingests data from the totesys database and stores it in an S3 "ingestion" bucket.
Data Processing: Another Python application running on AWS Lambda that remodels the ingested data into a predefined schema suitable for a data warehouse and stores the processed data in Parquet format in an S3 "processed" bucket.
Data Loading: A Python application running on AWS Lambda that loads the processed data into a prepared data warehouse hosted on AWS at defined intervals.
Monitoring and Alerting: Comprehensive logging, monitoring, and alerting mechanisms using AWS CloudWatch to track the progress, detect failures, and trigger email notifications.
Data Visualization: A QuickSight dashboard that displays useful data from the data warehouse.
Infrastructure as Code: Automated deployment of the entire infrastructure using Terraform and a CI/CD pipeline with GitHub Actions, with separate dev, test, and prod environments.
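As a minimal sketch of the ingestion step described above, the snippet below writes a batch of rows to a timestamped object key. The key layout, the function names (`build_ingestion_key`, `ingest_table`), and the `sales_order` table name are assumptions for illustration only; the layout the project actually uses is defined in specifications/S3_Data_Storage_Specification.md.

```python
import json
from datetime import datetime, timezone
from typing import Callable, Iterable, Mapping, Optional


def build_ingestion_key(table: str, extracted_at: datetime) -> str:
    """Build a timestamped object key (hypothetical layout, not the project's spec)."""
    stamp = extracted_at.astimezone(timezone.utc)
    return f"{table}/{stamp:%Y/%m/%d}/{table}_{stamp:%Y%m%dT%H%M%SZ}.json"


def ingest_table(
    table: str,
    rows: Iterable[Mapping],
    put_object: Callable[[str, bytes], None],
    extracted_at: Optional[datetime] = None,
) -> Optional[str]:
    """Serialize rows as JSON and hand them to put_object; skip empty batches."""
    batch = list(rows)
    if not batch:
        # Nothing new in the source table, so no object is written.
        return None
    extracted_at = extracted_at or datetime.now(timezone.utc)
    key = build_ingestion_key(table, extracted_at)
    # default=str converts dates, decimals, and similar values to text.
    body = json.dumps(batch, default=str).encode("utf-8")
    put_object(key, body)
    return key


# Usage with an in-memory dict standing in for the "ingestion" bucket.
fake_bucket = {}
written_key = ingest_table(
    "sales_order",
    [{"sales_order_id": 1, "units_sold": 10}],
    fake_bucket.__setitem__,
    extracted_at=datetime(2024, 5, 1, 12, 0, 0, tzinfo=timezone.utc),
)
```

In a deployed Lambda, `put_object` could be a thin wrapper around the boto3 S3 client's `put_object(Bucket=..., Key=..., Body=...)` call. Passing it in as a parameter keeps the function testable without AWS access.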

Step Function Flow

The flow diagram of the step function is shown below.
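As a rough illustration of how such a flow can be expressed, the sketch below builds an Amazon States Language definition as a Python dictionary. The linear Ingest, Process, Load ordering is an assumption drawn from the component list, not a copy of the project's actual state machine, and the state names and ARNs are placeholders.

```python
import json
from typing import Dict, List


def build_state_machine(ingest_arn: str, process_arn: str, load_arn: str) -> Dict:
    """Return a three-task state machine definition (assumed ordering)."""
    return {
        "Comment": "Illustrative ETL flow: ingest, then process, then load",
        "StartAt": "Ingest",
        "States": {
            "Ingest": {"Type": "Task", "Resource": ingest_arn, "Next": "Process"},
            "Process": {"Type": "Task", "Resource": process_arn, "Next": "Load"},
            "Load": {"Type": "Task", "Resource": load_arn, "End": True},
        },
    }


def execution_order(definition: Dict) -> List[str]:
    """Follow the Next pointers from StartAt to list states in run order."""
    order = []
    name = definition["StartAt"]
    while True:
        order.append(name)
        state = definition["States"][name]
        if state.get("End"):
            return order
        name = state["Next"]


# Placeholder ARNs; real values would come from the deployed Lambda functions.
definition = build_state_machine(
    "arn:aws:lambda:REGION:ACCOUNT_ID:function:ingest",
    "arn:aws:lambda:REGION:ACCOUNT_ID:function:process",
    "arn:aws:lambda:REGION:ACCOUNT_ID:function:load",
)
definition_json = json.dumps(definition, indent=2)
```

The JSON string in `definition_json` has the shape a Step Functions state machine definition expects. A real definition would likely also add retry and error-handling states.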

Repository Structure

The repository is organized as follows:

.
├── Makefile
├── README.md
├── conventions
│   ├── ci-cd.md
│   ├── code-review.md
│   ├── docs-and-comments.md
│   ├── images
│   ├── pull-request.md
│   ├── terraform.md
│   └── testing.md
├── db
│   ├── connection.py
│   ├── data
│   ├── run_schema.py
│   ├── run_seed.py
│   ├── schema.sql
│   └── seed.py
├── dev-db-terraform
│   ├── dev_db.tf
│   ├── main.tf
│   └── ...
├── python
│   ├── src
│   └── tests
├── requirements.in
├── specifications
│   ├── Deliverance_ETL_architecture_diagram.png
│   ├── Deliverance_ETL_architecture_diagram.svg
│   ├── S3_Data_Storage_Specification.md
│   ├── ingestion_lambda_spec.md
│   ├── project_plan.md
│   ├── specifiction.md
│   └── processing_lambda_spec.md
└── terraform
    ├── data.tf
    ├── dev.tfvars
    ├── eventbridge.tf
    ├── iam.tf
    ├── lambda.tf
    ├── main.tf
    ├── prod.tfvars
    ├── s3.tf
    ├── test.tfvars
    └── variables.tf

terraform/: Contains Terraform configuration files for provisioning the AWS infrastructure.
python/: Contains the source code for the Python Lambda functions responsible for data ingestion, processing, and loading. Includes unit tests.
.github/workflows/: Contains GitHub Actions workflows for continuous integration and deployment.
README.md: This file, providing an overview of the project and instructions for setup and deployment.

Getting Started

To get started with the project, follow these steps:

Clone the repository: git clone https://github.com/your-username/totesys-data-engineering.git
Install the required dependencies (e.g., Terraform, AWS CLI, and Python).
Configure your AWS credentials and set up the necessary IAM roles and policies.
Customize the Terraform configuration files in the terraform/ directory to match your AWS account and desired settings.
Deploy the infrastructure using Terraform: run terraform init, then terraform apply.
Set up the CI/CD pipeline by configuring the GitHub Actions workflows in the .github/workflows/ directory.
Commit and push your changes to the repository to trigger the CI/CD pipeline and deploy the Lambda functions.
Monitor the pipeline execution and check the CloudWatch logs for any issues or failures.
Once the deployment is successful, you can trigger the data ingestion process and observe the data flow through the pipeline.
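The Terraform step above can be sketched as a small helper that builds the commands for one environment. It assumes each environment is selected by passing its .tfvars file from the terraform/ directory via `-var-file`; the helper name `terraform_commands` is hypothetical, and the project's CI/CD workflow may invoke Terraform differently.

```python
from typing import List

# Environments that have a matching .tfvars file in terraform/.
ENVIRONMENTS = ("dev", "test", "prod")


def terraform_commands(environment: str) -> List[List[str]]:
    """Return the init and apply commands for the given environment."""
    if environment not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    return [
        ["terraform", "init"],
        ["terraform", "apply", f"-var-file={environment}.tfvars"],
    ]


# Build (but do not run) the commands for the dev environment.
dev_commands = terraform_commands("dev")
```

Each command list could then be run from the terraform/ directory, for example with `subprocess.run(cmd, cwd="terraform", check=True)`, which stops at the first failing step.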

Contributing

Contributions to this project are welcome. If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.

License

This project is licensed under the MIT License.

Acknowledgments

Northcoders for providing the project specification and guidance.
AWS Documentation for its comprehensive coverage of AWS services.
Terraform Documentation for its reference material and examples.

For more information, refer to the documentation.

