pycontw / pycon-etl

PyConTW ETL

Using Airflow to implement our ETL pipelines.

Prerequisites

Installation

There are several tools available to create a virtual environment in Python.

Below are the steps to manage a virtual environment using venv:

  1. Create a Virtual Environment

    To create a virtual environment, run the following command:

    python -m venv venv

    In this example, venv is the name of the virtual environment directory, but you can replace it with any name you prefer.

  2. Activate the Virtual Environment

    After creating the virtual environment, activate it using the following command:

    source venv/bin/activate

  3. Install Dependencies

    After activating the virtual environment, you can install the required dependencies:

    # Install airflow and dev dependencies
    pip install -r requirements.txt -r requirements-dev.txt -c constraints-3.8.txt
    
    # black conflicts with click, so install it separately
    pip install black==19.10b0 click==7.1.2

  4. Deactivate the Virtual Environment

    When you're done working in the virtual environment, you can deactivate it with:

    deactivate
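Before installing dependencies, it can help to confirm that the virtual environment is actually active. The following is a minimal sketch, relying on the VIRTUAL_ENV variable that the venv activation script exports; in_venv is a hypothetical helper name and is not part of this repository.

```shell
# Hypothetical helper: succeeds only when a virtual environment is active.
# The activate script created by `python -m venv` exports VIRTUAL_ENV.
in_venv() {
    [ -n "${VIRTUAL_ENV:-}" ]
}

if in_venv; then
    echo "virtual environment active: ${VIRTUAL_ENV}"
else
    echo "no virtual environment active; run: source venv/bin/activate" >&2
fi
```

Running this check before step 3 avoids installing the pinned packages into the system Python by accident.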

Configuration

  1. For development or testing, run cp .env.template .env.staging. For production, run cp .env.template .env.production.

  2. Follow the instructions in .env.<staging|production> and fill in your secrets. If you are running the staging instance for development as a sandbox and do not need to access any specific third-party services, leaving .env.staging as-is should be fine.

Contact the maintainer if you don't have these secrets.

⚠ WARNING (about .env): Please do not use the .env file for local development, as it might affect the production tables.
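The copy step in the configuration instructions can be wrapped in a small guard so that an existing file that already contains secrets is not overwritten. This is only a sketch, assuming it is run from the repository root where .env.template lives; init_env_file is a hypothetical helper and is not provided by this repository.

```shell
# Hypothetical helper: create .env.<target> from .env.template without
# overwriting a file that may already hold secrets.
init_env_file() {
    target="$1"
    case "$target" in
        staging|production) ;;
        *) echo "usage: init_env_file <staging|production>" >&2; return 2 ;;
    esac
    if [ -e ".env.${target}" ]; then
        echo ".env.${target} already exists; leaving it untouched" >&2
        return 0
    fi
    cp .env.template ".env.${target}"
}
```

For example, init_env_file staging creates .env.staging on the first call and does nothing on later calls. Any argument other than staging or production is rejected, so a plain .env file is never created this way.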

BigQuery (Optional)

Set up authentication for GCP (see https://googleapis.dev/python/google-api-core/latest/auth.html). After running gcloud auth application-default login, the credentials are saved as a JSON file at $HOME/.config/gcloud/application_default_credentials.json. If you have a key file, run export GOOGLE_APPLICATION_CREDENTIALS="/path/to/keyfile.json" to point to it.
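A quick way to avoid exporting a path that does not exist is to check for the file first. The sketch below assumes the default credentials location mentioned above; use_credentials and ADC_FILE are hypothetical names introduced for illustration.

```shell
# Default location written by `gcloud auth application-default login`,
# as stated in the text above.
ADC_FILE="${HOME}/.config/gcloud/application_default_credentials.json"

# Hypothetical helper: export GOOGLE_APPLICATION_CREDENTIALS only if the
# given file (or the default location) actually exists.
use_credentials() {
    keyfile="${1:-$ADC_FILE}"
    if [ ! -f "$keyfile" ]; then
        echo "credentials file not found: $keyfile" >&2
        return 1
    fi
    export GOOGLE_APPLICATION_CREDENTIALS="$keyfile"
}
```

Call use_credentials with no argument to use the default location, or pass an explicit path such as use_credentials /path/to/keyfile.json.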

Running the Project

If you are a developer 👨‍💻, please check the Contributing Guide.

If you are a maintainer 👨‍🔧, please check the Maintenance Guide.

Local Environment with Docker

For development/testing:

# Build the local dev/test image
make build-dev

# Start dev/test services
make deploy-dev

# Stop dev/test services
make down-dev

The difference between production and dev/test compose files is that the dev/test compose file uses a locally built image, while the production compose file uses the image from Docker Hub.

If you are an authorized maintainer, you can pull the image from the GCP Artifact Registry.

First, configure the Docker client to use the GCP Artifact Registry:

gcloud auth configure-docker asia-east1-docker.pkg.dev

Then, pull the image:

docker pull asia-east1-docker.pkg.dev/pycontw-225217/data-team/pycon-etl:{tag}
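To avoid retyping the registry path, the image reference can be built from a tag variable. This is a sketch only: image_ref and REGISTRY_IMAGE are hypothetical names, and the tag you pass must be one that actually exists in the registry.

```shell
# Registry path copied from the pull command above.
REGISTRY_IMAGE="asia-east1-docker.pkg.dev/pycontw-225217/data-team/pycon-etl"

# Hypothetical helper: print the full image reference for a given tag.
image_ref() {
    if [ -z "${1:-}" ]; then
        echo "usage: image_ref <tag>" >&2
        return 2
    fi
    printf '%s:%s\n' "$REGISTRY_IMAGE" "$1"
}
```

With a tag stored in a variable, the pull becomes docker pull "$(image_ref "$TAG")".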

There are several tags available:

Production

Please check the Production Deployment Guide.

Contact

PyCon TW Volunteer Data Team - Discord