project-engage / engage

SIG: Data Science

Project Name: Engage

Project Description: Forecast COVID-19 impact over time given new data by city, primarily US cities.

Project Objective:

Primary: Provide a data science tool to predict how COVID-19 will continue to spread, or how it may die down over time, using weather-related inputs by location.

Secondary: Open source project to train junior data scientists in using tools such as:

Spark (PySpark) for large data handling

Apache Airflow for orchestration

Python visualization using a Python visualization module

Help define a data science pipeline for Headstorm

Create docker and airflow image for running all code #13

Closed project-engage closed 4 years ago

project-engage commented 4 years ago

Create docker and airflow image for running all code:

Python

Deployment/orchestration

Docker

Apache Airflow

ETL:

PySpark

Automate external data source load

Database:

PostgreSQL (or MySQL) to host all data for model training and validation

Packaging:

Create a Docker/Docker Compose image with the following software:

Python 3.7:

install all Python modules used in machine learning

install all modules used in the API and web interface (Flask, etc.)

PySpark, automatically configured in the Docker build process

Apache Airflow (latest version)

PostgreSQL (or MySQL)

Flask for web interface for forecasting and API
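A quick way to confirm that an image built from the list above actually contains the Python pieces is to check that each module can be found. The following is only a minimal sketch: the module list is an assumption based on the software named in this issue, and `psycopg2` is a hypothetical choice of PostgreSQL driver that the issue does not name.

```python
import importlib.util

# Import names are assumptions drawn from the software list in this issue.
# "psycopg2" is a hypothetical PostgreSQL driver; the issue does not name one.
REQUIRED_MODULES = ["pyspark", "airflow", "flask", "psycopg2"]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules(REQUIRED_MODULES)` inside the container should return an empty list if the build installed everything; any names it returns point at a step missing from the image.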

Airflow DAGs: component of application to be managed by Airflow

Data load from external sources using PySpark

Data aggregation and shaping using PySpark

Feature generation and model training using Python 3.7:

refresh model training every month

use available trained object for forecasting and simulations

API runner
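The components listed above could be wired together roughly as follows. This is a plain-Python sketch of the ordering and of the monthly retrain rule, not an Airflow DAG: the task names, the dependencies between them, and the 30-day threshold are all assumptions made for illustration, since the issue only lists the components.

```python
from datetime import datetime, timedelta

# Hypothetical task names; each maps to the tasks it is assumed to depend on.
PIPELINE = {
    "load_external_data": [],
    "aggregate_and_shape": ["load_external_data"],
    "train_model": ["aggregate_and_shape"],
    "api_runner": ["train_model"],
}


def run_order(pipeline):
    """Return task names so that every task comes after its upstream tasks."""
    ordered, done = [], set()
    pending = dict(pipeline)
    while pending:
        # A task is ready once all of its upstream tasks have been scheduled.
        ready = sorted(t for t, deps in pending.items() if set(deps) <= done)
        if not ready:
            raise ValueError("cycle or unknown dependency in pipeline")
        for task in ready:
            ordered.append(task)
            done.add(task)
            del pending[task]
    return ordered


def needs_retrain(last_trained, now, max_age=timedelta(days=30)):
    """True when no model exists yet or the saved one is at least max_age old."""
    return last_trained is None or now - last_trained >= max_age
```

In an Airflow deployment each key of `PIPELINE` would become a task and the dependency lists would become upstream relationships. When `needs_retrain` returns `False`, the training step can be skipped and the existing trained object reused for forecasting and simulations.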