SIG: Data Science
Project Name: Engage
Project Description: Forecast COVID-19 impact over time by city (primarily US cities) as new data arrives
Project Objective:
Primary: Provide a data science tool to predict how COVID-19 will continue to spread, or how it may die down over time, using weather-related inputs by location
Secondary: Open source project to train junior data scientists in using tools such as:
Spark (PySpark) for large data handling
Apache Airflow for orchestration
Python
Visualization using a Python visualization module
Help define a data science pipeline for Headstorm
Create docker and airflow image for running all code #13
Create a Docker and Airflow image for running all code:
Python
Deployment/orchestration
Docker
Apache Airflow
ETL:
PySpark
Automate external data source load
Database:
PostgreSQL (or MySQL) to host all data for model training and validation
Packaging:
Create a Docker/Docker Compose image with the following software:
Python 3.7:
install all Python modules used in machine learning
install all modules used in the API and web interface (Flask, etc.)
PySpark - automatically configured in the Docker build process
Apache Airflow latest version
PostgreSQL (or MySQL)
Flask for web interface for forecasting and API
Airflow DAGs: components of the application to be managed by Airflow
Data load from external sources using PySpark
Data aggregation and shaping using PySpark
Features generation and model training using Python 3.7:
refresh model training every month
use available trained object for forecasting and simulations
API runner
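The "refresh model training every month, otherwise use the available trained object" step could be sketched as below. This is only an illustration under stated assumptions, not the project's implementation: the names needs_refresh and load_or_train, the 30-day interval standing in for "every month", and the use of pickle with the file's modification time as the training timestamp are all hypothetical choices. In the pipeline described above, Airflow would call something like load_or_train from the training DAG.

```python
import pickle
from datetime import datetime, timedelta
from pathlib import Path
from typing import Any, Callable, Optional

# Assumption: "every month" is approximated as a fixed 30-day interval.
REFRESH_INTERVAL = timedelta(days=30)


def needs_refresh(model_path: Path, now: Optional[datetime] = None) -> bool:
    """Return True if no trained object exists or it is at least 30 days old."""
    if not model_path.exists():
        return True
    now = now or datetime.now()
    # Assumption: the file's modification time marks when the model was trained.
    trained_at = datetime.fromtimestamp(model_path.stat().st_mtime)
    return now - trained_at >= REFRESH_INTERVAL


def load_or_train(
    model_path: Path,
    train: Callable[[], Any],
    now: Optional[datetime] = None,
) -> Any:
    """Retrain and persist the model when it is stale; otherwise reuse it."""
    if needs_refresh(model_path, now):
        model = train()  # hypothetical training callable supplied by the caller
        with model_path.open("wb") as fh:
            pickle.dump(model, fh)
        return model
    with model_path.open("rb") as fh:
        return pickle.load(fh)
```

Forecasting and simulation code would then call load_or_train with the same path, so it picks up whichever trained object is current without triggering a retrain on every request.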