hackoregon / 2019-sandbox-data-science

The cookiecutter repository for sandbox data science.
MIT License
0 stars 0 forks source link

Purpose

This is meant for use when you are:

  1. setting up a GitHub data science project structure locally
  2. extracting and reproducing the software setup from Google Colaboratory notebook instances or from Amazon SageMaker

Naming convention for Hack Oregon data science github projects

Different versions of Data Science docker templates

This contains Dockerfile templates in different flavors for getting started on the data science parts of a HackOregon project.

1) master branch contains basic Python based dependencies 2) R branch contains R-based dependencies 3) MLflow-py for experimental Python workflow that uses MLflow 4) others coming soon

What the template does:

  1. set up a recommended folder structure with cookercutter
  2. set up library dependencies for extracting documentation as a website
    • Python: help set up Sphinx for extracting docstring documentation about the APIs
    • R: help set up KnitR and ROxygen2 for extracting the comments from different parts of the R code
  3. set up testing infrastructure for validating the correctness of the code
    • Python: We recommend to use one of the pytest or unittest frameworks
  4. Reproduce library setup from Cloud-based notebook instances e.g.

Recommended folder structure

    ├── LICENSE
    ├── build             <- all the files needed to build the code dependencies
    │   ├── Makefile          <- Makefile with commands like `make data` or `make train`
    │   ├── requirements.txt  <- The requirements file for reproducing the analysis 
    │   │                environment, generated with `pip freeze > requirements.txt`
    │   ├── docker-compose.yml<- The docker-compose file starting resources 
    │   └── Dockerfile        <- The dockerfile that uses requirements.txt file.
    │
    ├── README.md             <- The top-level README for developers using this project.
    │
    ├── data              <- You are encouraged to include links to metadata
    │   ├── 1_raw             <-  Original raw data dump.
    │   ├── 2_interim         <- Intermediate data that has been transformed, 
    │   │                recommended format for relational datais parquet.
    │   └── 3_processed       <- The final, canonical data sets for modeling.
    │
    ├── docs                  <- A default Sphinx project; see sphinx-doc.org for details
    │
    ├── models                <- Trained and serialized models, model predictions, or model summaries
    │
    ├── notebooks             <- Jupyter notebooks. Naming convention is a number (for ordering),
    │                            the creator's initials, and a short `-` delimited description, e.g.
    │                            `1.0-jqp-initial-data-exploration`.
    │
    ├── references            <- Manuals, and all other explanatory materials.
    │
    ├── reports               <- Generated analysis as HTML, PDF, LaTeX, etc.
    │   └── figures           <- Generated graphics and figures to be used in reporting
    │
    │
    ├── setup.py              <- makes project pip installable (pip install -e .) so src can be imported
    ├── src                   <- Source code for use in this project.
    │   ├── __init__.py       <- Makes src a Python module
    │   │
    │   ├── data              <- Scripts to download or generate data
    │   │   └── make_dataset.py
    │   │
    │   ├── features       <- Scripts to turn raw data into features for modeling
    │   │   └── build_features.py
    │   │
    │   ├── models         <- Scripts to train models and then use trained models to make
    │   │   │                 predictions
    │   │   ├── predict_model.py
    │   │   └── train_model.py
    │   │
    │   └── visualization  <- Scripts to create exploratory and results oriented visualizations
    │       └── visualize.py
    │
    └── tox.ini            <- tox file with settings for running tox; see tox.testrun.org

Project based on the cookiecutter data science project template. #cookiecutterdatascience

Data storage in our public S3 bucket

raw-data = hacko-data-archive clean-data = ? # in the future

Storing non-sensitive data to S3 data buckets

Getting non-sensitive data from S3 data buckets

from sagemaker import get_execution_role

role = get_execution_role()
bucket = 'hacko-data-archieve'
# example data key, change this
data_key = '2018-neighborhood-development/JSON/pdx_bicycle/pdx_bike_counts.csv'

data_location = 's3://{}/{}'.format(bucket, data_key)
output_location = 's3://{}/{}'.format(bucket, data_key)

SageMaker

We may spin up allow sagemaker instances for projects with big compute and / or data needs.

Past version of the Docker container template

https://github.com/hackoregon/data-science-pet-containers

Using AWS using CLI

Put your credentials in

~/.aws/credential