# 2019-{project-name}-{data-science}

This repository contains Dockerfile templates in different flavors for getting started on the data science parts of a HackOregon project. It is meant for use when you are working from Google Colaboratory notebook instances or from Amazon SageMaker.
Branches:

1) `master` branch contains basic Python-based dependencies
2) `R` branch contains R-based dependencies
3) `MLflow-py` branch contains an experimental Python workflow that uses MLflow
4) others coming soon
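To pick a flavor, check out the matching branch after cloning. A minimal sketch (the repository URL below is a placeholder, not a real address):

```shell
# clone your project's copy of the template (placeholder URL)
git clone https://github.com/hackoregon/2019-myproject-data-science.git
cd 2019-myproject-data-science

# switch to the R or MLflow-py flavor; stay on master for plain Python
git checkout R
```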
Documentation:

* Python: the `cookiecutter` template helps set up Sphinx for extracting docstring documentation about the APIs
* R: the template helps set up KnitR and ROxygen2 for extracting the comments from different parts of the R code
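For the Python side, Sphinx pulls API documentation straight from docstrings. A minimal sketch of a documentable function (the module and function names here are hypothetical, not part of the template):

```python
# src/data/load.py -- hypothetical module for illustration
def load_counts(path):
    """Load a raw counts file into a list of rows.

    :param path: location of the raw CSV file
    :returns: list of rows, one list of strings per record
    """
    import csv
    with open(path, newline="") as f:
        return list(csv.reader(f))
```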
Testing:

* Python: we recommend using one of the pytest or unittest frameworks (a minimal sketch follows the list below)

Supported notebook environments:

* AWS SageMaker
* Google Cloud Colaboratory
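The pytest sketch referenced above (the module and function names are hypothetical examples, not files shipped with the template):

```python
# tests/test_features.py -- hypothetical example
def add_trip_counts(a, b):
    # stand-in for a real helper that would live under src/
    return a + b

def test_add_trip_counts():
    # pytest collects any function whose name starts with ``test_``
    assert add_trip_counts(2, 3) == 5
```

Run it with `pytest tests/` from the project root.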
├── LICENSE
├── build <- all the files needed to build the code dependencies
│ ├── Makefile <- Makefile with commands like `make data` or `make train`
│ ├── requirements.txt <- The requirements file for reproducing the analysis
│ │ environment, generated with `pip freeze > requirements.txt`
│ ├── docker-compose.yml<- The docker-compose file starting resources
│ └── Dockerfile <- The dockerfile that uses requirements.txt file.
│
├── README.md <- The top-level README for developers using this project.
│
├── data <- You are encouraged to include links to metadata
│ ├── 1_raw <- Original raw data dump.
│ ├── 2_interim <- Intermediate data that has been transformed,
│   │                     recommended format for relational data is Parquet.
│ └── 3_processed <- The final, canonical data sets for modeling.
│
├── docs <- A default Sphinx project; see sphinx-doc.org for details
│
├── models <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),
│ the creator's initials, and a short `-` delimited description, e.g.
│ `1.0-jqp-initial-data-exploration`.
│
├── references <- Manuals, and all other explanatory materials.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated graphics and figures to be used in reporting
│
│
├── setup.py <- makes project pip installable (pip install -e .) so src can be imported
├── src <- Source code for use in this project.
│ ├── __init__.py <- Makes src a Python module
│ │
│ ├── data <- Scripts to download or generate data
│ │ └── make_dataset.py
│ │
│ ├── features <- Scripts to turn raw data into features for modeling
│ │ └── build_features.py
│ │
│ ├── models <- Scripts to train models and then use trained models to make
│ │ │ predictions
│ │ ├── predict_model.py
│ │ └── train_model.py
│ │
│ └── visualization <- Scripts to create exploratory and results oriented visualizations
│ └── visualize.py
│
└── tox.ini <- tox file with settings for running tox; see tox.testrun.org
Project based on the cookiecutter data science project template. #cookiecutterdatascience
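Because `setup.py` makes the project pip installable, notebook code can import from `src` without path hacks. A sketch, assuming each subpackage under `src/` also gets an `__init__.py`:

```python
# run `pip install -e .` once from the project root, then:
from src.features import build_features   # the module shown in the tree above
from src.models import train_model

# the modules are now importable from notebooks, scripts, and tests alike
```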
S3 buckets:

* raw-data = `hacko-data-archive`
* clean-data = ? (coming in the future)
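One way to browse the raw-data bucket from Python is with `boto3` (assuming your AWS credentials are already configured; the prefix below is just the example used in the snippet that follows):

```python
import boto3

# list the first few objects under an example project prefix
s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket="hacko-data-archive",
    Prefix="2018-neighborhood-development/",
    MaxKeys=10,
)
for obj in resp.get("Contents", []):
    print(obj["Key"])
```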
```python
from sagemaker import get_execution_role

# IAM role attached to the notebook instance; SageMaker uses it to access S3
role = get_execution_role()

bucket = 'hacko-data-archive'

# example data key, change this for your project
data_key = '2018-neighborhood-development/JSON/pdx_bicycle/pdx_bike_counts.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)

# example output prefix, change this too (keep outputs separate from inputs)
output_location = 's3://{}/output'.format(bucket)
```
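With `data_location` defined, one way to pull the example file into the notebook is via pandas, which can read `s3://` URLs directly when the `s3fs` package is installed:

```python
import pandas as pd

# read the CSV straight from S3 (requires the s3fs package)
bike_counts = pd.read_csv(data_location)
print(bike_counts.head())
```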
We may spin up SageMaker instances for projects with big compute and/or data needs; name yours following the PROJECTNAME_AUTHOR_NAME convention.
See also: https://github.com/hackoregon/data-science-pet-containers

Put your credentials in `~/.aws/credentials`.
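The file uses the standard AWS shared-credentials format; the key values below are placeholders:

```ini
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
```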