guidopetri / chess-pipeline

Pulling games from the Lichess API into a PostgreSQL database for data analysis.
GNU General Public License v3.0

Organize analysis code #62

Closed: guidopetri closed this issue 3 years ago

guidopetri commented 3 years ago

While working on #40, I realized that keeping my analysis code in a .ipynb is not the best way to organize it. Looking into best practices, Cookiecutter Data Science (what a terrible name!) seems to be a fairly common project structure for data science, and it's similar to running `rails new` for Ruby on Rails projects.

I don't think I agree with building out a whole git repo structure complete with docs, environment files, references, tox, etc., but I do like some of the ideas, namely:

Since this is a bit of a larger code refactor than just moving stuff into a new folder, I am creating this issue to track my work.

One other topic that doesn't seem to be covered by CCDS is model versioning, or model iteration. I suspect the ideal structure for me (based heavily on CCDS) would look like this:

├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         and a short `-` delimited description, e.g. `01-initial-data-exploration`.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
└── src                <- Source code for use in this project.
    ├── __init__.py    <- Makes src a Python module
    │
    ├── data           <- Scripts to download or generate data
    │   └── create_dataset.py
    │
    ├── features       <- Scripts to turn raw data into features for modeling
    │   └── create_features.py
    │
    ├── models         <- Scripts to train models and then use trained models to make
    │   │                 predictions
    │   ├── model_training.py      <- includes train/test split, saves to processed data folder
    │   └── model_prediction.py    <- predicts on each split individually, saves to models folder
    │
    └── visualization  <- Scripts to create exploratory and results-oriented visualizations
        └── create_viz.py

(where we either extend versioning to each of src/features, src/models, and src/visualization individually, or make versioning per-project, though the latter will involve a lot of repeated code and directories)
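To make the src/models scripts concrete, here's a minimal sketch of what model_training.py could contain, assuming pandas, scikit-learn, and joblib; the file names and the model are illustrative, not the actual pipeline code:

```python
from pathlib import Path

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

DATA_DIR = Path('data')
MODELS_DIR = Path('models')


def main():
    # hypothetical feature file produced by create_features.py
    df = pd.read_csv(DATA_DIR / 'interim' / 'features.csv')
    X, y = df.drop(columns=['target']), df['target']

    # the train/test split happens here, with a fixed seed for reproducibility
    x_train, x_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # save the splits to data/processed so that model_prediction.py
    # can predict on each split individually
    x_train.assign(target=y_train).to_csv(
        DATA_DIR / 'processed' / 'train.csv', index=False)
    x_test.assign(target=y_test).to_csv(
        DATA_DIR / 'processed' / 'test.csv', index=False)

    # train and serialize the model into the top-level models folder
    model = LogisticRegression().fit(x_train, y_train)
    joblib.dump(model, MODELS_DIR / 'model.pkl')


if __name__ == '__main__':
    main()
```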

I don't think this is all necessary for the current code refactor, but I do want to separate my code out into versions and different .py files. Having a folder for notebooks could be useful, but I don't think there would be anything in it after the refactor. Having top-level data and models folders is also probably useful (though they will be listed in the .gitignore). The Makefile would be a nice-to-have; I'd probably just take the CCDS Makefile and simplify it (see the sketch below).
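A stripped-down Makefile could be as small as this; the targets are hypothetical and just chain the scripts from the tree above:

```makefile
# hypothetical simplified Makefile, loosely based on the CCDS one
.PHONY: data features train predict

data:
	python src/data/create_dataset.py

features: data
	python src/features/create_features.py

train: features
	python src/models/model_training.py

predict: train
	python src/models/model_prediction.py
```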

Summarizing, the tasks are:

guidopetri commented 3 years ago

This should probably also all live under a folder for each project, of course, e.g. a win_probability folder.

guidopetri commented 3 years ago

One other thing that's unclear in CCDS: where does the train/test split happen? I suppose not all models need a train/test split (e.g. Bayesian models, maybe?), so it would go under model_training.py... but then can we compare apples to apples across models with different train/test splits?

I don't think it matters that much in the end, especially with a fixed random seed, but I wanted to point this out.
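For example, sharing one pinned seed across every model version of a project keeps the splits identical and the metrics directly comparable (a sketch, assuming scikit-learn; the names are illustrative):

```python
from sklearn.model_selection import train_test_split

SPLIT_SEED = 42  # shared across every model version in a project


def split_data(X, y):
    # identical inputs + identical seed -> identical splits,
    # so e.g. v1 and v2 metrics are directly comparable
    return train_test_split(X, y, test_size=0.2, random_state=SPLIT_SEED)
```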

guidopetri commented 3 years ago

On second thought, I think the versioning should probably be at the project level. Per-file versioning would make sense for a more software-engineering-style approach, but when looking at a model I want to see the overview, not have to follow e.g. a v3 model around and find out that it's using the v1 dataset.
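Concretely, project-level versioning could look something like this (a hypothetical layout, repeating the directories per version):

win_probability
├── v1
│   ├── data
│   ├── models
│   └── src
└── v2
    ├── data
    ├── models
    └── src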

guidopetri commented 3 years ago

Did a lot of work moving the v1 version of win probability to this paradigm; now I just have to move v2.

guidopetri commented 3 years ago

All code has been moved as of 5f231b0. Now for documenting the changes and the motivation behind them.

guidopetri commented 3 years ago

Documented in cbe729d and added a "Makefile" in 90a3598. I like this code structure.