Closed guidopetri closed 3 years ago
This should probably also all live under a folder for each project, of course - e.g. a folder for `win_probability`.
One other thing that's unclear in CCDS: where do you train/test split? I suppose not all models need train/test splits (e.g. Bayesian models, maybe?), so that would be under `model_training.py`... but then can we compare apples to apples with different train/test splits?
I don't think it matters that much in the end, especially with a random seed, but I wanted to point this out.
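For reference, a minimal sketch of what a shared, seeded split could look like, so that different model versions are evaluated on the same rows. The name `split_indices` and its defaults are hypothetical and not taken from the repo; it only uses the standard library.

```python
import random


def split_indices(n_rows, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) for a reproducible random split.

    Hypothetical helper: the same (n_rows, test_fraction, seed) always
    yields the same split, so models can be compared on identical rows.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    rng = random.Random(seed)  # local RNG, global random state is untouched
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = int(round(n_rows * test_fraction))
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx
```

Keeping something like this in one shared module, rather than inside each `model_training.py`, would be one way to make sure every version sees the same split.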
On second thought, I think the versioning should probably be at project level. Having it per-file would make sense for a more software-engineering approach, but when looking at a model I want to see the overview, not have to follow e.g. v3 and find out that it's using the v1 dataset.
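A rough sketch of what project-level versioning could mean on disk, assuming one folder per project and one subfolder per version. The stage names and the `scaffold_version` helper are hypothetical, loosely modelled on the CCDS `src/` subfolders, and not necessarily the layout that ends up in the repo.

```python
from pathlib import Path

# Hypothetical stage names, loosely following CCDS's src/ subfolders.
STAGES = ("features", "models", "visualization")


def scaffold_version(root, project, version):
    """Create <root>/<project>/<version>/<stage>/ and return the version dir."""
    version_dir = Path(root) / project / version
    for stage in STAGES:
        # exist_ok=True makes the call safe to repeat for an existing version
        (version_dir / stage).mkdir(parents=True, exist_ok=True)
    return version_dir
```

For example, `scaffold_version(".", "win_probability", "v2")` would create `win_probability/v2/features`, `win_probability/v2/models` and `win_probability/v2/visualization`, so everything a given version uses sits in one place.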
Did a lot of work moving the v1 version of win probability to this paradigm, now I just have to move v2.
All code has been moved as of 5f231b0. Now for documenting changes and motivation behind the changes.
Documented in cbe729d and added a "Makefile" in 90a3598. I like this code structure.
When working on #40, I realized that my analysis code in a `.ipynb` is not the best way to organize my code. Looking into best practices, it seems Cookiecutter Data Science (what a terrible name!) is rather common as far as project structures in data science go, and it's similar to running `rails new` for Ruby on Rails projects.

I don't think I agree with building out a whole git repo structure complete with docs, environment files, references, tox, etc. but I do like some of the ideas, namely:
- `.env` file for secrets (right now, I'm just using `getpass()`)
- `make` command that runs the analysis from start to finish

Since this is a bit of a larger code refactor than just moving stuff into a new folder, I am creating this issue to track my work.
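To illustrate the `.env` idea, here is a minimal standard-library sketch that reads `KEY=VALUE` pairs from a file and only falls back to `getpass()` when a secret is not found. The names `read_env_file` and `get_secret` are hypothetical; the `python-dotenv` package covers the same ground more robustly.

```python
import os
from getpass import getpass
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    env_file = Path(path)
    if not env_file.is_file():
        return values
    for raw_line in env_file.read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def get_secret(name, env_path=".env"):
    """Look up a secret in the environment, then the .env file, then prompt."""
    if name in os.environ:
        return os.environ[name]
    file_values = read_env_file(env_path)
    if name in file_values:
        return file_values[name]
    return getpass(f"{name}: ")  # interactive fallback, as used today
```

The `.env` file itself would need to be listed in `.gitignore` so the secrets never end up in the repo.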
One other topic that doesn't seem to be covered by CCDS is different model versioning, or model iteration. I suspect the ideal structure for me (heavily basing myself on CCDS) would look like:
(where we just extend versioning to each of `src/features`, `src/models`, `src/visualization`)

(where versioning is per-project, though this will involve a lot of repeated code and directories)

I don't think this is all necessary for the current code refactor, but I do want to separate my code out into versions and different `.py` files. Having a folder for notebooks could be useful but I don't think there would be anything in it after the refactor. Having top-level data and models folders is also probably useful (though they will be listed in the `.gitignore`). The Makefile would be a nice to have, but I'd probably just change the CCDS Makefile and make it simpler.

Summarizing, the tasks are:
- `.ipynb`s to `.py`
- `.env` or similar