Closed dberenbaum closed 1 year ago
From @alex000kim:
Every time this topic comes up with customers or on sales calls, my answer goes something like: “Copy-paste from notebook into python files -> refactor some code to make python files executable + consolidate all parameters into one file -> profit!”
This can be driven by writing the docs, which should give a better idea of how to transition between steps in the tutorial.
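To illustrate the "consolidate all parameters into one file" step, this is the shape it usually takes in a DVC project: a `params.yaml` that the scripts read and that `dvc.yaml` stages reference. A minimal sketch (the parameter names and stage/script names here are hypothetical placeholders, not from the actual tutorial):

```yaml
# params.yaml -- all tunables pulled out of the notebook into one file
train:
  epochs: 10
  lr: 0.001
  batch_size: 32
```

A stage in `dvc.yaml` can then declare `params: [train]` so DVC tracks changes to these values across experiments.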
💯 It's not a coincidence that "From Jupyter Notebook to DVC pipeline for reproducible ML experiments" became our most popular blog post in a short time.
I think a video for this would be great. There are some IPython magic tricks to smooth the process, but they all feel artificial/unrealistic compared with the reality of most notebooks.
This blog post has good takes on this transition that can give some ideas for how to approach this:
I lean towards breaking this tutorial down into 3 parts like:
A lot of the gap between the notebook and the scripts is that the scripts separate training and evaluation, while they are tightly coupled in the notebook. WDYT about creating a script like `train_and_eval.py` that basically dumps the training and eval code from the notebook into a script?
It could be an intermediate commit/step towards the full pipeline, where the modularization process looks like:
Besides being a more gradual transition, it might also show a semi-realistic example of iteration that's not just hyperparameter tuning. It shows one way in which your code or even your stages may change, and how DVC captures those changes.
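To make the `train_and_eval.py` idea concrete, here is a minimal sketch of what that intermediate script could look like: training and eval still coupled in one file, with metrics written out so DVC can track them. The file name comes from the comment above; the toy data and "model" are stand-ins for the real notebook code:

```python
# train_and_eval.py -- notebook code dumped into one script, not yet
# modularized into separate train/eval pipeline stages.
import json
import random


def main():
    random.seed(0)
    # --- data loading (the notebook would read real data here) ---
    data = [(x, 2 * x + random.uniform(-1, 1)) for x in range(100)]
    split = int(len(data) * 0.8)
    train, test = data[:split], data[split:]

    # --- training: least-squares fit of a slope-only toy model ---
    slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

    # --- evaluation in the same script, just as in the notebook ---
    mse = sum((y - slope * x) ** 2 for x, y in test) / len(test)

    # --- write metrics to a file so DVC/DVCLive can pick them up ---
    with open("metrics.json", "w") as f:
        json.dump({"mse": mse}, f)


if __name__ == "__main__":
    main()
```

Splitting this file into `train.py` and `eval.py` (with the model saved in between) is then the natural next commit toward the full pipeline.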
Thoughts @daavoo @alex000kim?
My personal approach would be more like this:
Regardless of the steps, are you suggesting that we have more granular snapshots (git tags) of the whole process, not just `1-notebook-dvclive` and `2-dvc-pipeline`? Or do you mean something else?
Although, now that I think about it, most DL frameworks have some train/test split implemented as part of their data loaders (e.g. via some `valid_pct` parameter). So some real projects I've seen don't split the data into train/test and then run separate train and eval stages. Instead, they load all the data into a data loader, let it do the splitting, and then do training and eval within the same training script.
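For reference, the pattern described above (split happening inside the loader, driven by a `valid_pct`-style parameter, as in fastai's `DataLoaders`) boils down to something like this stdlib-only sketch, where `make_loaders` is a hypothetical helper, not a real library API:

```python
import random


def make_loaders(items, valid_pct=0.2, seed=42):
    """Shuffle and split items into (train, valid) inside the loader,
    so the project never has an explicit split stage or split files."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    n_valid = int(len(items) * valid_pct)
    return items[n_valid:], items[:n_valid]


train_items, valid_items = make_loaders(range(100), valid_pct=0.2)
```

In such projects the DVC pipeline naturally has one train-and-eval stage rather than separate split/train/eval stages.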
It feels to me like too big a gap between the notebook and the pipeline stages. I don’t know how we explain to users how to go from one to the other. Do we need an intermediate stage, like one where the pipeline just executes the notebook, or where we just dump the notebook into a script and run it? Also, can we bridge the gap a little by adding more to the notebook (more model params, custom eval steps) or by taking away some complexity from the final pipeline?
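The "pipeline just executing the notebook" option is possible today, e.g. by running the notebook with `jupyter nbconvert --execute` from a stage. A rough sketch (the notebook, data, and output names are hypothetical):

```yaml
stages:
  train_and_eval:
    cmd: jupyter nbconvert --to notebook --execute notebook.ipynb
    deps:
      - notebook.ipynb
      - data
    outs:
      - model.pkl
```

This would let the tutorial show a working pipeline before any refactoring, at the cost of the notebook remaining a single opaque stage.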