Closed dberenbaum closed 1 year ago
From @alex000kim:
Every time this topic comes up with customers or on sales calls, my answer goes something like: “Copy-paste from notebook into python files -> refactor some code to make python files executable + consolidate all parameters into one file -> profit!”
This can be driven by writing the docs, which should give a better idea of how to transition between steps in the tutorial.
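To illustrate the "consolidate all parameters into one file" step, this is the shape it usually takes in a DVC project: a `params.yaml` that the scripts read and that `dvc.yaml` stages reference. A minimal sketch (the parameter names and stage/script names here are hypothetical placeholders, not from the actual tutorial):

```yaml
# params.yaml -- all tunables pulled out of the notebook into one file
train:
  epochs: 10
  lr: 0.001
  batch_size: 32
```

A stage in `dvc.yaml` can then declare `params: [train]` so DVC tracks changes to these values across experiments.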
💯 It's not a coincidence that "From Jupyter Notebook to DVC pipeline for reproducible ML experiments" became our most popular blog post in a short time.
I think a video for this would be great. There are some IPython magic tricks to smooth the process, but they all feel artificial/unrealistic compared with the reality of most notebooks.
This blog post has good takes on this transition that can give some ideas for how to approach this:
I lean towards breaking this tutorial down into 3 parts like:
A lot of the gap between the notebook and the scripts is that the scripts separate training and evaluation, while they are tightly coupled in the notebook. WDYT about creating a script like `train_and_eval.py` that basically dumps the training and eval code from the notebook into a script?
It could be an intermediate commit/step towards the full pipeline, where the modularization process looks like:
Besides being a more gradual transition, it might also show a semi-realistic example of iteration that's not just hyperparameter tuning. It shows one way in which your code or even your stages may change, and how DVC captures those changes.
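To make the `train_and_eval.py` idea concrete, here is a minimal sketch of what that intermediate script could look like: training and eval still coupled in one file, with metrics written out so DVC can track them. The file name comes from the comment above; the toy data and "model" are stand-ins for the real notebook code:

```python
# train_and_eval.py -- notebook code dumped into one script, not yet
# modularized into separate train/eval pipeline stages.
import json
import random


def main():
    random.seed(0)
    # --- data loading (the notebook would read real data here) ---
    data = [(x, 2 * x + random.uniform(-1, 1)) for x in range(100)]
    split = int(len(data) * 0.8)
    train, test = data[:split], data[split:]

    # --- training: least-squares fit of a slope-only toy model ---
    slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

    # --- evaluation in the same script, just as in the notebook ---
    mse = sum((y - slope * x) ** 2 for x, y in test) / len(test)

    # --- write metrics to a file so DVC/DVCLive can pick them up ---
    with open("metrics.json", "w") as f:
        json.dump({"mse": mse}, f)


if __name__ == "__main__":
    main()
```

Splitting this file into `train.py` and `eval.py` (with the model saved in between) is then the natural next commit toward the full pipeline.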
Thoughts @daavoo @alex000kim?
My personal approach would be more like this:
Regardless of the steps, are you suggesting that we have more granular snapshots (git tags) of the whole process, not just `1-notebook-dvclive` and `2-dvc-pipeline`? Or do you mean something else?
Although, now that I think about it, most DL frameworks have some train/test split implemented as part of their data loaders (e.g. via some `valid_pct` parameter). So some real projects I've seen don't split the data into train/test and then run separate train and eval stages. Instead, they load all the data into a data loader, let it do the splitting, and then do training and eval within the same training script.
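For reference, the pattern described above (split happening inside the loader, driven by a `valid_pct`-style parameter, as in fastai's `DataLoaders`) boils down to something like this stdlib-only sketch, where `make_loaders` is a hypothetical helper, not a real library API:

```python
import random


def make_loaders(items, valid_pct=0.2, seed=42):
    """Shuffle and split items into (train, valid) inside the loader,
    so the project never has an explicit split stage or split files."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    n_valid = int(len(items) * valid_pct)
    return items[n_valid:], items[:n_valid]


train_items, valid_items = make_loaders(range(100), valid_pct=0.2)
```

In such projects the DVC pipeline naturally has one train-and-eval stage rather than separate split/train/eval stages.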
It feels to me like too big a gap between the notebook and the pipeline stages. I don’t know how we explain to users how to go from one to the other. Do we need an intermediate stage, like one where the pipeline just executes the notebook, or where we just dump the notebook into a script and run it? Also, can we bridge the gap a little by adding more to the notebook (more model params, custom eval steps) or by taking away some complexity from the final pipeline?
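The "pipeline just executing the notebook" option is possible today, e.g. by running the notebook with `jupyter nbconvert --execute` from a stage. A rough sketch (the notebook, data, and output names are hypothetical):

```yaml
stages:
  train_and_eval:
    cmd: jupyter nbconvert --to notebook --execute notebook.ipynb
    deps:
      - notebook.ipynb
      - data
    outs:
      - model.pkl
```

This would let the tutorial show a working pipeline before any refactoring, at the cost of the notebook remaining a single opaque stage.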