hackalog / easydata

A flexible template for doing reproducible data science in Python.
MIT License

Including optional testing infrastructure for notebooks (and CI boilerplate in general) #134

Open acwooding opened 4 years ago

acwooding commented 4 years ago

Copied from my work in progress post (easier than re-writing it):

Have you ever run someone else's notebook only to get stymied by a variable name that isn't defined? This problem comes down to the fact that notebooks typically aren't run in a linear fashion. Non-linearity is an amazing benefit when developing code, but a terrible inconvenience when you go to run someone else's notebook and hit an undefined variable. It's easy to check in a notebook with a renamed variable (the old name is still live in your kernel, but won't be in your future user's), or with a variable that is used before it is defined.

How do you prevent this? Tricky. As a general practice, we have a human-based solution: Kernel -> Restart & Clear Output, then Cell -> Run All. This catches those errors, but it depends on a human remembering to do it. I do this religiously and still forget on occasion. Better to do it as part of your CI.

But then you need CI running on the repo where you're developing notebooks. (Which you should also do in the name of reproducibility!)

XXX include a discussion on possible options for running notebooks in testing suites.
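
For example, the simplest option (a standard nbconvert invocation, nothing Easydata-specific) is to execute each notebook top-to-bottom and fail on the first error:

jupyter nbconvert --to notebook --execute notebook_name.ipynb

nbval (discussed below) goes a step further and also compares the freshly generated outputs against the checked-in ones.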

And then there's the whole nightmare of checking in notebooks so that they're not a snotty mess for revision control. We use Kernel -> Restart & Clear Output before ever checking in a notebook.
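
For what it's worth, that clearing step can also be scripted (again, a standard nbconvert flag rather than anything Easydata-specific):

jupyter nbconvert --clear-output --inplace notebook_name.ipynb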

XXX but that doesn't play well with testing for notebooks.

hackalog commented 4 years ago

I should note that we already include nbval, so assuming you have checked in a notebook with output values:

py.test --nbval notebook_name.ipynb

should re-run the notebook and check that its output doesn't change.
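
(If outputs are cleared before check-in, nbval has nothing to compare against. Its lax mode re-runs the notebook, fails on errors, and only compares outputs for cells explicitly marked with a # NBVAL_CHECK_OUTPUT comment.)

py.test --nbval-lax notebook_name.ipynb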

acwooding commented 4 years ago

What about the revision control issue? For notebooks, checking in output makes diffs insane... I suppose that's why you use the notebook-based git diff, which we also have installed. Also, what about the outputs of randomized algorithms?
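
(Aside: if the notebook-aware diff tool in question is nbdime, which is an assumption on my part, its git integration is enabled with nbdime config-git --enable, after which git diff handles .ipynb files with a notebook-aware diff.)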

hackalog commented 4 years ago

For randomized algorithms, you should be specifying a seed. We used a trick elsewhere: we generated an Experiment (which included the PRNGs) and used that experiment class to generate seeds for our various trials. I'll see if I can dig it up.
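
The gist was roughly the following (a minimal sketch built on numpy's SeedSequence, not the actual Experiment class; names are illustrative):

import numpy as np

class Experiment:
    """Hold a master seed and hand out independent, reproducible per-trial RNGs."""

    def __init__(self, master_seed=42):
        self.seed_seq = np.random.SeedSequence(master_seed)

    def trial_rngs(self, n_trials):
        # spawn() yields statistically independent child seed sequences
        return [np.random.default_rng(child) for child in self.seed_seq.spawn(n_trials)]

# Every run of the notebook reproduces the same per-trial random streams
experiment = Experiment(master_seed=42)
rng_a, rng_b, rng_c = experiment.trial_rngs(3)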

hackalog commented 4 years ago

If the output of the notebook is generally the same from run to run, Restart & Clear Output then Run All (and then checking in) might be the best approach. Obviously, we don't need to run nbval on all the notebooks, but we should have a mechanism to do so (make test_notebooks, or some such?).
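
A minimal target along those lines might look like this (hypothetical; assumes notebooks live under notebooks/):

test_notebooks:
	py.test --nbval notebooks/*.ipynb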