Closed connor1q closed 2 years ago
That’s a great point – I am also somewhat wary of Jupyter in terms of version control – they aren’t particularly conducive to being part of an ongoing maintained codebase. However, they are fantastic tools for EDA, model building, report creation (I use R markdown for the same purpose) so they do have real value when used appropriately/securely.
EDIT: tidied up for clarity
Hi Connor, good to know! I've contacted the Reproducible Analytical Pathways champion at ONS for further clarification and some appropriate wording. I'll update once I hear back!
Jupyter notebooks need some effort to make them easy to share. They suffer from three main problems:
Our package, best-practice-and-impact/govcookiecutter, goes some way to alleviate the former issue by cleaning notebooks using pre-commit hooks first. However, we think that these issues are severe enough that teams releasing analysis should be wary about relying on Jupyter Notebook's solely.
They are useful (and I use them a lot) for exploratory analysis, with adequate safety measures. But I would not recommend that they be used in production.
I'd recommend removing reference to notebooks here.
As advised that reference has been pulled: we might want to have a subsection about notebooks in next year's version, but we can check in on the situation nearer the time.
Hi Andrew,
This line jumped out to me:
Using notebooks goes against the recommendations of the ONS best practice team: https://best-practice-and-impact.github.io/qa-of-code-guidance/core_programming.html#notebooks
They go so far as to say:
For our open code work in NHSD we do not use jupyter notebooks. There are a number of downsides (see here) but the biggest problem is the ease with which PIID can be leaked in a notebook.
For instance, if someone prints the top 10 rows of a database in a notebook and then version controls that notebook, anything contained in those rows will be put onto git. There are ways to try to secure this with githooks but that is a 'fail-unsafe' scenario rather than a 'fail-safe' scenario.
I think if you include the recommendation to use notebooks then it would be important to do some strong signposting of the data leakage risks.