nhsx / open-source-policy

Open Source Policy development for the NHS
Other
47 stars 11 forks source link

Notebooks #15

Closed connor1q closed 2 years ago

connor1q commented 2 years ago

Hi Andrew,

This line jumped out to me:

For those able to take advantage of them, open notebooks (like Jupyter) and reproducible analytical pipelines are the gold standard.

Using notebooks goes against the recommendations of the ONS best practice team: https://best-practice-and-impact.github.io/qa-of-code-guidance/core_programming.html#notebooks

They go so far as to say:

"In short, notebooks are not suitable for modularising analysis pipelines [...] they should not be used as the main method of actually generating outputs"

For our open code work in NHSD we do not use jupyter notebooks. There are a number of downsides (see here) but the biggest problem is the ease with which PIID can be leaked in a notebook.

For instance, if someone prints the top 10 rows of a database in a notebook and then version controls that notebook, anything contained in those rows will be put onto git. There are ways to try to secure this with githooks but that is a 'fail-unsafe' scenario rather than a 'fail-safe' scenario.

I think if you include the recommendation to use notebooks then it would be important to do some strong signposting of the data leakage risks.

wbryant commented 2 years ago

That’s a great point – I am also somewhat wary of Jupyter in terms of version control – they aren’t particularly conducive to being part of an ongoing maintained codebase. However, they are fantastic tools for EDA, model building, report creation (I use R markdown for the same purpose) so they do have real value when used appropriately/securely.

EDIT: tidied up for clarity

otlah commented 2 years ago

Hi Connor, good to know! I've contacted the Reproducible Analytical Pathways champion at ONS for further clarification and some appropriate wording. I'll update once I hear back!

alexander-newton commented 2 years ago

Jupyter notebooks need some effort to make them easy to share. They suffer from three main problems:

  1. It's very easy to disclose PII through them if you're not careful - as Connor alluded to.
  2. They require extra steps to version control well - you need to use tools like nbdime - and even when you've implemented the right tools they still don't get on with github's diff functionality.
  3. They pose challenges in maintaining code integrity as cells can be run out of order, yet rely on the order that they are run.

Our package, best-practice-and-impact/govcookiecutter, goes some way to alleviate the former issue by cleaning notebooks using pre-commit hooks first. However, we think that these issues are severe enough that teams releasing analysis should be wary about relying on Jupyter Notebook's solely.

They are useful (and I use them a lot) for exploratory analysis, with adequate safety measures. But I would not recommend that they be used in production.

I'd recommend removing reference to notebooks here.

otlah commented 2 years ago

As advised that reference has been pulled: we might want to have a subsection about notebooks in next year's version, but we can check in on the situation nearer the time.