sparcopen / open-research-doathon

Open Research Data do-a-thon in London & Virtual - March 4th & 5th

Analyze all Jupyter notebooks mentioned in PubMed Central #25

Closed Daniel-Mietchen closed 10 months ago

Daniel-Mietchen commented 7 years ago

Jupyter notebooks are a popular vehicle these days to share data science workflows. To get an idea of best practices in this regard, it would be good to analyze a good number of them in terms of their reproducibility and other aspects of usability (e.g. documentation, ease of reuse).

A search in PubMed Central (PMC) reveals the following results:

With currently just 102 hits, a systematic reanalysis seems entirely doable and could perhaps itself be documented by way of reproducible notebooks that might eventually end up being mentioned in PMC.

A good starting point here could be An open RNA-Seq data analysis pipeline tutorial with an example of reprocessing data from a recent Zika virus study, for which both a Jupyter notebook and a Docker image are available.

I plan to give a lightning talk on this. Some background is in this recent news piece.

rossmounce commented 7 years ago

A few notes...

With EuropePMC a search for ipynb OR jupyter gives 107 results: http://europepmc.org/search?query=jupyter%20OR%20ipynb

I find it extremely interesting that Europe PMC has the full text for 102 of these 107 articles/preprints:

(jupyter OR ipynb) AND (HAS_FT:Y)

This suggests that Jupyter/IPython notebooks are almost exclusively associated with open-friendly journals. Or perhaps it reflects a bias: Europe PMC is legally barred from full-text search on 'closed' journals, so mentions of jupyter/ipynb there simply cannot be found.
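For anyone scripting this rather than using the web UI, the same query can be sent to the Europe PMC REST API. A minimal sketch (endpoint and parameter names per Europe PMC's REST documentation; the helper name is my own):

```python
from urllib.parse import urlencode

# Sketch: build a Europe PMC REST search URL for the query above.
BASE = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def epmc_search_url(query, fmt="json", page_size=100):
    """Return a Europe PMC search URL for the given query string."""
    return BASE + "?" + urlencode({"query": query, "format": fmt, "pageSize": page_size})

# The full-text-restricted query from above:
url = epmc_search_url("(jupyter OR ipynb) AND (HAS_FT:Y)")
print(url)
```

Fetching that URL returns the hit count and metadata as JSON, paginated by pageSize.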

rossmounce commented 7 years ago

R code to fetch bibliographic metadata for each of those 107 hits from Europe PMC:

install.packages('europepmc')
library(europepmc)
hits <- epmc_search(query = 'jupyter OR ipynb', synonym = TRUE, limit = 200)
dim(hits)
names(hits)
write.csv(hits, file = "107hits.csv")

I've also made available the resulting CSV as an editable spreadsheet via GDocs: https://docs.google.com/spreadsheets/d/1txg0u9zARHrLkY4MYuz5vmsVCZUOItgybEHqx13Bkbc/edit?usp=sharing

Perhaps with this sheet we can assign who takes responsibility for which papers?

Daniel-Mietchen commented 7 years ago

That's a great starting point — thanks!

npscience commented 7 years ago

+1 from me. Interested to contribute and to see the output.

Daniel-Mietchen commented 7 years ago

We've taken Ross' spreadsheet and added some columns for documenting the problems we encountered.

The "Code in problem cell" column records the notebook code causing the first problem, and the "Problem" column gives more details. So far, essentially none of the notebooks ran through. We normally stopped after the first such error and moved on to the next notebook, but for one rather complex notebook we tried to push through to the end, which we have not yet reached.

Daniel-Mietchen commented 7 years ago

I've also added a column for the PMC URL to reduce the fiddling with URLs.

Daniel-Mietchen commented 7 years ago

I notified the Jupyter mailing list: https://groups.google.com/forum/#!topic/jupyter/6pQIarRmrsc .

mrw34 commented 7 years ago

Here's a write-up of our efforts:

https://markwoodbridge.com/2017/03/05/jupyter-reproducible-science.html

Many thanks to @Daniel-Mietchen for the original idea, and for all the help over the weekend!

Daniel-Mietchen commented 7 years ago

@mrw34 Thanks - I'll dive right into it.

Daniel-Mietchen commented 7 years ago

I found one that actually ran through, albeit after a warning about an old kernel: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4940747/bin/13742_2016_135_MOESM3_ESM.ipynb . A very simple notebook to test a random number generator, but hey, it works!

To celebrate the event, I introduced color coding to the spreadsheet: red for cases where the run resulted in an error, green when it did not.

Daniel-Mietchen commented 7 years ago

Here's a notebook shared only as a screenshot, from a paper about reproducibility: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5014984/figure/figure1/ .

Just added yellow to the spreadsheet for cases like this, where the notebook neither produced errors nor ran through - here because there is no actual notebook to run, only a screenshot.

Daniel-Mietchen commented 7 years ago

There is a nice "Ten simple rules" series in PLOS Computational Biology: http://collections.plos.org/ten-simple-rules . Perhaps we should do one on "how to share Jupyter notebooks"?

They already have Ten Simple Rules for Reproducible Computational Research and Ten Simple Rules for Cultivating Open Science and Collaborative R&D as well as other somewhat related articles, but none of them seem to touch upon Jupyter notebooks.

Daniel-Mietchen commented 7 years ago

Some comments relevant for here are also in https://github.com/sparcopen/open-research-doathon/issues/41#issuecomment-284239044 .

Daniel-Mietchen commented 7 years ago

The above close was just as part of the wrap-up of the doathon. I will keep working on it and document my progress over at https://github.com/Daniel-Mietchen/ideas/issues/2 .

rossmounce commented 7 years ago

@mrw34 @Daniel-Mietchen excellent write-up!

If licensing allows could you upload somewhere all the .ipynb notebooks you found that were related to those 107 papers?

Daniel-Mietchen commented 7 years ago

@rossmounce I plan to do that but haven't yet checked for licensing (added column AH for that).

The notebook URLs are in column AD, which currently has the following list:

Daniel-Mietchen commented 7 years ago

Mark's write-up is now up at https://markwoodbridge.com/2017/03/05/jupyter-reproducible-science.html .

Daniel-Mietchen commented 7 years ago

There is a validator tool for Jupyter notebooks: https://github.com/jupyter/nbformat/blob/master/nbformat/validator.py
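To get a rough feel for what that validator enforces, here is a stdlib-only sketch that checks a notebook's top-level v4 structure. The real validator applies a full JSON Schema; this only mirrors the required top-level keys of the v4 format:

```python
import json

# Minimal structural check for a v4 notebook, loosely mirroring what
# nbformat's JSON-Schema validator enforces (sketch only).
REQUIRED_TOP_LEVEL = {"cells", "metadata", "nbformat", "nbformat_minor"}

def looks_like_v4_notebook(source):
    """Return True if `source` parses as JSON with the v4 top-level keys."""
    try:
        nb = json.loads(source)
    except ValueError:
        return False
    return isinstance(nb, dict) and REQUIRED_TOP_LEVEL <= nb.keys() and nb.get("nbformat") == 4

minimal = json.dumps({"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 2})
print(looks_like_v4_notebook(minimal))  # True
```

For real use, nbformat's own validate() is of course the right tool; this just illustrates the kind of check involved.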

Daniel-Mietchen commented 7 years ago

Here is a discussion about using Jupyter notebooks programmatically, with a useful demo.

Daniel-Mietchen commented 7 years ago

I am thinking of submitting this to JupyterCon — submission deadline March 14. Anyone in?

rossmounce commented 7 years ago

@Daniel-Mietchen I'd be happy to help you prepare the abstract submission and do a bit more analysis, but I can't go to the meeting :) Does that count as 'in'?

Daniel-Mietchen commented 7 years ago

That's "in enough" for my taste. Don't know whether I can go either, but the point is to reach out to the Jupyter community and to help do something about these issues, e.g. by refining the recommendations and perhaps offering some validation mechanism (think Schematron for XML).

mrw34 commented 7 years ago

I can help with setting up a minimal "best practice" git repo including a notebook, a requirements file, and a script to automate testing the notebook whenever it's pushed. This would work well with GitLab, which offers a free CI and build-status notification service.

By the way, I'm not sure that "validation" has much of a role to play here, at least not in the sense of XML/JSON validation (i.e. nbformat) - we didn't find any notebooks that were syntactically incorrect. I think that execution (with nbconvert --execute) is likely the only way to test the notebooks for the failures we're interested in. Some kind of static analysis would obviously be interesting to categorise problems, but rather hard?
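The execution check described above can be scripted. This sketch only assembles the `jupyter nbconvert --execute` command line; the notebook filename is hypothetical, and actually running the command requires Jupyter to be installed:

```python
# Sketch: assemble the `jupyter nbconvert --execute` smoke-test command.
# The notebook name is hypothetical; pass the list to subprocess.run()
# (with Jupyter installed) to actually execute it.
def nbconvert_execute_cmd(notebook_path, timeout=600):
    """Build the argv for executing a notebook via nbconvert."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--ExecutePreprocessor.timeout=%d" % timeout,
        "--output", "executed.ipynb",
        notebook_path,
    ]

print(" ".join(nbconvert_execute_cmd("analysis.ipynb")))
```

A non-zero exit status from the real command indicates that some cell raised an error, which is exactly the failure mode discussed here.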

Daniel-Mietchen commented 7 years ago

@mrw34 that sounds great. Happy to use GitLab for that. Yes, "validation" is probably the wrong word - just feeling my way into this sphere.

Not sure about categorizing the problems yet, but I am aiming at fixing and documenting all issues with all of these notebooks over the next month or so.

rossmounce commented 7 years ago

tbh, should we really expect all 'found' .ipynb notebooks to be reproducible? I think not.

As an author of an .ipynb notebook associated with a preprint: https://peerj.com/preprints/773/

I can say I knew when I published it that it wouldn't be reproducible by others, but that wasn't the point. The content/data I was working on was under copyright and thus it couldn't be shared with the paper or notebook. Thus I knowingly and purposefully used an .ipynb notebook to document the metadata of what I had access to and the operations I ran on those copyrighted files I had access to, demonstrating transparency and "potential reproducibility" (if the reproducers had legitimate access to the same content I had).

We should probably bear edge-cases like this in mind before judging authors too harshly on the apparent widespread phenomenon of non-reproducible .ipynb notebooks. Full reproducibility is not easy!

mrw34 commented 7 years ago

Yep, would be interesting to see how many in the list are like that.

Ultimately the key is probably providing a zero-friction way for authors to opt into continuous verification of notebooks where appropriate.

Daniel-Mietchen commented 7 years ago

@rossmounce @mrw34

I agree - we should not expect all Jupyter notebooks to be reproducible, since there are always exceptions like the ones you outlined. Some of these may be avoidable (e.g. some dependencies on proprietary libraries), others not (e.g. datasets that just are not public).

For me, the point of the exercise was to document the extent to which the notebooks are reproducible (both individually and as a group), as well as the barriers that pop up on the way, and to think about lowering those barriers for notebooks shared in the future.

Daniel-Mietchen commented 7 years ago

I am reopening this, since it seems we're still active here.

tompollard commented 7 years ago

@Daniel-Mietchen let me know if/how I can assist with this. I'm based in Boston, so could probably make it down to JupyterCon if needed.

mrw34 commented 7 years ago

Repository created: https://gitlab.com/mwoodbri/jupyter-ci/tree/master

Verification is performed using nbconvert --execute.

Daniel-Mietchen commented 7 years ago

@mrw34 Thanks - that looks promising. Have never pushed anything to GitLab, but will try as soon as possible.

How do you deal with "could not find kernel for Python 2" type errors?

mrw34 commented 7 years ago

@Daniel-Mietchen It assumes Python 3. The author would need to specify Python 2 in the .yaml file if necessary.

A third-party service would need to deduce the correct version from the .ipynb file itself, which I think is possible, at least for v4-format notebooks.
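Deducing the version from a v4 notebook's metadata is indeed feasible, since the kernelspec and language_info fields are part of the format (though optional, hence the fallbacks below). A stdlib sketch:

```python
import json

# Sketch: read the kernel name / Python version recorded in a v4
# notebook's metadata (kernelspec and language_info are optional).
def notebook_python_version(source):
    """Return (kernel_name, language_version) from notebook JSON."""
    nb = json.loads(source)
    meta = nb.get("metadata", {})
    kernel = meta.get("kernelspec", {}).get("name", "unknown")
    version = meta.get("language_info", {}).get("version", "unknown")
    return kernel, version

sample = json.dumps({
    "cells": [], "nbformat": 4, "nbformat_minor": 2,
    "metadata": {
        "kernelspec": {"name": "python2", "display_name": "Python 2"},
        "language_info": {"name": "python", "version": "2.7.13"},
    },
})
print(notebook_python_version(sample))  # ('python2', '2.7.13')
```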

Daniel-Mietchen commented 7 years ago

Just found out about a project that aims specifically at promoting the use of Jupyter notebooks for reproducible science: https://github.com/Reproducible-Science-Curriculum .

Daniel-Mietchen commented 7 years ago

While our PubMed search found a number of false positives (i.e. papers with no associated notebook), there are also some false negatives (i.e. papers with an associated notebook that do not show up in the search).

I just found this example: http://nbviewer.jupyter.org/gist/pschloss/9815766/notebook.ipynb , which is associated with https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4139711/ but not mentioned there.

mrw34 commented 7 years ago

Hmm. Looks like an issue with the PubMed version of the article. The source does include a reference to the notebook: http://www.nature.com/nature/journal/v509/n7500/full/nature13178.html

jangondol commented 7 years ago

@mrw34 says: "The Dockerfile refers to and downloads a couple of tarballs from third-party sources. This is perfectly acceptable, especially as some dependencies may ultimately not be redistributable, and it reduces the size of the source repository, but does assume that these links remain valid in the long term."

Before the web at large fully embraces content-based addressing, we could formulate a recommendation to the authors to use the Wayback Machine or a similar mechanism. So when referencing tarballs etc., instead of referring to the live copy on the web (which is subject to bitrot), the authors should refer to a version saved in the Web Archive which is less ephemeral (not completely permanent but better than linking to the live content). There are some downsides (e.g. if people use "naive" upload to web archive via their website, large files will time out) but it could generally work.

But what about those who do NOT follow the recommendation? One approach would be to write a bot (cc: @Daniel-Mietchen) that crawls new notebooks and auto-saves their external dependencies to the Web Archive. A simple run-time tool could also be written that pre-processes Dockerfiles, checks for (missing) external dependencies and updates the links with Web Archive references. Perhaps the Archive Team could also help with this?
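The link-rewriting pre-processor could be quite small. A sketch, assuming a snapshot of each tarball has already been captured (the timestamp below is a hypothetical snapshot ID; a real tool would first request one via the Wayback Machine's save mechanism):

```python
import re

# Sketch: rewrite plain tarball URLs in a Dockerfile to Wayback Machine
# links of the form https://web.archive.org/web/<timestamp>/<url>.
# The timestamp is a hypothetical, pre-captured snapshot ID.
SNAPSHOT = "20170305000000"

def archive_tarball_urls(dockerfile_text, timestamp=SNAPSHOT):
    """Replace bare archive URLs with Web Archive equivalents."""
    pattern = re.compile(r"https?://\S+\.(?:tar\.gz|tgz|tar\.bz2|zip)")
    return pattern.sub(
        lambda m: "https://web.archive.org/web/%s/%s" % (timestamp, m.group(0)),
        dockerfile_text,
    )

dockerfile = "RUN curl -O http://example.org/deps/tool-1.0.tar.gz\n"
print(archive_tarball_urls(dockerfile))
```

The example Dockerfile line and URL are illustrative only; a production version would also need to skip URLs that already point at the archive.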

This is just initial thinking about possible ways to address the issue of broken dependencies. If you like the idea, feel free to open a new issue and develop this further.

schmelling commented 7 years ago

I did not read through all of it, but from Mark's blog post I assume the issue is not yet resolved.

"There’s no means for notebooks to directly express the dependencies they require."

Sebastian Raschka (@rasbt) wrote a Jupyter magic extension, watermark, to document dependencies: https://github.com/rasbt/watermark . I use it when publishing my notebooks - see https://github.com/schmelling/reciprocal_BLAST/blob/master/notebooks/2_KaiABC_BLAST_Heatmap.ipynb for an example.

I hope that is useful to you.

Cheers, Nic

ketch commented 7 years ago

If you'd like to point to an example of a repo that is doing continuous integration testing of notebooks, here is one (shameless self-promotion):

https://github.com/clawpack/riemann_book

It's not for a paper; we're writing a book in Jupyter notebooks. We have a reasonably complex chain of dependencies so setting up continuous integration is non-trivial. We're not currently testing anything for correctness; we're just testing that all notebooks run without errors in Python 2 and 3. See

https://github.com/clawpack/riemann_book/blob/master/test.py

for the code that executes the notebooks.

mrw34 commented 7 years ago

@ketch Nice! The scikit-bio cookbook takes a similar approach: https://github.com/biocore/scikit-bio-cookbook

rasbt commented 7 years ago

@ketch I'm using something similar for my repos (e.g., https://github.com/rasbt/deep-learning-book/blob/master/code/tests/test_notebooks.py), where documentation stored in Jupyter notebooks gets converted to Markdown/HTML for the websites.

I have not experimented with "real" unit tests yet, though. So far it just checks whether a cell throws an error or not (which will cause the build to fail via Travis CI). It would probably be better to also check whether, e.g., Python 2.7 and 3.6 produce the same output in each cell (with respect to integer division and the like). I guess that's more involved, though ... has anyone tried it yet?

ketch commented 7 years ago

@rasbt Sounds like we are at the same stage in this. There are a couple of packages designed to help test notebooks:

https://github.com/computationalmodelling/nbval https://github.com/bollwyvl/nosebook https://gist.github.com/minrk/2620735

I haven't tried any of them yet.

rasbt commented 7 years ago

@ketch Thanks for the references! The script looks neat; I'll have to give it a try one day (I was almost tempted to write it from scratch ;))

Daniel-Mietchen commented 7 years ago

@rossmounce @npscience @mrw34 @tompollard @jangondol @rasbt @ketch @schmelling . The deadline for submissions to JupyterCon is about a day away, so I will work on it tonight.

Pointers in https://github.com/Daniel-Mietchen/events/blob/master/JupyterCon-2017.md and the associated Gdoc. You're all most welcome to join in.

mrw34 commented 7 years ago

Another potential forum for presentation: https://www.software.ac.uk/c4rr/cfp

Daniel-Mietchen commented 7 years ago

Couldn't get to it earlier, but am in the Gdoc right now to draft a submission (6h left). Happy to look into other contexts for presentation as well.

Daniel-Mietchen commented 7 years ago

The PMC / Europe PMC numbers are up to 111 and 113 hits, respectively.

Daniel-Mietchen commented 7 years ago

I'm done with a first pass for all form fields. Will read Mark's post again and see whether that gives reason to change anything.

Daniel-Mietchen commented 7 years ago

There are currently 73 hits for the Google search site:arxiv.org ipynb.

Daniel-Mietchen commented 7 years ago

I have submitted: https://github.com/Daniel-Mietchen/events/blob/master/JupyterCon-2017.md .

Daniel-Mietchen commented 7 years ago

An additional article for which Jupyter notebooks are available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4428227/ . It states

The empirical and simulated data, as well as the python code used for simulation and analyses, have been deposited in the Dryad repository (datadrayd.org; doi:10.5061/dryad.ht0hs upon publication).

At that DOI, there is a ZIP file described as

IPython notebooks containing the simulation and analyses code for the manuscript (named simulations.ipynb and analyses.ipynb respectively) and empirical and simulated data used in the manuscript.