Closed: Daniel-Mietchen closed this 10 months ago
A few notes...
With EuropePMC a search for ipynb OR jupyter gives 107 results: http://europepmc.org/search?query=jupyter%20OR%20ipynb
I find it extremely interesting to note that EuropePMC has the full text for 102 of these 107 articles/preprints:
(jupyter OR ipynb) AND (HAS_FT:Y)
This suggests that Jupyter/IPython notebooks are almost exclusively associated with open-friendly journals(?) Or perhaps it is a bias caused by the legally enforced inability to run full-text searches on 'closed' journals, where jupyter/ipynb might be mentioned but can't be found by EuropePMC because they are not allowed to index the full text.
R code to get bibliographic metadata on each of those 107 hits from EuropePMC:
install.packages('europepmc')  # Europe PMC client, run once
library(europepmc)

# Search Europe PMC for mentions of Jupyter/IPython notebooks
hits <- epmc_search(query = 'jupyter OR ipynb', synonym = TRUE, limit = 200)
dim(hits)    # number of records and columns returned
names(hits)  # available bibliographic fields
write.csv(hits, file = "107hits.csv")
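(Side note: for anyone who prefers Python, roughly the same search can be run against the Europe PMC REST API. This is just a rough sketch - the endpoint and parameters are as I remember them, and the JSON field names may need adjusting.)

import requests

# Europe PMC REST search (the same service the europepmc R package wraps).
url = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"
params = {"query": "jupyter OR ipynb", "format": "json", "pageSize": 200}

resp = requests.get(url, params=params, timeout=60)
resp.raise_for_status()
data = resp.json()

print("Total hits:", data.get("hitCount"))
for record in data.get("resultList", {}).get("result", []):
    print(record.get("pmid"), "-", record.get("title"))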
I've also made available the resulting CSV as an editable spreadsheet via GDocs: https://docs.google.com/spreadsheets/d/1txg0u9zARHrLkY4MYuz5vmsVCZUOItgybEHqx13Bkbc/edit?usp=sharing
Perhaps with this sheet we can assign who takes responsibility for which papers?
That's a great starting point — thanks!
+1 from me. Interested to contribute and to see the output.
We've taken Ross' spreadsheet and added some columns for documenting the problems we ran into.
The "Code in problem cell" column documents the notebook code causing the first problem, and the "Problem" column gives more details. So far, basically none of the notebooks ran through: we normally stopped after the first such error and went on to the next notebook, but for one rather complex notebook we tried to go through to the end, which we have not reached yet.
I've also added a column for the PMC URL to reduce the fiddling with URLs.
I notified the Jupyter mailing list: https://groups.google.com/forum/#!topic/jupyter/6pQIarRmrsc .
Here's a write-up of our efforts:
https://markwoodbridge.com/2017/03/05/jupyter-reproducible-science.html
Many thanks to @Daniel-Mietchen for the original idea, and for all the help over the weekend!
@mrw34 Thanks - I'll go right into it.
I found one that actually ran through, albeit after a warning about an old kernel: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4940747/bin/13742_2016_135_MOESM3_ESM.ipynb . A very simple notebook to test a random number generator, but hey, it works!
To celebrate the event, I introduced color coding to the spreadsheet: red for cases where the run resulted in an error, green when it did not.
Here's a notebook shared only as a screenshot, from a paper about reproducibility: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5014984/figure/figure1/ .
Just added yellow to the spreadsheet for cases like this, where the notebook neither produced errors nor ran through, e.g. because there is no runnable notebook in the first place.
There is a nice "Ten simple rules" series in PLOS Computational Biology: http://collections.plos.org/ten-simple-rules . Perhaps we should do one on "how to share Jupyter notebooks"?
They already have Ten Simple Rules for Reproducible Computational Research and Ten Simple Rules for Cultivating Open Science and Collaborative R&D as well as other somewhat related articles, but none of them seem to touch upon Jupyter notebooks.
Some comments relevant for here are also in https://github.com/sparcopen/open-research-doathon/issues/41#issuecomment-284239044 .
The above close was just as part of the wrap-up of the doathon. I will keep working on it and document my progress over at https://github.com/Daniel-Mietchen/ideas/issues/2 .
@mrw34 @Daniel-Mietchen excellent write-up!
If licensing allows could you upload somewhere all the .ipynb notebooks you found that were related to those 107 papers?
@rossmounce I plan to do that but haven't yet checked for licensing (added column AH for that).
The notebook URLs are collected in column AD.
Mark's write-up is now up at https://markwoodbridge.com/2017/03/05/jupyter-reproducible-science.html .
There is a validator tool for Jupyter notebooks: https://github.com/jupyter/nbformat/blob/master/nbformat/validator.py
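For reference, a minimal sketch of using it on a downloaded notebook (the filename is just a placeholder):

import nbformat
from nbformat.validator import validate, ValidationError

path = "example.ipynb"  # placeholder; substitute a notebook downloaded from PMC

# Read the notebook, upgrading it to the current (v4) schema if needed.
nb = nbformat.read(path, as_version=4)

try:
    validate(nb)  # raises ValidationError if the JSON violates the notebook schema
    print("Notebook is schema-valid.")
except ValidationError as err:
    print("Schema problem:", err)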
I am thinking of submitting this to JupyterCon — submission deadline March 14. Anyone in?
@Daniel-Mietchen I'd be happy to help you prepare the abstract submission and do a bit more analysis, but I can't go to the meeting :) Does that count as 'in'?
That's "in enough" for my taste. Don't know whether I can go either, but the point is to reach out to the Jupyter community and to help do something about these issues, e.g. by refining the recommendations and perhaps offering some validation mechanism (think Schematron for XML).
I can help with setting up a minimal "best practice" git repo including a notebook, a requirements file, and a script to automate testing the notebook whenever it's pushed. This would work well with GitLab, who have a free CI and build-status notification service.
By the way, I'm not sure that "validation" has much of a role to play here, at least not in the sense of XML/JSON validation (i.e. nbformat) - we didn't find any notebooks that were syntactically incorrect. I think that execution (with nbconvert --execute) is likely the only way to test the notebooks for the failures we're interested in. Some kind of static analysis would obviously be interesting to categorise problems, but rather hard?
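For concreteness, a minimal sketch of that execution-based check in Python, using nbconvert's ExecutePreprocessor (the filename, timeout and kernel name are placeholders):

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

path = "example.ipynb"  # placeholder; substitute a notebook from the spreadsheet
nb = nbformat.read(path, as_version=4)

# Run every cell in order, roughly what `jupyter nbconvert --execute` does.
ep = ExecutePreprocessor(timeout=600, kernel_name="python3")
try:
    ep.preprocess(nb, {"metadata": {"path": "."}})
    print("All cells executed without error.")
except Exception as err:  # an error in any cell aborts the run
    print("Execution failed:", err)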
@mrw34 that sounds great. Happy to use GitLab for that. Yes, "validation" is probably the wrong word - just feeling my way into this sphere.
Not sure about categorizing the problems yet, but I am aiming at fixing and documenting all issues with all of these notebooks over the next month or so.
tbh, should we really expect all 'found' .ipynb notebooks to be reproducible? I think not.
As an author of an .ipynb notebook associated with a preprint: https://peerj.com/preprints/773/
I can say I knew when I published it that it wouldn't be reproducible by others, but that wasn't the point. The content/data I was working on was under copyright and thus couldn't be shared with the paper or notebook. So I knowingly and purposefully used an .ipynb notebook to document the metadata of what I had access to and the operations I ran on those copyrighted files, demonstrating transparency and "potential reproducibility" (if the reproducers had legitimate access to the same content I had).
We should probably bear edge-cases like this in mind before judging authors too harshly on the apparent widespread phenomenon of non-reproducible .ipynb notebooks. Full reproducibility is not easy!
Yep, would be interesting to see how many in the list are like that.
Ultimately the key is probably providing a zero-friction way for authors to opt into continuous verification of notebooks where appropriate.
@rossmounce @mrw34
I agree - we should not expect all Jupyter notebooks to be reproducible, since there are always exceptions like the ones you outlined. Some of these may be avoidable (e.g. some dependencies on proprietary libraries), others not (e.g. datasets that just are not public).
For me, the point of the exercise was to document the extent to which the notebooks are reproducible (both individually and as a group), as well as the barriers that pop up on the way, and to think about lowering those barriers for notebooks shared in the future.
I am reopening this, since it seems we're still active here.
@Daniel-Mietchen let me know if/how I can assist with this. I'm based in Boston, so could probably make it down to JupyterCon if needed.
Repository created: https://gitlab.com/mwoodbri/jupyter-ci/tree/master
Verification is performed using nbconvert --execute.
@mrw34 Thanks - that looks promising. Have never pushed anything to GitLab, but will try as soon as possible.
How do you deal with "could not find kernel for Python 2" type errors?
@Daniel-Mietchen It assumes Python 3. The author would need to specify Python 2 in the .yaml file if necessary.
A third-party service would need to deduce the correct version from the .ipynb file, which I think is possible, at least for v4 format notebooks.
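For what it's worth, here is a rough sketch of reading that metadata with nbformat (fields can be missing in older or hand-edited notebooks, hence the fallbacks; the filename is a placeholder):

import nbformat

nb = nbformat.read("example.ipynb", as_version=4)  # placeholder filename

kernelspec = nb.metadata.get("kernelspec", {})
language_info = nb.metadata.get("language_info", {})

print("Kernel name: ", kernelspec.get("name", "unknown"))
print("Display name:", kernelspec.get("display_name", "unknown"))
print("Language:    ", language_info.get("name", "unknown"))
print("Version:     ", language_info.get("version", "unknown"))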
Just found out about a project that aims specifically at promoting the use of Jupyter notebooks for reproducible science: https://github.com/Reproducible-Science-Curriculum .
While our PubMed search found a number of false positives (i.e. papers with no associated notebook), there are also some false negatives (i.e. papers with an associated notebook that do not show up in the search).
I just found this example: http://nbviewer.jupyter.org/gist/pschloss/9815766/notebook.ipynb , which is associated with https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4139711/ but not mentioned there.
Hmm. Looks like an issue with the PubMed Central version. The source does include a reference to the notebook: http://www.nature.com/nature/journal/v509/n7500/full/nature13178.html
@mrw34 says: "The Dockerfile refers to and downloads a couple of tarballs from third-party sources. This is perfectly acceptable, especially as some dependencies may ultimately not be redistributable, and it reduces the size of the source repository, but does assume that these links remain valid in the long term."
Before the web at large fully embraces content-based addressing, we could formulate a recommendation to the authors to use the Wayback Machine or a similar mechanism. So when referencing tarballs etc., instead of referring to the live copy on the web (which is subject to bitrot), the authors should refer to a version saved in the Web Archive which is less ephemeral (not completely permanent but better than linking to the live content). There are some downsides (e.g. if people use "naive" upload to web archive via their website, large files will time out) but it could generally work.
But what about those who do NOT follow the recommendation? Well, one approach would be to write a bot (cc: @Daniel-Mietchen) that crawls new notebooks and auto-saves external dependencies to the Web Archive. Another simple run-time tool could be also written that would pre-process Dockerfiles, check for (missing) external dependencies and update the links with the Web Archive reference. Perhaps the Archive Team could also help with this?
This is just initial thinking about possible ways to address the issue of broken dependencies. If you like the idea, feel free to open a new issue and develop this further.
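To make the idea a bit more concrete, here is a very rough sketch (not a finished bot) that asks the Wayback Machine to snapshot a list of external dependencies. The dependency URL is made up, and reading the Content-Location header is a guess at the service's usual behaviour:

import requests

# Hypothetical list of external dependencies extracted from a Dockerfile or notebook.
dependencies = [
    "https://example.org/data/reference-genome.tar.gz",
]

for url in dependencies:
    # Ask the Wayback Machine to take a snapshot of the live resource.
    resp = requests.get("https://web.archive.org/save/" + url, timeout=120)
    # The snapshot path is usually reported in the Content-Location header.
    snapshot = resp.headers.get("Content-Location")
    if snapshot:
        print("Archived:", "https://web.archive.org" + snapshot)
    else:
        print("Archive request sent for", url, "- status", resp.status_code)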
I did not read through all of it, but from Mark's blogpost I assume that the issue is not yet closed.
"There’s no means for notebooks to directly express the dependencies they require."
Sebastian Raschka (@rasbt) wrote a Jupyter magic extension to document dependencies etc.: https://github.com/rasbt/watermark . I use this when I publish my notebooks; see https://github.com/schmelling/reciprocal_BLAST/blob/master/notebooks/2_KaiABC_BLAST_Heatmap.ipynb for an example.
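Typical usage looks roughly like this, inside a notebook cell (the package list is just an example):

# Inside a notebook cell:
%load_ext watermark
%watermark -v -m -d -p numpy,scipy,pandas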
I hope that is useful to you.
Cheers, Nic
If you'd like to point to an example of a repo that is doing continuous integration testing of notebooks, here is one (shameless self-promotion):
https://github.com/clawpack/riemann_book
It's not for a paper; we're writing a book in Jupyter notebooks. We have a reasonably complex chain of dependencies so setting up continuous integration is non-trivial. We're not currently testing anything for correctness; we're just testing that all notebooks run without errors in Python 2 and 3. See
https://github.com/clawpack/riemann_book/blob/master/test.py
for the code that executes the notebooks.
@ketch Nice! The scikit-bio cookbook takes a similar approach: https://github.com/biocore/scikit-bio-cookbook
@ketch I'm using something similar as well for my repos (e.g., https://github.com/rasbt/deep-learning-book/blob/master/code/tests/test_notebooks.py) and for repos where I have documentation stored in Jupyter notebooks that gets converted to Markdown/HTML for the websites.
I have not experimented with "real" unit tests yet, though. So far, it is just checking whether a cell throws an error or not (which will cause the build to fail via Travis CI). However, it would probably be better to somehow check whether e.g. Py 2.7 and 3.6 produce the same output in the cells (i.e., wrt integer division and the like). I guess that's more involved though ... has anyone tried that yet?
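One crude way might be to execute the notebook once per kernel, save both copies, and then diff the text outputs cell by cell. A rough sketch, assuming the two executed copies were saved as out_py2.ipynb and out_py3.ipynb (those names are made up):

import nbformat

def text_outputs(path):
    # Collect the textual outputs of each code cell, in order.
    nb = nbformat.read(path, as_version=4)
    collected = []
    for cell in nb.cells:
        if cell.cell_type != "code":
            continue
        texts = []
        for out in cell.get("outputs", []):
            if out.output_type == "stream":
                texts.append(out.text)
            elif out.output_type in ("execute_result", "display_data"):
                texts.append(out.get("data", {}).get("text/plain", ""))
        collected.append("".join(texts))
    return collected

py2, py3 = text_outputs("out_py2.ipynb"), text_outputs("out_py3.ipynb")
for i, (a, b) in enumerate(zip(py2, py3)):
    if a != b:
        print("Code cell", i, "differs between the two kernels")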
@rasbt Sounds like we are at the same stage in this. There are a couple of packages designed to help test notebooks:
https://github.com/computationalmodelling/nbval https://github.com/bollwyvl/nosebook https://gist.github.com/minrk/2620735
I haven't tried any of them yet.
@ketch Thanks for the reference! The script looks neat; I'll have to give it a try one day (I was almost tempted to write it from scratch ;))
@rossmounce @npscience @mrw34 @tompollard @jangondol @rasbt @ketch @schmelling: The deadline for submissions to JupyterCon is about 1 day away, so I will work on it tonight.
Pointers in https://github.com/Daniel-Mietchen/events/blob/master/JupyterCon-2017.md and the associated Gdoc. You're all most welcome to join in.
Another potential forum for presentation: https://www.software.ac.uk/c4rr/cfp
Couldn't get to it earlier, but am in the Gdoc right now to draft a submission (6h left). Happy to look into other contexts for presentation as well.
The PMC / Europe PMC numbers are up to 111 and 113 hits, respectively.
I'm done with a first pass for all form fields. Will read Mark's post again and see whether that gives reason to change anything.
There are currently 73 hits for the Google search site:arxiv.org ipynb .
I have submitted: https://github.com/Daniel-Mietchen/events/blob/master/JupyterCon-2017.md .
An additional article for which Jupyter notebooks are available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4428227/ . It states
The empirical and simulated data, as well as the python code used for simulation and analyses, have been deposited in the Dryad repository (datadryad.org; doi:10.5061/dryad.ht0hs upon publication).
At that DOI, there is a ZIP file described as
IPython notebooks containing the simulation and analyses code for the manuscript (named simulations.ipynb and analyses.ipynb respectively) and empirical and simulated data used in the manuscript.
Jupyter notebooks are a popular vehicle these days to share data science workflows. To get an idea of best practices in this regard, it would be good to analyze a good number of them in terms of their reproducibility and other aspects of usability (e.g. documentation, ease of reuse).
A search in PubMed Central (PMC) currently yields just 102 hits, so a systematic reanalysis seems entirely doable and could perhaps itself be documented by way of reproducible notebooks that might eventually end up being mentioned in PMC.
A good starting point here could be "An open RNA-Seq data analysis pipeline tutorial with an example of reprocessing data from a recent Zika virus study", for which both a Jupyter notebook and a Docker image are available.
I plan to give a lightning talk on this. Some background is in this recent news piece.