The original PMC query
now gives 115 results, more than 10% growth in less than a month, so our project here seems timely.
One of the additional results is Detecting High-Order Epistasis in Nonlinear Genotype-Phenotype Maps, with an accompanying Jupyter notebook at https://github.com/harmslab/notebooks-nonlinear-high-order-epistasis and a Binder version at http://mybinder.org/repo/harmslab/notebooks-nonlinear-high-order-epistasis , both apparently produced for the preprint version of the paper. I just ran the notebook for their Figure 2, which qualitatively reproduces all three components of Fig. 2 in the final paper, though with interesting deviations in presentation - namely axis scale, axis labels, and coloring - that I haven't yet looked at in detail.
I have had good success with re-running Binder versions of notebooks in the past. The first one I really dove into was this one — I couldn't resist giving it another try right now, and it still worked fine.
Some useful background on Binder is in http://ivory.idyll.org/blog/2016-mybinder.html and http://tritemio.github.io/smbits/2016/03/22/binder-fretbursts/ .
Sadly, a PMC search for
currently does not bring up any results.
I also tried http://mybinder.org/repo/margudo/lssgalpy from https://arxiv.org/abs/1702.04268 , which gave "failed!", i.e. the container couldn't be created.
Just came across https://twitter.com/o_guest/status/832532349988589568 , which links to https://openlab-flowers.inria.fr/uploads/default/original/1X/65addc14bb2a6a7feaf7690865fa3708d5b0990f.pdf , which contains a set of brief articles around computational reproducibility.
A very nice initiative to put rep*bility to a real test!
Unfortunately I am not surprised by the outcome. For two similar recent experiments (though not about notebooks) with similar results, see https://github.com/ReScience/ReScience/issues/43 and https://f1000research.com/articles/6-124/v1#referee-response-20292.
Note that, contrary to frequent claims, notebooks are not a rep*bility tool at all, but a documentation tool. In terms of rep*bility, notebooks are exactly equivalent to the Python script that you obtain by stringing together the code cells. They are perhaps even a bit worse, because it is easier (and thus done with less hesitation) to insert shell commands, which almost always fail on another machine.
The problems that you encountered are exactly those that motivated my ActivePapers project: (1) keeping track of dependencies (code AND data) is too difficult with today's tools, and therefore mistakes are the norm rather than the exception, and (2) for code, knowing the dependencies is not sufficient to get them to work with reasonable effort. For the details, see a recent webinar about rep*bility issues in scientific publication.
The recommendations given at the end of your summary are probably the best one can give today, but they won't fix the problem as long as there are no support tools to help with respecting them. Enabling rep*ble results with reasonable effort from the authors requires a toolchain that encourages/enforces good habits right from the start, long before publication comes into the authors' focus. We don't have such a toolchain today.
An aspect that's worth discussing in this context is that notebooks actually encourage "bad" behavior today. I think that can be fixed, but not before people become aware of it. The fundamental problem with notebooks is that they address two different tasks and, in doing so, obscure in the mind of scientists the very different priorities that these two tasks require. One task is performing interactive data analysis; the other is publishing a polished narrative about this analysis once it is done. You want maximum flexibility in the first task, including "shelling out", using local Python modules under development, and whatever else it takes to explore your data quickly. But you need to get rid of all that when editing a notebook into a narrative, and there is not enough support for this today.
A simple notebook validator could help a lot here: scan a notebook for shell commands and for access to files outside the current working directory, and put a clearly visible warning on the code cells that are affected. One step up would be to scan for Python modules that are not properly installed in some virtual environment. Once you know that all your code is in a single virtual environment and all your data is inside a single directory, it becomes much easier to package everything for publication.
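To make that idea concrete, here is a minimal sketch of such a validator using nbformat; the two heuristics (lines starting with ! or %, and quoted Unix-style absolute paths) are my own assumptions rather than an existing tool:

```python
# Sketch of the notebook validator described above; the heuristics are
# illustrative assumptions, not an existing tool.
import re
import sys
import nbformat

SUSPICIOUS = [
    (re.compile(r"^\s*[!%]", re.MULTILINE), "shell or magic command"),
    (re.compile(r"""["']/"""), "Unix-style absolute file path"),
]

def check_notebook(path):
    nb = nbformat.read(path, as_version=4)
    for index, cell in enumerate(nb.cells):
        if cell.cell_type != "code":
            continue
        for pattern, reason in SUSPICIOUS:
            if pattern.search(cell.source):
                print(f"{path}, code cell {index}: possible portability problem ({reason})")

if __name__ == "__main__":
    for notebook_path in sys.argv[1:]:
        check_notebook(notebook_path)
```

Running something like this over the notebooks in a repository would at least make the affected cells visible before publication.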
Python modules that are not properly installed in some virtual environment.
This was exactly my problem in https://twitter.com/ThomasArildsen/status/844688697920491520
It just occurred to me that some of the verification code in the ActivePapers Python edition could be adapted to do this kind of scan during the execution of a notebook (or of a plain Python script). In terms of usage, this would require something like
import reproducibility
reproducibility.set_project_directory("/path/to/my/data")
at the beginning of the script/notebook. Moreover, such a package could provide a list of all Python modules that were used with something like
reproducibility.show_used_modules()
at the end of the script/notebook. Alternatively, it could raise an exception when modules outside of the current virtual environment are used.
As always with Python, there is no way to make such checks absolutely certain, in particular since they cannot check what happens in extension modules. But it might still be much better than nothing. Opinions?
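As a rough illustration of what such a package could look like (a sketch that only borrows the function names suggested above; it is not ActivePapers code, and the environment check is a simple heuristic):

```python
# Hypothetical sketch of the "reproducibility" helper discussed above;
# only the function names come from the comment, the rest is an assumption.
import os
import sys
import sysconfig

_PROJECT_DIR = None

def set_project_directory(path):
    """Record the only directory the analysis is supposed to read from or write to."""
    global _PROJECT_DIR
    _PROJECT_DIR = os.path.abspath(path)

def show_used_modules():
    """List all imported modules, flagging those loaded from outside the
    current environment's library directories (a heuristic, not a guarantee)."""
    paths = sysconfig.get_paths()
    env_prefixes = (paths["purelib"], paths["stdlib"])
    for name, module in sorted(sys.modules.items()):
        module_file = getattr(module, "__file__", None)
        if module_file is None:  # built-in or namespace module
            continue
        inside = any(os.path.abspath(module_file).startswith(prefix)
                     for prefix in env_prefixes)
        print(name, module_file, "" if inside else "<-- outside current environment")
```

An exception-raising variant would simply replace the print with a raise for anything flagged as outside the environment.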
Thanks, @khinsen and @ThomasA, for chiming in here. The pointers in your posts sent me off on a long reading chain this morning, which provided me with additional perspectives on these issues — thanks for that.
Your ActivePapers project looks interesting, and using something like this for some basic verification would probably be a good step forward — have you seen @mrw34's CI for Jupyter verification?
@khinsen a Ph.D. student of mine, @Chroxvi, has also implemented some functionality in our Magni package that sounds a bit similar: https://github.com/SIP-AAU/Magni/tree/master/magni/reproducibility - http://magni.readthedocs.io/en/latest/magni.reproducibility.html
@Daniel-Mietchen Yes, I have seen this and other CI-based approaches. They are great when applicable but frustrating when they aren't - as with all technology-based solutions. For example, @mrw34's approach requires all code dependencies to be pip-installable, all data dependencies to be downloadable or part of the repository, and the whole analysis to run within whatever resource limitations GitLab's CI imposes.
ActivePapers takes an inverse approach to CI: guarantee that if the computation runs to completion, it is also reproducible. This is particularly valuable for long-running computations. Of course ActivePapers has technological limitations as well, in particular the restriction to pure Python code.
The parts that could be useful outside of the full ActivePapers framework are the ones that restrict the modules one is allowed to import and the data files one is allowed to access.
@ThomasA Yes, the code by @Chroxvi explores a similar approach. From the docs it seems to be a bit specialized to your environment. Do you think it could be generalized (not relying on Magni, not relying on Conda)?
@khinsen One of the motivations behind our verification exercise here was to come up with recommendations for how to share software, and Jupyter notebooks in particular.
Having "all data dependencies to be downloadable or part of the repository" seems like a good recommendation to me, and being able to show that your a particular piece of software complies with that criterion is helpful. Yes, there are cases when such an approach is not applicable, but I still think such a recommendation would be a better recommendation than the lack of recommendations that we currently have.
As for pip-installability, this obviously makes sense only for Python dependencies, while Jupyter notebooks can contain code or dependencies in multiple other languages, and there are of course several Python package managers. Still, I think it would be good if more of the code shared through articles in PMC that is pip-installable in principle actually demonstrated this pip-installability. And if people intending to share their code (e.g. as per https://github.com/BjornFJohansson/pygenome/issues/1 ) were made more aware of these issues and of the tools available, the tools and their usage could be improved in response to community needs, e.g. beyond pip or GitLab.
As an aside, I just came across the "Repeatability in Computer Science" project (from 2015) over at http://reproducibility.cs.arizona.edu/ , which set a much lower bar for replicability than we did here but had similar observations: http://reproducibility.cs.arizona.edu/v2/index.html . I assume this is known to some people in this thread - just adding it here to keep all reproducibility-related information in this repo in one place. I know the thread is getting long and unwieldy, so I'm also thinking of branching it out somewhere, so as to provide a simpler way of getting an overview of where we stand.
@Daniel-Mietchen First of all, I didn't mean my list of restrictions of @mrw34's approach as a criticism. Every technology has limitations. If there's anything to criticize, it's that the restrictions are not listed explicitly, leaving the task of figuring them out to potential users.
Having all dependencies downloadable or packaged with the notebook is indeed a decent compromise given today's state of the art. Recommending it is OK in a domain where it is most probably applicable. The same goes for pip-installability, although I'd expect its applicability to be limited almost everywhere, given how many Python packages depend on C libraries.
The problem with downloadable data in the long run is that it requires baking in a URL into the notebook. Five years from now, that URL is probably stale. More stable references, such as DOIs, don't permit direct downloading today. So today everyone has to choose between ease of replication and long-term availability. It isn't obvious that one or the other choice is to be preferred in general.
I live in a world where datasets are a few GB in size and processing requires a few hours using 10 to 100 processors on a parallel computer. These machines often have network restrictions that make downloading data from the Internet impossible. I mention this just to illustrate that no recommendations can ever be absolute - there is too much diversity in scientific computing.
Hi. Daniel pointed me at this thread.
When you say that "DOIs don't permit direct downloading today", I'm not sure if the word "permit" here is in reference to:
the convention that Crossref/DataCite DOIs tend to resolve to a landing page instead of "the thing itself",
or access restrictions (e.g. paywalls or login requirements) on the content itself,
or both.
As far as I can tell, parties appear unanimous on the fact that, barring privacy/confidentiality issues, data should be made available openly. Access control issues should be minimal.
And DOIs would allow direct downloading today if, when possible and practical, those registering DOI metadata included links to direct downloads in their Crossref or DataCite metadata. At Crossref we are increasingly seeing publishers registering full-text links in their metadata. For examples, see the text-mining links in these metadata records from PeerJ:
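(The example records themselves are not reproduced above; as an illustration of how such link metadata can be inspected programmatically, here is a minimal sketch against the public Crossref REST API. The DOI is a placeholder, and the exact set of link fields returned depends on the record.)

```python
# Minimal sketch (not Crossref tooling) for inspecting the "link" entries in a
# Crossref metadata record; the DOI below is a placeholder.
import requests

def full_text_links(doi):
    response = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
    response.raise_for_status()
    return response.json()["message"].get("link", [])

for link in full_text_links("10.7717/peerj.0000"):  # placeholder DOI
    print(link.get("intended-application"), link.get("content-type"), link.get("URL"))
```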
@gbilder My reference was to the landing-page issue. It's good to hear that people are discussing solutions, but as far as I know it remains impractical today to download a dataset automatically given a DOI.
@khinsen @gbilder I guess best-practice here involves assigning DOIs to datasets (and not just their parent publication, if any), and resolving any ambiguity over if/how the data itself can be automatically retrieved given the relevant DOI. Lots more on the latter here: https://jcheminf.springeropen.com/articles/10.1186/s13321-015-0081-7
I was just pinged about another validation tool:
Here's an interesting blog post on what can go wrong in terms of reproducibility (with a focus on R): http://blog.appsilondatascience.com/rstats/2017/03/28/reproducible-research-when-your-results-cant-be-reproduced.html .
A demo for the original issue: https://www.ncbi.nlm.nih.gov/pubmed/27583132 . If you click on the LinkOut link: would adding the Jupyter and Docker links there be helpful?
More on reproducibility, including a few Jupyter mentions: https://www.practicereproducibleresearch.org
@DCGenomics I think making the Jupyter, Docker, mybinder etc. versions of the code more discoverable is useful in principle, but conventional LinkOut (which is not shown by default) may not be the best mechanism to do this.
What I could imagine is a mechanism similar to the way images are currently being presented in PubMed, i.e. something that is shown by default if the paper comes with code shared in a standard fashion. That standard would have to be defined, though.
While necessary for reproducibility, discoverability alone is not sufficient, and this example paper highlights that, as explained in Mark's initial write-up.
There is a JupyterCon talk about citing Jupyter notebooks. I have contacted the speakers.
I'm not going to read all of this because it's long. But this is a cool idea and a neat dataset. @eseiver mentioned you wanted to know how to read in notebooks, for which you can use nbformat, specifically nbformat.read(), which you should probably use inside a context manager.
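For instance, a minimal usage example (the file name is a placeholder):

```python
# Minimal example of reading a notebook with nbformat inside a context manager.
import nbformat

with open("analysis.ipynb", encoding="utf-8") as f:  # placeholder file name
    nb = nbformat.read(f, as_version=4)

kernel = nb.metadata.get("kernelspec", {}).get("name", "unknown")
print(len(nb.cells), "cells, written for kernel:", kernel)
```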
At the WikiCite 2017 hackathon today, we made some further progress in terms of making this analysis itself more reproducible — a Jupyter notebook that runs the Jupyter notebooks listed in our Google spreadsheet and spits out the first error message: http://paws-public.wmflabs.org/paws-public/995/WikiCite%20notebook%20validator.ipynb . @mpacer - yes, it makes use of nbformat.read()
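(For reference, the core of such an executability check can be sketched in a few lines with nbconvert's ExecutePreprocessor; this is not the PAWS notebook itself, and the notebook paths below are placeholders.)

```python
# Sketch of "run a notebook and report the first error"; not the PAWS
# validator itself, and the notebook paths are placeholders.
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor, CellExecutionError

def first_error(path, timeout=600):
    with open(path, encoding="utf-8") as f:
        nb = nbformat.read(f, as_version=4)
    executor = ExecutePreprocessor(timeout=timeout, kernel_name="python3")
    try:
        executor.preprocess(nb, {"metadata": {"path": "."}})
    except CellExecutionError as error:
        return str(error).splitlines()[0]  # first line of the error report
    return None

for notebook_path in ["example1.ipynb", "example2.ipynb"]:  # placeholders
    error = first_error(notebook_path)
    print(notebook_path, "->", error or "ran without errors")
```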
We also looked at Jupyter notebooks cited from Wikipedia — notes at https://meta.wikimedia.org/wiki/WikiCite_2017/Jupyter_notebooks_on_Wikimedia_sites .
Hello! I'm part of the team that's working on beta.mybinder.org and related stuff, and am extremely interested in the idea of a 'badge that represents that your code is reproducible, and has been reproduced by CI'. Funnily, I also built the PAWS stuff, which is awesome to find in completely unrelated contexts :D
The part of the stack I've been attacking right now is the 'how do we reproduce the environment that the analysis took place in', as part of the mybinder work. You can see the project used for that here: https://github.com/jupyter/repo2docker. It takes a git repository and converts it into a Docker image, using conventions that should be easy to use for most people (and does not require them to understand or use Docker unless they want to). It's what powers the building bits of mybinder :)
As part of the CI for that project, you can see that we also build and validate some external repositories that are popular! We just represent these as YAML files here: https://github.com/jupyter/repo2docker/tree/master/tests/external and have them auto test on push so we make sure we can keep building them. This can be inverted too - in the repo's CI they can use repo2docker to make sure their changes don't break the build.
The part where we haven't made much progress yet is in actual validation. nbval mentioned here seems to be the one I like most - it integrates into pytest! We can possibly integrate repo2docker into pytest too, and use that to easily validate repos? Lots of possible avenues to work towards :)
One of the things I'd love to have is something like what https://www.ssllabs.com/ssltest/analyze.html?d=beta.mybinder.org does for HTTPS on websites - scores you on a bunch of factors, with clear ways of improving it. Doing something like that for git repos with notebooks would be great, and I believe we can do a fair amount of work towards it now.
I'll also be at JupyterCon giving a few talks, and would love to meet up if any of you are going to be there!
/ccing @choldgraf who also does a lot of these things with me :)
Hi!
@Daniel-Mietchen pointed me at this thread/project yesterday, and it seems quite interesting.
I wonder if it makes sense to think about short term and long term reproducibility for notebooks?
By short term, I mean that the notebook might depend on a Python package that has to be installed, which could be done with pip before running the notebook; this step could perhaps be automated by a notebook launcher.
By long term, I mean that at some point the dependency will not work, pip will be replaced by something new, etc., and the only way to solve this is to capture the full environment. This seems similar to what @yuvipanda describes, and to what @tanumalik is trying to do a bit differently in https://arxiv.org/abs/1707.05731 (though I don't think her code is available). And long term here might still have OS requirements, so maybe I really mean medium term.
Also, I thought I would cc some other people who I think will be interested in this topic, and could perhaps point to other work done in the context of making notebooks reproducible: @fperez @labarba @jennybc @katyhuff
@khinsen - sorry I lost track of this thread back in March... Yes, I think @Chroxvi's reproducibility subpackage can be carved out of the Magni package, and I actually hope to do that sometime this fall. I hope to do that along with @ppeder08's validation subpackage (https://github.com/SIP-AAU/Magni/tree/master/magni/utils/validation), which can be used for input/output validation of, for example, somewhat more abstract "data types" than Python's built-in types.
Hi Dan,
Our code is available from https://bitbucket.org/geotrust/sciunit-cli
Current documentation is available from http://geotrusthub.org/geotrust_html/GeoTrust.html
Yes, ours is restricted to Linux for now. We have a bare-bones version for Mac OS X that is not in production use.
We are currently working on enabling reproducibility for workflows and Jupyter notebooks through application virtualization. Some more work is still needed to capture standard I/O.
Tanu
Thanks for the additional comments. I have proposed to work on this further during the Wikimania hackathon: https://phabricator.wikimedia.org/T172848 .
I got sick during the hackathon and haven't fully recovered, but JupyterCon is just days away, so I have started to distill the discussion here into an outline for the talk next Friday: https://github.com/Daniel-Mietchen/events/blob/master/JupyterCon-2017.md#outline-of-the-talk . I will work on it from Tuesday onwards, and your contributions are, as always, welcome.
After chatting with @Daniel-Mietchen about this idea, we've implemented a web app that automatically runs the notebooks mentioned in a paper.
Just add a list of paper URLs, like https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5322252/ https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4965465/ and run the executability validator.
It is a pre-pre-pre-alpha version done for fun and in the name of reproducibility. Please report all issues and suggest improvements.
The current setup might require additional horsepower to consume bigger datasets. We also plan to implement whole-repo autodeployment; at the moment, too many runs fail because this feature is missing.
List of current issues:
https://github.com/sciAI/exe/blob/master/Executability%20issues.md
Validator code:
All the credit goes to the sci.AI team, and especially to @AlexanderPashuk. Alex, thank you for the effort and for the fights with library compatibility.
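For readers who want to experiment with the same idea locally, here is a rough sketch (my own illustration, not the sci.AI validator) of the first step such a tool needs: collecting the GitHub notebook or repository links mentioned on a PMC article page.

```python
# Rough sketch (not the sci.AI code): scrape a PMC article page for links to
# GitHub-hosted notebooks or repositories.
import re
import requests

GITHUB_LINK = re.compile(r"https://github\.com/[\w.-]+/[\w.-]+(?:/[^\s\"'<>]*)?")

def notebook_links(pmc_url):
    html = requests.get(pmc_url, timeout=30,
                        headers={"User-Agent": "notebook-link-scan/0.1"}).text
    links = set(GITHUB_LINK.findall(html))
    notebooks = sorted(link for link in links if link.endswith(".ipynb"))
    return notebooks or sorted(links)

for url in ["https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5322252/"]:
    print(url)
    for link in notebook_links(url):
        print("  ", link)
```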
@yuvipanda, nice job. High five! Dan mentioned you are at the conference now, right? If you are interested, we can combine efforts.
We're both at the conference now and will be at the hack sessions tomorrow!
I was traveling during the hackathon - I had heard about it too late. In any case, I hope we can combine efforts with the Binder team. For those interested, the talk sits at https://github.com/Daniel-Mietchen/events/blob/master/JupyterCon-2017.md#outline-of-the-talk , and the video should hopefully become available in a few weeks.
Seen at JupyterCon: https://github.com/jupyter/repo2docker , a tool that can dockerize a git repo and provide a Jupyter notebook to explore the container.
Apologies for reviving a closed issue. I am also interested in the reproducibility badge (which is not necessarily the same as the Binder badge). I came across @mwoodbri's jupyter-ci. Are there any plans for this to be used, or people who currently use it?
(Also cc'ing @yuvipanda as he has been involved in repo2docker.)
@cgpu Hi! At the time, jupyter-ci was an answer to "what's the simplest possible way to generate a badge and notify of validation failure?". Simply porting the project to GitHub and GitHub Actions (updating to latest Jupyter best practice in the process, if necessary) would be a great start. But a more general solution involving repo2docker and/or Binder would be even better!
Hi @mwoodbri, thank you for the prompt response and the background information! I am really fond of the idea of having a binary "does it reproduce" badge of honour, and jupyter-ci sounds exactly like that. I am trying to reproduce publications accompanied by .ipynb files as part of a workshop exercise, and I am really struggling. It would be nice to know beforehand which ones are worth investing time in, hence the interest. Additionally, it would be nice to have/create a hub or "awesome" repo listing only the ones that are reproducible (verified by CI).
jupyter-ci is currently only on GitLab; I know it's the same in essence, but the community is much more active on GitHub, so if you plan to maintain the repo and the idea long term, it would be nice to bring it over here. Just a thought.
Thanks once again!
@cgpu Here's a version converted to use GitHub Actions: https://github.com/mwoodbri/jupyter-ci
@mwoodbri thank you! Time for me to test now :)
@Daniel-Mietchen thank you for providing the reproducibility cafe space for further discussions 2 years after the start of the initiative. Feel free to close this.
Hello everyone. It has been a while since the last post in this thread, but I am happy to report that there is now a preprint that reports on a reproducibility analysis of the Jupyter notebooks associated with publications available via PubMed Central: Computational reproducibility of Jupyter notebooks from biomedical publications — joint work with @Sheeba-Samuel . Here is the abstract:
Jupyter notebooks allow executable code to be bundled with its documentation and output in one interactive environment, and they represent a popular mechanism to document and share computational workflows, including for research publications. Here, we analyze the computational reproducibility of 9625 Jupyter notebooks from 1117 GitHub repositories associated with 1419 publications indexed in the biomedical literature repository PubMed Central. 8160 of these were written in Python, including 4169 that had their dependencies declared in standard requirement files and that we attempted to re-run automatically. For 2684 of these, all declared dependencies could be installed successfully, and we re-ran them to assess reproducibility. Of these, 396 notebooks ran through without any errors, including 245 that produced results identical to those reported in the original. Running the other notebooks resulted in exceptions. We zoom in on common problems and practices, highlight trends and discuss potential improvements to Jupyter-related workflows associated with biomedical publications.
For data and code, see https://doi.org/10.5281/zenodo.6802158 .
I'll keep this thread open until the paper is formally published, and invite your comments in the meantime. Extra pings to some of you who have contributed to this thread before: @mwoodbri @khinsen @yuvipanda @rossmounce @tompollard @RomanGurinovich @choldgraf @JosephMcArthur .
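For readers who want to try a single repository from this kind of analysis by hand, here is a much-simplified sketch of the install-and-re-run step described in the abstract. It is not the actual study pipeline (that is in the Zenodo record above); the repository and notebook names are placeholders, and a Unix-style virtual-environment layout is assumed.

```python
# Much-simplified sketch of "install declared dependencies, then re-run the
# notebook"; not the study's actual pipeline. Assumes a Unix-like system.
import subprocess
import sys
from pathlib import Path

def rerun(repo_dir, notebook, requirements="requirements.txt"):
    repo = Path(repo_dir)
    env = repo / ".repro-env"
    subprocess.run([sys.executable, "-m", "venv", str(env)], check=True)
    pip = env / "bin" / "pip"
    python = env / "bin" / "python"
    # Install the declared dependencies plus Jupyter itself for execution.
    subprocess.run([str(pip), "install", "-r", str(repo / requirements), "jupyter"],
                   check=True)
    result = subprocess.run(
        [str(python), "-m", "jupyter", "nbconvert", "--to", "notebook", "--execute",
         "--output", "rerun-" + notebook, notebook],
        cwd=repo, capture_output=True, text=True)
    return result.returncode == 0, result.stderr

ok, log = rerun("some-repo", "analysis.ipynb")  # placeholder repo and notebook
print("re-ran without errors" if ok else log)
```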
Congrats @Daniel-Mietchen, excellent! I look forward to reading the paper, and will be sure to include it in my reading list for next year's reproducible research course I teach at Berkeley!
cc @facusapienza21.
Lots of Twitter interest in this preprint. It might be good to dig a bit deeper into all those unknown dependency-resolution issues, or perhaps just feature a couple of examples as a panel. There were also some comments regarding Docker.
Thanks @Daniel-Mietchen for the update! The preprint is on my e-book reader.
Holy shit this is AWESOME!
We're still working on the revision of the paper but here are the slides of our JupyterCon 2023 talk: https://doi.org/10.5281/zenodo.7854503 .
Slide 23 — How you can get involved — asks for community input along a number of dimensions, which I am copying here.
We recently submitted the revision of the paper — see https://arxiv.org/abs/2308.07333 for the latest preprint, which describes a complete re-run of our pipeline and provides some more contextualization. In the discussion, we also briefly touch upon scaling issues with such reproducibility studies, mentioning ReScience (pinging @khinsen) as an example. We are keen on putting this dataset to use, e.g. in educational settings (cc @fperez ).
As always, comments etc. are welcome.
Dear all, thanks for your participation in this thread.
The paper on this (with @Sheeba-Samuel) was published yesterday: Computational reproducibility of Jupyter notebooks from biomedical publications, https://doi.org/10.1093/gigascience/giad113 .
We remain interested in comments and potential follow-ups, so extra pings to @mwoodbri @fperez @khinsen @yuvipanda @rossmounce @tompollard @RomanGurinovich @choldgraf @JosephMcArthur .
With that, I am closing this ticket after nearly 7 years - feel free to open up new ones in relevant places to discuss potential follow-ups.
Extremely impressive feat! Well done @Sheeba-Samuel & @Daniel-Mietchen !
Jupyter notebooks are a popular vehicle these days to share data science workflows. To get an idea of best practices in this regard, it would be good to analyze a good number of them in terms of their reproducibility and other aspects of usability (e.g. documentation, ease of reuse).
A search in PubMed Central (PMC) reveals the following results:
With currently just 102 hits, a systematic reanalysis seems entirely doable and could perhaps itself be documented by way of reproducible notebooks that might eventually end up being mentioned in PMC.
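For anyone who wants to script such a search, here is a minimal sketch using Biopython's Entrez module; the query string is an assumption (the exact query is not shown above), and NCBI asks for a contact e-mail address.

```python
# Minimal sketch of a scripted PMC search; the query string is an assumption.
from Bio import Entrez

Entrez.email = "you@example.org"  # placeholder contact address required by NCBI

handle = Entrez.esearch(db="pmc", term='"jupyter notebook" OR "ipynb"', retmax=200)
record = Entrez.read(handle)
handle.close()

print(record["Count"], "hits")
for pmc_id in record["IdList"]:
    print("PMC" + pmc_id)
```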
A good starting point here could be An open RNA-Seq data analysis pipeline tutorial with an example of reprocessing data from a recent Zika virus study, for which both a Jupyter notebook and a Docker image are available.
I plan to give a lightning talk on this. Some background is in this recent news piece.