pangeo-data / pangeo

Pangeo website + discussion of general issues related to the project.
http://pangeo.io

organization of example notebook gallery #201

Closed rabernat closed 5 years ago

rabernat commented 6 years ago

Within the pangeo / xarray milieu, we have a proliferation of gallery / example sites

This is not an ideal situation.

Quoting @fmaussion in #191

one issue is "simply" organizational and can be solved by dedicated refactoring / selection of examples. The other issue I see (here and for other computationally expensive projects I'm involved in) is that some examples can't be built on ReadTheDocs, which is awesome but slow and unreliable. Building the docs elsewhere would help to have more expensive examples with the risk of static notebooks which will be difficult to maintain.

What are some ideas for both how to re-organize these and also how to auto-build expensive notebooks?

rabernat commented 6 years ago

One idea would be to use pangeo.pydata.org to build the expensive notebooks. If we could find some CI magic to actually spawn a pod within the cluster and build the docs from it (we currently use travis: travis.yml), then we could also spawn dask clusters directly from the notebooks. This would be extremely cool.

mrocklin commented 6 years ago

Dask examples live here: https://github.com/dask/dask-examples

mrocklin commented 6 years ago

One idea would be to use pangeo.pydata.org to build the expensive notebooks

I'm against this. These examples are likely to outlive our deployment here. We're also not stable enough to form this level of infrastructure.

Instead, I suspect that people usually use travis-ci or circle-ci. For very large examples I suggest that we just discourage them or else run them statically and don't test them.

rabernat commented 6 years ago

I see your point. So that suggests examples will fall into two categories:

jbednar commented 6 years ago

FYI, my group's sites Datashader.org, HoloViews.org, PyViz.org, and EarthSim follow roughly the pattern that Matt is describing, using Travis CI to build all the notebooks that are practical to build in the CI system. There are other notebooks (e.g. the Datashader Topics) that require data files too big to fetch during a Travis build, and for those we use the same machinery on a local machine (still automated, but only spawned explicitly) and publish the results. There's one additional wrinkle, which is that even for the huge datasets we make a tiny subset available for the regular CI builds so that the basic process of running the notebook is tested and repeatable; it's only the large-data aspects that are done manually.

We are in the process of making all of this infrastructure available as the NBSite and PyCT projects, which automate the process of turning notebooks into automatically built, tested, and deployed websites. If you're interested, the projects are still a bit rough (especially in visual appearance!), but are rapidly settling down to provide support for the types of work described above.
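The "tiny subset for CI" wrinkle described above can be sketched roughly as follows. This is only an illustration, not code from NBSite or PyCT: the file names and both helper functions are hypothetical, and the only thing it relies on is that Travis CI sets `CI=true` in the build environment.

```python
import os

# Hypothetical file names, for illustration only.
FULL_DATASET = "data/full_dataset.parq"
SAMPLE_DATASET = "data/sample_subset.parq"


def running_in_ci(environ=None):
    """Return True when the process looks like a CI build (CI=true)."""
    environ = os.environ if environ is None else environ
    return environ.get("CI", "").lower() == "true"


def dataset_path(environ=None):
    """Pick the small sample in CI and the full data everywhere else."""
    return SAMPLE_DATASET if running_in_ci(environ) else FULL_DATASET
```

A notebook would call `dataset_path()` in its first cell, so the same notebook runs end to end on the small sample in CI and on the full data when built explicitly on a local machine.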

rabernat commented 6 years ago

nbsite looks amazing and perfect for many of these needs. Would you care to estimate how long until a reasonably stable release is available?

For the sake of Friday afternoon discussion, let me make this a little more ambitious...the logical extension of what we are talking about is a framework for publishing reproducible science. One long-term goal for pangeo could be to actually publish a peer-reviewed journal where computational reproducibility is baked into the publishing infrastructure. Imagine if every paper in a journal were a complete, self-executing document that could be spawned into an executable environment on demand. Such "papers" could also grow with the underlying datasets. For example, a time series of sea level rise could continue to update long after the original publication date as new observations come online.

While it may sound far-fetched, I think all the basic technology to achieve this already exists.

mrocklin commented 6 years ago

This sounds similar to https://mybinder.org/

rabernat commented 6 years ago

Yes that is part of the technology I am referring to. But to achieve what I described, one would need to couple the binder execution environment to the actual building of the published html content.

jbednar commented 6 years ago

As it happens, we need that too. :-) Adding an easy way to deploy a notebook into a runnable binder-based copy is already on our list, but that's a goal we've put aside for the time being to focus on making the overall workflow of turning the crank from notebooks to a deployed (static) site more polished and more straightforward. As soon as that's done, it's back to working on putting in the "Try me out on Binder" button.

NBSite is already fully functional and is in use for building EarthSim and PyViz.org, but we have about a dozen other projects using various older versions of it, and we are currently focusing on cleaning up a few rough bits and then deploying it uniformly across all our projects so that we can have only one single package to maintain. This coming week we'll all be putting on a conference, but this is something we'll get back to as soon as we get back and should expect to be deploying over the following week. So it's available already, but if you can hold off a couple of weeks we'll have it much more in a state ready to copy and spread.

jbednar commented 6 years ago

BTW, NBSite focuses on making it easy to maintain a Python package supported by notebooks, but our original interest in making HoloViews work well with notebooks came from trying to ensure that our research was reproducible, for which we've published a manifesto. That paper predates some of the tools now available, but I think that the combination of:

is a pretty compelling way to capture what's been achieved, archive it, and then launch people off to build on the work. If only all my students had had this when finishing up all their projects, so other people could pick up precisely where they left off!

tjcrone commented 6 years ago

https://www.theatlantic.com/science/archive/2018/04/the-scientific-paper-is-obsolete/556676/

c/o @friedrichknuth

kmpaul commented 6 years ago

Anybody thought about using https://github.com/nbgallery/nbgallery? I've heard some people say nice things about it and its extensions, https://github.com/nbgallery/nbgallery-extensions, which can integrate it with JupyterHub.

rabernat commented 6 years ago

Do we know whether it is possible to use nbgallery without rebuilding the notebooks? As discussed above, some use case notebooks will be very expensive computationally and not practical to auto-build. It sounds like nbgallery is really designed as a notebook build environment.

edit: by "build" I mean execute

kmpaul commented 6 years ago

@rabernat Yeah. I'm not sure. I've just heard some people "rave" about nbgallery, but I think you are correct: it provides an environment that executes the notebooks. I will downgrade my own response.

Chilipp commented 6 years ago

Concerning the issue about expensive notebooks on RTD, I also developed a sphinx extension named sphinx-nbexamples that works the same as sphinx-gallery but renders jupyter notebooks instead of python scripts by using nbconvert.

Therefore you can select in the sphinx conf.py whether you want to execute the notebook or not (see the docs), e.g. via

example_gallery_config = dict(
    dont_preprocess=['../examples/expensive_notebook.ipynb'],
    )

I use that in my psyplot project and you can also link to other documentations to include their examples as well.
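If there are many expensive notebooks, the list could be built rather than typed out. The following is a minimal sketch, assuming the expensive notebooks are kept in their own directory and that the docs build is run from the directory containing conf.py; the `expensive_notebooks` helper and the `../examples/expensive` path are hypothetical and not part of sphinx-nbexamples.

```python
import glob
import os.path as osp


def expensive_notebooks(directory):
    """Return sorted paths of all .ipynb files directly inside `directory`."""
    return sorted(glob.glob(osp.join(directory, "*.ipynb")))


# '../examples/expensive' is a hypothetical directory holding notebooks
# that are too costly to execute during the docs build.
example_gallery_config = dict(
    dont_preprocess=expensive_notebooks("../examples/expensive"),
)
```

With this layout, moving a notebook into or out of that directory is enough to switch it between static rendering and execution at build time.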

rabernat commented 6 years ago

@Chilipp I just looked into sphinx-nbexamples. It looks perfect for the pangeo website.

If you would like to try to refactor our use case gallery to use sphinx-nbexamples, that would be extremely appreciated. PR welcome!

Chilipp commented 6 years ago

Sure @rabernat! I'll keep you posted

jhamman commented 6 years ago

+1 for sphinx-nbexamples. I have been working on setting up a pangeo binder server (#283) which would allow us to launch our notebook examples directly from our documentation site.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 5 years ago

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.