uwhackweek / jupyterbook-template

Template repository for UW Hackweek JupyterBook websites
http://book-template.hackweek.io/
MIT License

notes on tutorial notebook development strategies #14

Open scottyhq opened 3 years ago

scottyhq commented 3 years ago

EDIT: This thread applies to the SnowEx 2021 event: https://github.com/snowex-hackweek/website

We've made the intentional choice to require everyone to use a common software environment for tutorial development. This environment is currently defined in a separate repository: https://github.com/uwhackweek/docker-template. We bundle the conda environment into a Docker image so that it is available to run on JupyterHub and Binder services during and after a hackweek. There are pros and cons to this approach:

PROS:

CONS:

There are alternative approaches that add complexity but allow using more than one conda environment (or execution 'kernel') per tutorial notebook, for example:

  1. https://github.com/treebeardtech/nbmake-action/blob/main/.github/workflows/action_integration_test.yml
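For the multi-kernel route, a minimal sketch (not taken from the linked workflow) of registering one named Jupyter kernel per conda environment, so that each notebook can declare which kernel it runs on; the environment names below are hypothetical:

    import subprocess

    # Register one named Jupyter kernel per conda environment so each tutorial
    # notebook can declare its kernel in its metadata. Assumes ipykernel is
    # installed in each environment; the environment names are hypothetical.
    for env_name in ["hackweek-core", "hackweek-ml"]:
        subprocess.run(
            [
                "conda", "run", "-n", env_name,
                "python", "-m", "ipykernel", "install",
                "--user", "--name", env_name,
                "--display-name", f"Python ({env_name})",
            ],
            check=True,
        )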

Regardless of whether a single kernel or different kernels are used, notebooks are also built via GitHub Actions and are therefore subject to certain hardware constraints: at the time of writing, a 2-core CPU, 7 GB of RAM, and 14 GB of SSD disk space.

If developing more advanced tutorials that require more RAM or distributed clusters, one solution is to use a custom BinderHub to run the notebooks and return rendered outputs. That is what is done for https://gallery.pangeo.io. See more discussion about this at https://discourse.jupyter.org/t/creating-a-future-infrastructure-for-notebooks-to-be-submitted-and-peer-reviewed/3534/24 and https://discourse.jupyter.org/t/binder-notebook-builder-bot/2311/6

scottyhq commented 3 years ago

The above comment mainly discusses computing requirements and the software environment, but there is also the issue of data. For example, imagine we have 10 tutorials that each use 5 GB of data. You could upload a 50 GB .tar file to Zenodo, and notebooks could start by pulling that data and unzipping it. But if we are rendering the book website from tutorial notebooks this will fail, because the GitHub Actions server only has 14 GB of disk space.

Ideally, all data is network-accessible (for example on S3) so that tutorial data can be streamed rather than downloaded first. That might not suit all tutorials. Some other ideas:

None of these data-hosting solutions solves the 14 GB upper limit for building the book. For larger books, or tutorials using large datasets, it might be useful to explore some of the solutions above for building examples via Binder, or to re-configure GitHub Actions to build each notebook separately (so that each gets its own 14 GB limit) and then consolidate the rendered HTML and publish the website.
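For the streaming option, a rough sketch of what a tutorial notebook could do instead of downloading data first (the bucket and path are hypothetical placeholders, and the example assumes Zarr data on S3 readable anonymously, with s3fs installed):

    import fsspec
    import xarray as xr

    # Open tutorial data directly from object storage instead of downloading it
    # to the 14 GB runner disk first. The bucket and path are hypothetical
    # placeholders, and the example assumes the data is stored as Zarr.
    mapper = fsspec.get_mapper(
        "s3://hypothetical-hackweek-bucket/tutorial-data.zarr", anon=True
    )
    ds = xr.open_zarr(mapper, consolidated=True)
    print(ds)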

lsetiawan commented 3 years ago

From @scottyhq, who provided another potential solution:

Just played around with an idea of making Zarr or COGs available via GitHub Pages, which I think is a cool solution for versioned tutorial data without needing the high performance of object stores (https://github.com/scottyhq/zarrdata). The main caveat is that individual files (or Zarr chunks) would need to be less than 100 MB and the total repo size should be < 1 GB.

This is a great potential solution, but it would require converting tutorial data to Zarr datasets. It could be even better if we can figure out the best compression scheme, so that we can host bigger datasets without going over the limitations above: https://the-fonz.gitlab.io/posts/compress-zarr-meteo/
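As a rough illustration of what that conversion might look like (the synthetic dataset, chunk size, and compressor choice here are placeholders, not a recommendation; requires xarray, dask, and numcodecs):

    import numpy as np
    import xarray as xr
    from numcodecs import Blosc

    # Illustrative only: a synthetic stand-in for tutorial data. Rechunk so each
    # Zarr chunk stays well under GitHub's 100 MB file limit and apply zstd
    # compression to keep the total repository size down.
    ds = xr.Dataset(
        {"air": (("time", "y", "x"), np.random.rand(2000, 200, 200).astype("float32"))}
    )
    compressor = Blosc(cname="zstd", clevel=5, shuffle=Blosc.SHUFFLE)
    encoding = {var: {"compressor": compressor} for var in ds.data_vars}
    ds.chunk({"time": 250}).to_zarr("tutorial-data.zarr", mode="w", encoding=encoding)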

Something to possibly keep in mind is that converting tutorial data to Zarr datasets would require us to do more work on that end, effectively making us a data provider. Unless we have an automated process for this... so many options! :grimacing:

scottyhq commented 3 years ago

so many options! 😬

...Also opened a Pangeo Discourse topic, so there will likely be even more ideas showing up :) https://discourse.pangeo.io/t/recommendations-for-self-hosted-100gb-datasets/1469

scottyhq commented 3 years ago

Some additional notes (mainly challenges) after several tutorial contributions:

  1. Any notebook that uses `raw_input` (commonly `getpass` for a username and password) can't easily be built when rendering the JupyterBook on GitHub Actions. Scripts need to use public data, or need a way to authenticate with environment variables or tokens that can be passed via GitHub Secrets (see the sketch after this list).

  2. Building the book via Docker adds additional complexity because environment variables need to be injected (AWS credentials to read a bucket, or, in the case of NASA data, a ~/.netrc file needs to be created inside the running Docker container).
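A hedged sketch of the non-interactive authentication pattern described above; the EARTHDATA_* variable names are illustrative, not something the template defines:

    import os
    from getpass import getpass
    from pathlib import Path

    # Prefer credentials passed as environment variables (e.g. from GitHub
    # Secrets) and only fall back to an interactive prompt when a person is
    # running the notebook. The EARTHDATA_* variable names are illustrative.
    username = os.environ.get("EARTHDATA_USERNAME") or input("Username: ")
    password = os.environ.get("EARTHDATA_PASSWORD") or getpass("Password: ")

    # For NASA Earthdata-style authentication, write ~/.netrc inside the running
    # container so libraries that read it can authenticate non-interactively.
    netrc_path = Path.home() / ".netrc"
    netrc_path.write_text(
        f"machine urs.earthdata.nasa.gov login {username} password {password}\n"
    )
    netrc_path.chmod(0o600)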

scottyhq commented 3 years ago

As the number of tutorial notebooks grows, it quickly becomes apparent that it would be good to make use of a cache and only execute notebooks that have changed: https://jupyterbook.org/content/execute.html

For the SnowEx hackweek, some tutorial notebooks take several minutes to run, resulting in 10+ minutes of build time in CI for each commit. I think you would only want to re-execute all notebooks when the environment changes (new package versions in the Docker image). In that case it would also be nice to execute them in parallel, since they are independent.

scottyhq commented 3 years ago

One thing to watch out for is accidentally bloating the website repository to > 1 GB with rendered Jupyter notebook outputs on the gh-pages branch (for example, by generating really high-resolution matplotlib figures). Fortunately we haven't had that issue, but with ~10 tutorial notebooks the SnowEx hackweek repository quickly grew to over 100 MB. There seem to be newer git commands to check the size of branches (https://stackoverflow.com/questions/32557849/get-git-branch-size). We could also add a CI check that reports the size of rendered notebooks (a sketch follows the clone sizes below).

    git clone https://github.com/snowex-hackweek/website.git                          --> 237 MB
    git clone -b main --single-branch https://github.com/snowex-hackweek/website.git  --> 157 MB
    git clone --depth 1 https://github.com/snowex-hackweek/website.git                --> 158 MB
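A minimal sketch of the CI size check mentioned above (the "book" directory name is a placeholder for wherever the notebooks live, and the 5 MB threshold is arbitrary):

    from pathlib import Path

    # Report the size of each rendered notebook so an oversized matplotlib
    # figure is caught before it bloats the gh-pages branch.
    total_mb = 0.0
    for nb in sorted(Path("book").rglob("*.ipynb")):
        size_mb = nb.stat().st_size / 1e6
        total_mb += size_mb
        flag = "  <-- large!" if size_mb > 5 else ""
        print(f"{size_mb:7.2f} MB  {nb}{flag}")
    print(f"{total_mb:7.2f} MB  total")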

scottyhq commented 3 years ago

GitHub Actions and Docker add confusion over user IDs and permissions. For example, a tutorial notebook that installs an additional package can hit permission issues:

# >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<

    Traceback (most recent call last):
      File "/srv/conda/lib/python3.8/site-packages/conda/exceptions.py", line 1079, in __call__
        return func(*args, **kwargs)
      File "/srv/conda/lib/python3.8/site-packages/mamba/mamba.py", line 882, in exception_converter
        raise e
      File "/srv/conda/lib/python3.8/site-packages/mamba/mamba.py", line 876, in exception_converter
        exit_code = _wrapped_main(*args, **kwargs)
      File "/srv/conda/lib/python3.8/site-packages/mamba/mamba.py", line 835, in _wrapped_main
        result = do_call(args, p)
      File "/srv/conda/lib/python3.8/site-packages/mamba/mamba.py", line 716, in do_call
        exit_code = install(args, parser, "install")
      File "/srv/conda/lib/python3.8/site-packages/mamba/mamba.py", line 514, in install
        index = load_channels(pool, channels, repos)
      File "/srv/conda/lib/python3.8/site-packages/mamba/utils.py", line 93, in load_channels
        index = get_index(
      File "/srv/conda/lib/python3.8/site-packages/mamba/utils.py", line 62, in get_index
        api.create_cache_dir(), api.cache_fn_url(full_url)
    RuntimeError: Permission denied: '/srv/conda/pkgs/cache'

`$ /srv/conda/condabin/mamba install -y -q tensorflow=2.5`

The first issue can be solved with `!CONDA_PKGS_DIRS=/tmp/pkgs mamba install -y -q tensorflow=2.5`, but then you run into:

failed

EnvironmentNotWritableError: The current user does not have write permissions to the target environment.
  environment location: /srv/conda/envs/notebook
  uid: 1001
  gid: 0

see also: https://github.com/snowex-hackweek/website/pull/79
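One possible pattern for handling both errors from inside a notebook is sketched below; this is only an assumption-laden illustration (the fallback to a per-user pip install is not necessarily what was done in the PR linked above):

    import os
    import subprocess
    import sys

    # Install an extra package from a notebook running in a container where the
    # conda environment may not be writable by the notebook user.
    def notebook_install(conda_spec, pip_spec):
        if os.access(sys.prefix, os.W_OK):  # e.g. /srv/conda/envs/notebook
            # Writable env: use mamba, but keep the package cache in /tmp to
            # avoid "Permission denied: '/srv/conda/pkgs/cache'".
            env = dict(os.environ, CONDA_PKGS_DIRS="/tmp/pkgs")
            subprocess.run(["mamba", "install", "-y", "-q", conda_spec], env=env, check=True)
        else:
            # Read-only env: fall back to a per-user pip install (an assumption,
            # not necessarily the fix used in the PR above).
            subprocess.run(
                [sys.executable, "-m", "pip", "install", "--user", pip_spec], check=True
            )

    notebook_install("tensorflow=2.5", "tensorflow==2.5")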

jomey commented 3 years ago

Here is a potential approach using .gitattributes and the `jupyter nbconvert` command that clears notebook outputs before a commit, at git's staging level. This could be part of the repository, to reduce the effort needed for people to set up their git configuration. Personally, I am using the nbconvert command in my local environment and it works well; I have not tried to automate it via .gitattributes yet: https://zhauniarovich.com/post/2020/2020-10-clearing-jupyter-output-p3/
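If a git filter feels too heavy, a similar effect can be had with a small script run before committing. A minimal sketch using nbformat (not the .gitattributes filter from the linked post; the script name and paths are hypothetical):

    import sys
    import nbformat

    # Strip outputs and execution counts from the notebooks passed on the
    # command line, e.g.  python clear_outputs.py book/tutorials/*.ipynb
    for path in sys.argv[1:]:
        nb = nbformat.read(path, as_version=4)
        for cell in nb.cells:
            if cell.cell_type == "code":
                cell.outputs = []
                cell.execution_count = None
        nbformat.write(nb, path)
        print(f"cleared outputs: {path}")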