jupyterhub / mybinder.org-user-guide

Turn a Git repo into a collection of interactive notebooks. This is Binder's user documentation repository.
https://mybinder.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Add a guide to pinning dependencies #161

Open choldgraf opened 5 years ago

choldgraf commented 5 years ago

In a recent debugging session, @minrk pointed out that if you only partially pin your repository's dependencies (e.g. pin the numpy version but not Python), you are likely to break future reproducibility, because a newer release of an unpinned dependency can stop supporting the version you pinned.

We have a short guide to reproducibility here, and this would make a nice addition!

Edit from Tim: Want to help out with this? Suggested steps on what needs doing are here.

betatim commented 5 years ago

An issue worth finding and linking from here is the one about adding a repo2docker freeze command that produces a "well pinned" environment.yml for you.

minrk commented 5 years ago

Illustrative example that is coming up for folks right now:

# requirements.txt
notebook==5.7.4

Notebook 5.7.4 requires tornado>=4.3, but tornado 6 has since been released with changes that break notebook 5.7.4. Notebook 5.7.5 was released with the fix for tornado 6 compatibility. By pinning notebook but not tornado, you are guaranteeing future breakage: your env allows a package's dependencies to be upgraded, but does not allow the package itself to receive the upgrades it needs to stay compatible with those dependencies.

Two general approaches:

  1. freeze everything, e.g. with conda env export, or pip freeze into requirements.txt ("reproducible environment"); see the sketch after this list,
  2. use the latest of everything, and trust package maintainers to keep up ("living environment"). This may require you to update your repo once in a while, but it also keeps your repo relevant.
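
A minimal sketch of option 1, assuming a conda-based or pip-based repository (the commands are illustrative, not taken from the comments above); both capture every installed package, including transitive dependencies, at an exact version:

# conda-based repo: export the fully resolved environment
conda env export > environment.yml

# pip-based repo: record the exact version of every installed package
pip freeze > requirements.txt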

Specific things that should generally be avoided:

  1. exact pinning of just a few direct dependencies
  2. exact pinning of anything that has dependencies without also pinning its dependencies
  3. exact pinning of anything without also pinning the runtime (e.g. Python); this matters especially for compiled packages such as numpy (see the illustration after this list).
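
To illustrate the "pin everything together" alternative (the versions below are illustrative only, not taken from the comments above): pin the package, the dependency that can break it, and the runtime in the same repository.

# requirements.txt
notebook==5.7.8
tornado==6.0.4

# runtime.txt (read by repo2docker to select the Python version)
python-3.7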
choldgraf commented 5 years ago

so....it sounds like we should recommend an "all or none" approach, no?

betatim commented 5 years ago

+1 on Chris' suggestion and I like Min's example.

Action to take for someone who wants to help out on this issue: take Min's comment and include it in the guide to reproducibility, the source of which is https://github.com/jupyterhub/binder/blob/master/doc/tutorials/reproducibility.rst

mdeff commented 5 years ago

I like the "all or none" recommendation from @minrk.

I think what's missing is "best practices" on how to achieve "pin none" and "pin all", and when to choose which. (I faced the future-reproducibility issue myself by forgetting to pin the python version.)

Things to keep in mind:

Dependency management and reproducibility are really hard. Surely people have thought about these issues before. But where?

minrk commented 5 years ago

I think this is the "repo2docker freeze" command that's been discussed a few times. Essentially, it would run repo2docker to install everything and then run conda env export and/or pip freeze to generate the "frozen" version of the env within the repo2docker environment. There are several versions of freeze, depending on the kind of environment.

A first version of this is to use conda env export as the command passed to repo2docker, i.e.

$ jupyter-repo2docker https://github.com/binder-examples/conda -- conda env export -n root
Picked Git content provider.
Cloning into '/var/folders/9p/clj0fc754y35m01btd46043c0000gn/T/repo2dockerdc0h_3ne'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 6 (delta 0), reused 2 (delta 0), pack-reused 0
Unpacking objects: 100% (6/6), done.
Reusing existing image (r2dhttps-3a-2f-2fgithub-2ecom-2fbinder-2dexamples-2fconda4373085), not building.
name: root
channels:
  - conda-forge
  - defaults
dependencies:
  - attrs=19.1.0=py_0
  - backcall=0.1.0=py_0
  - bleach=3.1.0=py_0
  - bokeh=1.0.4=py37_1000
  - bzip2=1.0.6=h14c3975_1002
  - ca-certificates=2019.3.9=hecc5488_0
  - certifi=2019.3.9=py37_0
  - click=7.0=py_0
  - cloudpickle=0.8.1=py_0
  - conda=4.5.11=py37_1000
  - cryptography=2.6.1=py37h72c5cf5_0
  - cycler=0.10.0=py_1
  - cytoolz=0.9.0.1=py37h14c3975_1001
  - dask=1.1.4=py_0
  - dask-core=1.1.4=py_0
  - decorator=4.4.0=py_0
  - defusedxml=0.5.0=py_1
  - dill=0.2.9=py37_0
  - distributed=1.26.0=py37_1
  - entrypoints=0.3=py37_1000
  - expat=2.2.5=hf484d3e_1002
  - fontconfig=2.13.1=he4413a7_1000
  - freetype=2.10.0=he983fc9_0
  - gettext=0.19.8.1=hc5be6a0_1002
  - glib=2.56.2=had28632_1001
  - heapdict=1.0.0=py37_1000
  - icu=58.2=hf484d3e_1000
  - ipykernel=5.1.0=py37h24bf2e0_1002
  - ipython=7.4.0=py37h24bf2e0_0
  - ipython_genutils=0.2.0=py_1
  - ipywidgets=7.4.2=py_0
  - jedi=0.13.3=py37_0
  - jinja2=2.10=py_1
  - jpeg=9c=h14c3975_1001
  - jsonschema=3.0.1=py37_0
  - jupyter_client=5.2.4=py_3
  - jupyter_core=4.4.0=py_0
  - jupyterlab=0.35.4=py37_0
  - jupyterlab_server=0.2.0=py_0
  - kiwisolver=1.0.1=py37h6bb024c_1002
  - libblas=3.8.0=4_openblas
  - libcblas=3.8.0=4_openblas
  - libffi=3.2.1=he1b5a44_1006
  - libgfortran=3.0.0=1
  - libiconv=1.15=h516909a_1005
  - liblapack=3.8.0=4_openblas
  - libpng=1.6.36=h84994c4_1000
  - libsodium=1.0.16=h14c3975_1001
  - libtiff=4.0.10=h648cc4a_1001
  - libuuid=2.32.1=h14c3975_1000
  - libxcb=1.13=h14c3975_1002
  - libxml2=2.9.8=h143f9aa_1005
  - locket=0.2.0=py_2
  - markupsafe=1.1.1=py37h14c3975_0
  - matplotlib=3.0.3=py37_0
  - matplotlib-base=3.0.3=py37h167e16e_0
  - mistune=0.8.4=py37h14c3975_1000
  - msgpack-python=0.6.1=py37h6bb024c_0
  - nbconvert=5.4.1=py_2
  - nbformat=4.4.0=py_1
  - ncurses=6.1=hf484d3e_1002
  - notebook=5.7.6=py37_0
  - numpy=1.16.2=py37h8b7e671_1
  - olefile=0.46=py_0
  - openblas=0.3.5=ha44fe06_0
  - openssl=1.1.1b=h14c3975_1
  - packaging=19.0=py_0
  - pandas=0.24.2=py37hf484d3e_0
  - pandoc=2.7.1=0
  - pandocfilters=1.4.2=py_1
  - parso=0.3.4=py_0
  - partd=0.3.9=py_0
  - pexpect=4.6.0=py37_1000
  - pickleshare=0.7.5=py37_1000
  - pillow=5.4.1=py37h00a061d_1000
  - pip=19.0.3=py37_0
  - prometheus_client=0.6.0=py_0
  - prompt_toolkit=2.0.9=py_0
  - psutil=5.6.1=py37h14c3975_0
  - pthread-stubs=0.4=h14c3975_1001
  - ptyprocess=0.6.0=py37_1000
  - pygments=2.3.1=py_0
  - pyparsing=2.3.1=py_0
  - pyqt=5.6.0=py37h13b7fb3_1008
  - pyrsistent=0.14.11=py37h14c3975_0
  - python=3.7.2=h381d211_0
  - python-dateutil=2.8.0=py_0
  - pytz=2018.9=py_0
  - pyyaml=5.1=py37h14c3975_0
  - pyzmq=18.0.1=py37h0e1adb2_0
  - readline=7.0=hf8c457e_1001
  - send2trash=1.5.0=py_0
  - setuptools=40.8.0=py37_0
  - sip=4.18.1=py37hf484d3e_1000
  - six=1.12.0=py37_1000
  - sortedcontainers=2.1.0=py_0
  - sqlite=3.26.0=h67949de_1001
  - tblib=1.3.2=py_1
  - terminado=0.8.1=py37_1001
  - testpath=0.4.2=py_1001
  - tk=8.6.9=h84994c4_1000
  - toolz=0.9.0=py_1
  - tornado=6.0.1=py37h14c3975_0
  - traitlets=4.3.2=py37_1000
  - wcwidth=0.1.7=py_1
  - webencodings=0.5.1=py_1
  - wheel=0.33.1=py37_0
  - widgetsnbextension=3.4.2=py37_1000
  - xorg-libxau=1.0.9=h14c3975_0
  - xorg-libxdmcp=1.1.3=h516909a_0
  - xz=5.2.4=h14c3975_1001
  - zeromq=4.2.5=hf484d3e_1006
  - zict=0.1.4=py_0
  - zlib=1.2.11=h14c3975_1004
  - asn1crypto=0.24.0=py37_0
  - cffi=1.11.5=py37he75722e_1
  - chardet=3.0.4=py37_1
  - conda-env=2.6.0=1
  - dbus=1.13.2=h714fa37_1
  - gst-plugins-base=1.14.0=hbbd80ab_1
  - gstreamer=1.14.0=hb453b48_1
  - idna=2.7=py37_0
  - libedit=3.1.20170329=h6b74fdf_2
  - libgcc-ng=8.2.0=hdf63c60_1
  - libstdcxx-ng=8.2.0=hdf63c60_1
  - pcre=8.43=he6710b0_0
  - pycosat=0.6.3=py37h14c3975_0
  - pycparser=2.18=py37_1
  - pyopenssl=18.0.0=py37_0
  - pysocks=1.6.8=py37_0
  - qt=5.6.3=h8bf5577_3
  - requests=2.19.1=py37_0
  - ruamel_yaml=0.15.46=py37h14c3975_0
  - urllib3=1.23=py37_0
  - yaml=0.1.7=had09818_2
  - pip:
    - alembic==1.0.8
    - async-generator==1.10
    - jupyterhub==0.9.4
    - mako==1.0.8
    - msgpack==0.6.1
    - nteract-on-jupyter==2.0.0
    - pamela==1.0.0
    - python-editor==1.0.4
    - python-oauth2==1.1.0
    - sqlalchemy==1.3.1
prefix: /srv/conda

We'll then want to figure out what to do about "lockfiles", since this freeze pattern generally means there are two files: one that specifies the loose requirements, and one that records an actual working installation (Pipfile.lock, etc.). To use this right now, you would have to clobber the environment.yml, or use a top-level environment.yml for the loose specification and binder/environment.yml for the frozen one, or something similar (see the sketch below).
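
A minimal sketch of that loose/frozen split, relying on repo2docker preferring configuration files in a binder/ directory over those at the repository root (the layout is only an illustration, not an agreed convention):

my-repo/
├── environment.yml            # loose, human-edited specification
└── binder/
    └── environment.yml        # frozen output of `conda env export`, used by Binder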

betatim commented 5 years ago

I just learnt about pip install --constraint constraints.txt combined with pip freeze > constraints.txt via https://twitter.com/ChristianHeimes/status/1111228403250876417
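
A minimal sketch of that workflow, assuming a pip-based repo (the commands are illustrative; constraints.txt is just a conventional file name):

# first build: install the loose requirements, then record what was actually installed
pip install -r requirements.txt
pip freeze > constraints.txt

# later builds: the loose requirements still drive the install,
# but each resolved package is held to the recorded version
pip install -r requirements.txt --constraint constraints.txt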

pganssle commented 5 years ago

One thing to note here is that because binder apparently uses a specific conda env with a bunch of packages already pinned as the base Python environment, pip freeze > requirements.txt is not a very reliable way to create a reproducible environment.

This is because pip and conda are not perfectly compatible, and apparently if pip finds a conflicting requirement that has been installed by conda, it will fail (presumably because conda does not record the same install-time metadata, so pip doesn't know what to remove). Here's a minimal repo that reproduces the issue; you can see that the binder for this repo fails to build.

I think for now the documentation should be updated to mention that the pip freeze mechanism is unreliable, and you are better off using a conda env. In the long run, maybe binder can be updated to use a virtualenv if the repo has requirements.txt (and maybe runtime.txt) but not environment.yaml.

wragge commented 5 years ago

Thanks for the useful discussion! I haven't been including pinned versions of all dependencies, so I need to rethink what I'm doing.

@mdeff asks above:

I've seen binder overhaul pinned versions of jupyter. Does it still do that? What should we do if the version required by binder is different from the one required by the github repo? We could maybe recommend not pinning jupyter (or even not listing it as a dependency, it's just an editor after all), but what about its dependencies then?

I'm wondering the same thing. When I include a pinned version of jupyter I've found that the images build ok, but don't launch, so I've been leaving it out.

More generally, are there packages that shouldn't be pinned? For example, I just tried generating a new requirements.txt via pip freeze for a repo and found that the Binder build dies, complaining that pip can't uninstall certifi because it was installed by distutils.
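
One possible workaround, offered only as a hedged sketch (the package names below are illustrative, not a documented Binder recommendation), is to filter packages managed by the base conda environment out of the freeze output before committing it:

# drop packages the base environment already provides,
# so pip will not try to uninstall them during the Binder build
pip freeze | grep -v -E '^(certifi|chardet|idna)==' > requirements.txt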