Open choldgraf opened 5 years ago
An issue to find and link from here is the one about adding a repo2docker freeze command that produces a "well pinned" environment.yml for you.
Illustrative example that is coming up for folks right now:
# requirements.txt
notebook==5.7.4
Notebook 5.7.4 requires tornado>=4.3, but tornado 6 has since been released with changes that break notebook 5.7.4. Notebook 5.7.5 was released with the fix for tornado 6 compatibility. By pinning notebook but not tornado, you are guaranteeing future breakage: your env allows a package's dependencies to be upgraded while preventing the package itself from receiving the upgrades it needs to stay compatible with them.
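For illustration, a fully pinned requirements.txt avoids this failure mode by freezing the dependency alongside the package. (A hedged sketch; the tornado version below is simply one release that predates the breaking 6.0 series.)

# requirements.txt -- pin the package *and* the dependency that broke it
notebook==5.7.4
tornado==5.1.1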
Two general approaches:
- pin nothing, and always get the latest versions of everything
- pin everything, via conda env export or pip freeze plus requirements.txt ("reproducible environment")

Specific things that should generally be avoided: partial pinning, as in the example above.
so....it sounds like we should recommend an "all or none" approach, no?
+1 on Chris' suggestion and I like Min's example.
Action to take for someone who wants to help out on this issue: Take Min's comment and include it in the guide to reproducibility the source of which is https://github.com/jupyterhub/binder/blob/master/doc/tutorials/reproducibility.rst
I like the "all or none" recommendation from @minrk.
I think what's missing is "best practices" on how to achieve "pin none" and "pin all", and when to choose which. (I faced the future-reproducibility issue myself by forgetting to pin the python version.)
Things to keep in mind:
- conda env export is not cross-platform, i.e., you cannot create the env on Linux, export it, and recreate it on Windows or macOS. You need to export manually on all three platforms and pin only the lowest common denominator, then pray that packages have pinned their dependencies tightly enough that nothing breaks too soon (see the export flags sketched after this list). I hit this issue when teaching a university course where students could be on any platform and I wanted binder as a backup.
- repo2docker could be used to create a future-proof Dockerfile. That would pin the dependency chain down to the kernel API, which is absolutely stable. The downside is that binder would need to guarantee long-term support for those Dockerfiles. (A deprecation warning could be raised automatically when an old Dockerfile version is built, perhaps even automatically opening an issue in the github repo.)

Dependency management and reproducibility are really hard. Surely people have thought about these issues before, but where?
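As an aside (not from the thread, and assuming a reasonably recent conda): conda env export accepts a --no-builds flag that drops the platform-specific build strings, which makes the exported environment.yml somewhat more portable, though exact versions may still be unavailable on another platform.

# full export: versions plus platform-specific build strings (not portable)
conda env export -n root > environment.yml

# drop the build strings; more portable, though still not guaranteed cross-platform
conda env export -n root --no-builds > environment.yml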
I think this is the "repo2docker freeze" command that's been discussed a few times. Essentially, it would run repo2docker to install everything and then run conda env export and/or pip freeze to generate the "frozen" version of the env within the repo2docker environment. There are several versions of freeze, depending on the kind of environment.
A first version of this is to use conda env export as the command passed to repo2docker, i.e.
$ jupyter-repo2docker https://github.com/binder-examples/conda -- conda env export -n root
Picked Git content provider.
Cloning into '/var/folders/9p/clj0fc754y35m01btd46043c0000gn/T/repo2dockerdc0h_3ne'...
remote: Enumerating objects: 6, done.
remote: Counting objects: 100% (6/6), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 6 (delta 0), reused 2 (delta 0), pack-reused 0
Unpacking objects: 100% (6/6), done.
Reusing existing image (r2dhttps-3a-2f-2fgithub-2ecom-2fbinder-2dexamples-2fconda4373085), not building.
name: root
channels:
- conda-forge
- defaults
dependencies:
- attrs=19.1.0=py_0
- backcall=0.1.0=py_0
- bleach=3.1.0=py_0
- bokeh=1.0.4=py37_1000
- bzip2=1.0.6=h14c3975_1002
- ca-certificates=2019.3.9=hecc5488_0
- certifi=2019.3.9=py37_0
- click=7.0=py_0
- cloudpickle=0.8.1=py_0
- conda=4.5.11=py37_1000
- cryptography=2.6.1=py37h72c5cf5_0
- cycler=0.10.0=py_1
- cytoolz=0.9.0.1=py37h14c3975_1001
- dask=1.1.4=py_0
- dask-core=1.1.4=py_0
- decorator=4.4.0=py_0
- defusedxml=0.5.0=py_1
- dill=0.2.9=py37_0
- distributed=1.26.0=py37_1
- entrypoints=0.3=py37_1000
- expat=2.2.5=hf484d3e_1002
- fontconfig=2.13.1=he4413a7_1000
- freetype=2.10.0=he983fc9_0
- gettext=0.19.8.1=hc5be6a0_1002
- glib=2.56.2=had28632_1001
- heapdict=1.0.0=py37_1000
- icu=58.2=hf484d3e_1000
- ipykernel=5.1.0=py37h24bf2e0_1002
- ipython=7.4.0=py37h24bf2e0_0
- ipython_genutils=0.2.0=py_1
- ipywidgets=7.4.2=py_0
- jedi=0.13.3=py37_0
- jinja2=2.10=py_1
- jpeg=9c=h14c3975_1001
- jsonschema=3.0.1=py37_0
- jupyter_client=5.2.4=py_3
- jupyter_core=4.4.0=py_0
- jupyterlab=0.35.4=py37_0
- jupyterlab_server=0.2.0=py_0
- kiwisolver=1.0.1=py37h6bb024c_1002
- libblas=3.8.0=4_openblas
- libcblas=3.8.0=4_openblas
- libffi=3.2.1=he1b5a44_1006
- libgfortran=3.0.0=1
- libiconv=1.15=h516909a_1005
- liblapack=3.8.0=4_openblas
- libpng=1.6.36=h84994c4_1000
- libsodium=1.0.16=h14c3975_1001
- libtiff=4.0.10=h648cc4a_1001
- libuuid=2.32.1=h14c3975_1000
- libxcb=1.13=h14c3975_1002
- libxml2=2.9.8=h143f9aa_1005
- locket=0.2.0=py_2
- markupsafe=1.1.1=py37h14c3975_0
- matplotlib=3.0.3=py37_0
- matplotlib-base=3.0.3=py37h167e16e_0
- mistune=0.8.4=py37h14c3975_1000
- msgpack-python=0.6.1=py37h6bb024c_0
- nbconvert=5.4.1=py_2
- nbformat=4.4.0=py_1
- ncurses=6.1=hf484d3e_1002
- notebook=5.7.6=py37_0
- numpy=1.16.2=py37h8b7e671_1
- olefile=0.46=py_0
- openblas=0.3.5=ha44fe06_0
- openssl=1.1.1b=h14c3975_1
- packaging=19.0=py_0
- pandas=0.24.2=py37hf484d3e_0
- pandoc=2.7.1=0
- pandocfilters=1.4.2=py_1
- parso=0.3.4=py_0
- partd=0.3.9=py_0
- pexpect=4.6.0=py37_1000
- pickleshare=0.7.5=py37_1000
- pillow=5.4.1=py37h00a061d_1000
- pip=19.0.3=py37_0
- prometheus_client=0.6.0=py_0
- prompt_toolkit=2.0.9=py_0
- psutil=5.6.1=py37h14c3975_0
- pthread-stubs=0.4=h14c3975_1001
- ptyprocess=0.6.0=py37_1000
- pygments=2.3.1=py_0
- pyparsing=2.3.1=py_0
- pyqt=5.6.0=py37h13b7fb3_1008
- pyrsistent=0.14.11=py37h14c3975_0
- python=3.7.2=h381d211_0
- python-dateutil=2.8.0=py_0
- pytz=2018.9=py_0
- pyyaml=5.1=py37h14c3975_0
- pyzmq=18.0.1=py37h0e1adb2_0
- readline=7.0=hf8c457e_1001
- send2trash=1.5.0=py_0
- setuptools=40.8.0=py37_0
- sip=4.18.1=py37hf484d3e_1000
- six=1.12.0=py37_1000
- sortedcontainers=2.1.0=py_0
- sqlite=3.26.0=h67949de_1001
- tblib=1.3.2=py_1
- terminado=0.8.1=py37_1001
- testpath=0.4.2=py_1001
- tk=8.6.9=h84994c4_1000
- toolz=0.9.0=py_1
- tornado=6.0.1=py37h14c3975_0
- traitlets=4.3.2=py37_1000
- wcwidth=0.1.7=py_1
- webencodings=0.5.1=py_1
- wheel=0.33.1=py37_0
- widgetsnbextension=3.4.2=py37_1000
- xorg-libxau=1.0.9=h14c3975_0
- xorg-libxdmcp=1.1.3=h516909a_0
- xz=5.2.4=h14c3975_1001
- zeromq=4.2.5=hf484d3e_1006
- zict=0.1.4=py_0
- zlib=1.2.11=h14c3975_1004
- asn1crypto=0.24.0=py37_0
- cffi=1.11.5=py37he75722e_1
- chardet=3.0.4=py37_1
- conda-env=2.6.0=1
- dbus=1.13.2=h714fa37_1
- gst-plugins-base=1.14.0=hbbd80ab_1
- gstreamer=1.14.0=hb453b48_1
- idna=2.7=py37_0
- libedit=3.1.20170329=h6b74fdf_2
- libgcc-ng=8.2.0=hdf63c60_1
- libstdcxx-ng=8.2.0=hdf63c60_1
- pcre=8.43=he6710b0_0
- pycosat=0.6.3=py37h14c3975_0
- pycparser=2.18=py37_1
- pyopenssl=18.0.0=py37_0
- pysocks=1.6.8=py37_0
- qt=5.6.3=h8bf5577_3
- requests=2.19.1=py37_0
- ruamel_yaml=0.15.46=py37h14c3975_0
- urllib3=1.23=py37_0
- yaml=0.1.7=had09818_2
- pip:
- alembic==1.0.8
- async-generator==1.10
- jupyterhub==0.9.4
- mako==1.0.8
- msgpack==0.6.1
- nteract-on-jupyter==2.0.0
- pamela==1.0.0
- python-editor==1.0.4
- python-oauth2==1.1.0
- sqlalchemy==1.3.1
prefix: /srv/conda
We'll then want to figure out what to do about "lockfiles", since this freeze pattern generally means there are two files: one that specifies the loose requirements, and one that records an actual working installation (Pipfile.lock, etc.). To use this right now, you would have to clobber the environment.yml, or use the top-level environment.yml for the loose spec and binder/environment.yml for the frozen one, or something similar.
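A sketch of the two-file layout described above (the placement is a suggestion building on the comment; repo2docker prefers configuration under binder/ when that directory exists):

environment.yml          # loose, human-edited spec
binder/environment.yml   # frozen output of `conda env export`, used for the build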
I just learnt about pip install --constraint constraints.txt combined with pip freeze > constraints.txt via https://twitter.com/ChristianHeimes/status/1111228403250876417
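For anyone who hasn't seen this pattern, a minimal sketch of the workflow (the file names are conventional, not mandated by pip):

# once the environment is known-good, record the exact working set
pip freeze > constraints.txt

# later installs resolve the loose requirements against those pins
pip install -r requirements.txt -c constraints.txt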
One thing to note here is that because binder apparently uses a specific conda env, with a bunch of packages already pinned, as the base Python environment, pip freeze > requirements.txt is not a very reliable way to create a reproducible environment. This is because pip and conda are not perfectly compatible: apparently, if pip finds a conflicting requirement that was installed by conda, it will fail (presumably because conda does not record the same install-time metadata, so pip doesn't know what to remove). Here's a minimal repo that reproduces the issue; you can see that the binder for this repo fails to build.
I think for now the documentation should be updated to mention that the pip freeze mechanism is unreliable, and you are better off using a conda env. In the long run, maybe binder can be updated to use a virtualenv if the repo has requirements.txt (and maybe runtime.txt) but not environment.yaml.
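A rough sketch of what that could look like inside the built image (purely hypothetical; the path is made up and this is not how binder behaves today):

# build an isolated virtualenv so pip never fights the conda base env
python3 -m venv /srv/venv
/srv/venv/bin/pip install -r requirements.txt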
Thanks for the useful discussion! I haven't been including pinned versions of all dependencies, so I need to rethink what I'm doing.
@mdeff asks above:
I've seen binder override pinned versions of jupyter. Does it still do that? What should we do if the version required by binder is different from the one required by the github repo? We could maybe recommend not pinning jupyter (or even not listing it as a dependency; it's just an editor after all), but what about its dependencies then?
I'm wondering the same thing. When I include a pinned version of jupyter I've found that the images build ok, but don't launch, so I've been leaving it out.
More generally, are there packages that shouldn't be pinned? For example, I just tried generating a new requirements.txt via pip freeze for a repo and found that the Binder build dies, complaining that pip can't uninstall certifi as it was installed by distutils.
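One workaround people use (a hedged sketch, not an official recommendation) is to filter packages like certifi out of the frozen list before committing it:

# drop entries pip cannot cleanly uninstall from the base image
pip freeze | grep -v '^certifi==' > requirements.txt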
In a recent debugging session, @minrk pointed out that if you only partially pin your repository (e.g. pin numpy versions but don't pin Python), you are likely going to break future reproducibility, because a non-pinned package may drop support for your pinned versions.
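For instance, a minimal environment.yml that pins the interpreter along with the package (versions are illustrative only):

name: example
channels:
  - conda-forge
dependencies:
  - python=3.7
  - numpy=1.16.2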
We have a short guide to reproducibility here, and this would make a nice addition!
Edit from Tim: Want to help out with this? Suggested steps on what needs doing are here.