jupyterhub / repo2docker

Turn repositories into Jupyter-enabled Docker images
https://repo2docker.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Provide support or recommendation for how to interact with conda-lock lockfiles #1312

Open matthewfeickert opened 12 months ago

matthewfeickert commented 12 months ago

Proposed change

At the moment, repo2docker supports conda/mamba/micromamba environment.yml environment files as Binder config files. This is great, but even if you pin packages with ==, their dependencies can still float, so reproducibility can break over time. For long-term reproducible builds (e.g. launching into Binder from a Zenodo DOI) you would also want repo2docker to work with lock files. As the conda ecosystem is already supported, a natural extension would be to use conda-lock, and with mamba/micromamba you can interact with conda-lock lock files on nearly equal footing with an environment.yml.

However, at the moment, if you place a conda-lock lock file named environment.yml under a binder/ directory in a repo, repo2docker will fail to build from it and error with

EnvironmentSectionNotValid: The following sections on '/home/jovyan/binder/environment.yml' are invalid and will be ignored:
 - version
 - metadata
 - package

(c.f. https://github.com/matthewfeickert-talks/talk-pyhep-2023/pull/5)

It would be super nice if conda-lock lock files were supported as a valid repo2docker config file.
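The error lists version, metadata, and package because those are the top-level sections of a unified conda-lock file, whereas conda env expects environment.yml sections such as name, channels, and dependencies. As a minimal sketch (not how repo2docker behaves today), a tool could tell the two apart by looking only at the top-level keys. The function names are hypothetical, and the check is a line-based heuristic chosen to avoid a YAML dependency, not a full parse.

```python
import re

# Top-level sections of a unified conda-lock file, as named in the error above.
CONDA_LOCK_KEYS = {"version", "metadata", "package"}

# An unindented "key:" at the start of a line is treated as a top-level key.
_TOP_LEVEL_KEY = re.compile(r"^([A-Za-z_][\w-]*)\s*:", re.MULTILINE)


def top_level_keys(yaml_text):
    """Return the unindented mapping keys (heuristic, not a YAML parse)."""
    return set(_TOP_LEVEL_KEY.findall(yaml_text))


def looks_like_conda_lock(yaml_text):
    """True if every unified conda-lock section appears at the top level."""
    return CONDA_LOCK_KEYS <= top_level_keys(yaml_text)
```

With a check like this, a file named environment.yml that is really a lock file could be reported clearly up front instead of failing inside conda env.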

Alternative options

If that is too big a feature request, it would be nice to have a way for users to work with a conda-lock lock file from postBuild. At the moment, if you have a postBuild config file that contains

conda env update --file binder/conda-lock.yml --prune

this will again fail with

EnvironmentSectionNotValid: The following sections on '/home/jovyan/binder/environment.yml' are invalid and will be ignored:
 - version
 - metadata
 - package

While micromamba is able to handle a command like

micromamba install --file binder/conda-lock.yml

it seems that conda cannot, so similarly a postBuild file containing

conda install --file binder/conda-lock.yml

will fail with

CondaValueError: could not parse 'version: 1' in: binder/conda-lock.yml

If installing an environment from a conda-lock lock file could be made to work without repo2docker supporting conda-lock directly, then adding instructions on how to work with conda-lock lock files could resolve this as well.

Who would use this feature?

People who want to ensure that a Binder link will run far into the future (so maybe the same people who put things on Zenodo).

How much effort will adding it take?

I'm not sure. I would hope not much, but I haven't taken the time to look at how repo2docker currently supports all the config files it already does.

Who can do this work?

Someone with familiarity with conda-lock.

manics commented 12 months ago

We already have lock files for pinning the base requirements, though these aren't YAML files: https://github.com/jupyterhub/repo2docker/tree/main/repo2docker/buildpacks/conda

Is this a different type of lock file?

matthewfeickert commented 12 months ago

@manics This might be a conda-lock version issue. The conda-lock format was unified in conda-lock v1.0.0 (c.f. https://github.com/conda/conda-lock/pull/124)

https://github.com/conda/conda-lock/blob/425b384ffd010461d9a4f3c61d286e31a21f14f3/README.md?plain=1#L68-L76

By default, conda-lock stores its output in conda-lock.yml in the current working directory. This file will also be used by default for render, install, and update operations. You can supply a different filename with e.g.

conda-lock --lockfile superspecial.conda-lock.yml

It seems though, that yes, the format of what you have is different. Example:

https://github.com/jupyterhub/repo2docker/blob/8c32db99878fa3cd532f2b9ee107cfded058088a/repo2docker/buildpacks/conda/environment.py-3.11-linux-64.lock#L1-L10

compared to something like https://iris-hep.org/analysis-systems-env-nightlies/iris-hep-rc/3.11/conda-lock.yml

# This lock file was generated by conda-lock (https://github.com/conda/conda-lock). DO NOT EDIT!
#
# A "lock file" contains a concrete list of package versions (with checksums) to be installed. Unlike
# e.g. `conda env create`, the resulting environment will not change as new package versions become
# available, unless you explicitly update the lock file.
#
# Install this environment as "YOURENV" with:
#     conda-lock install -n YOURENV --file conda-lock.yml
# To update a single package to the latest version compatible with the version constraints in the source:
#     conda-lock lock  --lockfile conda-lock.yml --update PACKAGE
# To re-solve the entire environment, e.g. after changing a version constraint in the source file:
#     conda-lock -f iris-hep-rc/3.11/environment.yml --lockfile conda-lock.yml
version: 1
metadata:
  content_hash:
    linux-64: e002febb8b04300e80dded8f2b7dabb269ace11a83f98db20719007774f0f52c
  channels:
  - url: conda-forge
    used_env_vars: []
  platforms:
  - linux-64
  sources:
  - iris-hep-rc/3.11/environment.yml
package:
- name: _libgcc_mutex
  version: '0.1'
  manager: conda
  platform: linux-64
  dependencies: {}
  url: https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2
  hash:
    md5: d7c89558ba9fa0495403155b64376d81
    sha256: fe51de6107f9edc7aa4f786a70f4a883943bc9d39b3bb7307c04c41410990726
  category: main
  optional: false
- name: ca-certificates
  version: 2023.7.22
  manager: conda
  platform: linux-64
  dependencies: {}
  url: https://conda.anaconda.org/conda-forge/linux-64/ca-certificates-2023.7.22-hbcca054_0.conda
  hash:
    md5: a73ecd2988327ad4c8f2c331482917f2
    sha256: 525b7b6b5135b952ec1808de84e5eca57c7c7ff144e29ef3e96ae4040ff432c1
  category: main
  optional: false
...

(edit)

Ah yes, here we go:

https://github.com/conda/conda-lock/blob/425b384ffd010461d9a4f3c61d286e31a21f14f3/README.md?plain=1#L57-L64

Pre 1.0 compatible usage (explicit per platform locks)

If you were making use of conda-lock before the 1.0 release that added unified lockfiles you can still get that behaviour by making use of the explicit output kind.

conda-lock --kind explicit -f environment.yml

So it seems that you're using the pre-v1.0 explicit lock file format over the v1.0+ unified lockfile.
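The two formats carry overlapping information: each package entry in the unified file records a url and a hash.md5, and an explicit lock is essentially an @EXPLICIT marker followed by one such URL per line with the md5 after a #. Here is a rough sketch of that relationship, assuming the entries have already been parsed into dicts with the fields shown in the example above. render_explicit and EXAMPLE_PACKAGES are hypothetical names, and this is not conda-lock's implementation; conda-lock's own render operation is the supported route.

```python
# Two entries copied from the unified lock example above (fields trimmed).
EXAMPLE_PACKAGES = [
    {
        "name": "_libgcc_mutex",
        "manager": "conda",
        "platform": "linux-64",
        "url": "https://conda.anaconda.org/conda-forge/linux-64/_libgcc_mutex-0.1-conda_forge.tar.bz2",
        "hash": {"md5": "d7c89558ba9fa0495403155b64376d81"},
    },
    {
        "name": "ca-certificates",
        "manager": "conda",
        "platform": "linux-64",
        "url": "https://conda.anaconda.org/conda-forge/linux-64/ca-certificates-2023.7.22-hbcca054_0.conda",
        "hash": {"md5": "a73ecd2988327ad4c8f2c331482917f2"},
    },
]


def render_explicit(packages, platform="linux-64"):
    """Build explicit-style lock text from unified-lock package entries."""
    lines = ["@EXPLICIT"]
    for pkg in packages:
        # Only conda-managed entries for the requested platform are kept;
        # everything else is skipped in this sketch.
        if pkg.get("manager") != "conda" or pkg.get("platform") != platform:
            continue
        md5 = pkg.get("hash", {}).get("md5")
        lines.append(f"{pkg['url']}#{md5}" if md5 else pkg["url"])
    return "\n".join(lines) + "\n"
```

Calling render_explicit(EXAMPLE_PACKAGES) yields the @EXPLICIT line followed by the two package URLs, each suffixed with its md5.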

bollwyvl commented 11 months ago

Supporting conda-lock outputs would be... nice, but would likely require some guardrails, and blessing some "r2d knows best" conventions.

conda-lock itself hauls in... a lot of dependencies, so may not be a good candidate for the "base coat" environment. micromamba, already present, is certainly up to the task of consuming both formats... though pixi very well might end up "winning" for this use case.

As, ideally, it would replace (not change) the notebook environment, supporting the raw lock (in either format) would ideally be able to preflight before doing a still-expensive download by:

The yml format directly supports dependencies from other package managers, like pip (and even other package manager managers like poetry and pipenv), while the @EXPLICIT format kinda half-supports them, but behind #s, so probably needs to be ignored entirely.
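To make the mixed-manager point concrete, here is a small sketch of a check that could run before any download, assuming the lock entries are already parsed into dicts carrying the manager field seen in the example earlier in the thread. The function names are hypothetical, and the "pip" value in the usage note is an assumption about how non-conda entries are labelled.

```python
from collections import Counter


def count_by_manager(packages):
    """Count lock entries per package manager (the 'manager' field)."""
    return Counter(pkg.get("manager", "unknown") for pkg in packages)


def non_conda_entries(packages):
    """Names of entries that a conda-only installer could not handle."""
    return [pkg["name"] for pkg in packages if pkg.get("manager") != "conda"]
```

For a lock with two conda entries and one made-up entry marked manager: pip, count_by_manager reports 2 for conda and 1 for pip, and non_conda_entries returns just the pip-managed name, which is the sort of signal a guardrail could use to warn or refuse early.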

Thus far, there is no specific naming convention for A Well-Known Conda Lock File in a repo, as a number of "first-party" tools within the conda org don't agree on what the extension should even be:

bollwyvl commented 11 months ago

So, to tighten up the above as a recommendation:

itcarroll commented 2 months ago

Chiming in here with a user experience, leading to a question about the above recommendation. My goals are to keep only my project's dependencies in an environment.yml with minimal pinning, have some lockfile for protection against untested updates, and to not conflict with packages added by the conda buildpack. I do not understand how I can create or use a lockfile that is aware of the package constraints introduced in the conda buildpack. Wouldn't the recommendation, which uses create rather than update, require me to include jupyterhub-singleuser and friends with all the repo2docker constraints? If it were update though, how/where/when could I invoke conda-lock on my environment.yml and repo2docker's environment.yml?

Aside: If not for a few packages that seem to need the notebook kernel to be the same as the environment running jupyterhub-singleuser, I would have used a separate environment for my project's kernel.