Open wolfv opened 2 years ago
This WIP PR https://github.com/conda-incubator/conda-lock/pull/106 has a bunch of handy things that could also go into metadata
This lockfile spec should have a version and ideally a reference to some standard jsonschema representation of the structure.
version: 1
$schema: https://some/url/for/schema_v1.json
Ooh, this is very exciting!!!
My thoughts regarding the lockfile metadata are that I'd like my lockfiles to be self-documenting. For instance, I'd like them to know how they were created, for example with which command. I'd also like to be able to add "comments" as explanation for colleagues. That way by looking at the lockfile, it'll be obvious what it is, where it came from, and how to update it.
I think it's useful to be able to choose which fields to include or exclude. For instance, some people may find it useful to include the timestamp and the username, but others might find the timestamp annoying with git, or might not want to leak their username.
I have also realized that the feature I'd really like is to be able to stick my "non-explicit" dependencies in the metadata so that I can run a command like `conda-lock update environment.lock` and have it rerun the solver and upgrade the lockfile in-place.
I was not bold enough to propose a new format, so I have been working with a header consisting of commented yaml. Obviously a proper YAML format like this would be better.
I have things mostly implemented; I'm just working on a system for versioning the metadata generation process so that it can be extended or modified in the future. I wanted to finish it this weekend but I ended up not having enough time.
Yeah, we could make this format a superset of the existing YAML environment files, so that you could have
name: ...
metadata:
  ...
channels:
  - conda-forge
  - bioconda
dependencies:
  - abc >0.5
  - xyz =1.15
package-lock:
  linux-64:
    - ...
then some micromamba command could automatically update the lockfiles + certain metadata keys.
~I think I already tested at some point whether or not `conda` would accept an extra top-level item in the environment file (like `metadata`), and unfortunately it didn't work. Thus we're unfortunately looking at a breaking change. (But who cares about `conda` anymore? :wink:)~
EDIT: Or was it `mamba` that didn't work??? Sorry, I take it back. I'm not sure anymore...
Hmm, I don't think micromamba would complain about extra keys. With mamba, we're just using `conda` code though, so it might happen, IDK! :)
For a normal environment file it seems to produce a warning.
$ conda env create -n testenv --file=env.yaml
EnvironmentSectionNotValid: The following section on '/env.yaml' is invalid and will be ignored:
- metadata
For an explicit lockfile, I'm getting:
CondaValueError: invalid package specification: metadata: asdf
Yeah, in explicit lockfiles you need to use a comment # metadata: whatever...
@maresb yeah, none of this exists yet, we're designing what we want from a lockfile format for conda.
Here is a summary of what I came up with in my PR:
conda-lock-metadata:
  about: This lockfile was generated by conda-lock to ensure reproducibility.
  comment: |-
    Run the following command to update this project's dependencies.
  command: conda-lock -f environment.yml --metadata=all
  command_with_path: /root/conda/envs/conda-lock/bin/conda-lock -f environment.yml --metadata=all
  conda_lock_version: 0.11.3.dev0+gf2ba8d4.d20210904
  created_by: root
  input_hash: f15a045753a401da73dd7c1693fd031e0ad41c0b4c9ca8545c0a8ab56c21d16c
  platform: win-64
  timestamp: 2021-09-05 23:43:18+02:00
  metadata_version: v1
  dependencies:
    - mamba
    - conda-lock
On the command line, you should specify `--metadata=v1,about,platform,command,dependencies` or similar to select desired fields and specify their order.
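To make the field-selection idea concrete, here is a minimal Python sketch of how a comma-separated `--metadata` value could pick fields and fix their order. The function name `select_metadata`, the treatment of a `vN` token as the metadata version, and the error handling are assumptions for illustration, not what the PR actually implements.

```python
import re


def select_metadata(available: dict, spec: str) -> dict:
    """Return the requested metadata fields in the order they were listed.

    Hypothetical helper: a token such as "v1" is assumed to select the
    metadata version; every other token must name a known field.
    """
    selected = {}
    for token in (t.strip() for t in spec.split(",")):
        if re.fullmatch(r"v\d+", token):
            selected["metadata_version"] = token
        elif token in available:
            selected[token] = available[token]
        else:
            raise ValueError(f"unknown metadata field: {token!r}")
    return selected


if __name__ == "__main__":
    fields = {
        "about": "This lockfile was generated by conda-lock to ensure reproducibility.",
        "platform": "win-64",
        "created_by": "root",
    }
    # Dicts preserve insertion order, so the output order follows the spec.
    print(select_metadata(fields, "v1,platform,about"))
```

Leaving `created_by` or `timestamp` out of the spec simply omits them, which covers the git-noise and privacy concerns raised earlier.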
Thinking through some of the human-consumable parts of this, would we want something like this instead?
metadata:
spec: explicit-1.0
description: ... # optional
channels:
- conda-forge
name: myenv
packages:
  # probably will be alphabetically ordered
  xyz:
    linux-64:
      version: 0.15.0
      resolved: https://conda.anaconda.org/conda-forge/linux-64/xyz-0.15.0-had123.tar.bz2
      sha256: 123123123123123sjadalkjdlkajsk
      signature: ... ? # we need to also have certain metadata to validate signatures, though
    osx-64:
      version: 0.15.0
      resolved: https://conda.anaconda.org/conda-forge/osx-64/xyz-0.15.0-had123.tar.bz2
      sha256: 123123123123123sjadalkjdlkajsk
  pip:
    linux-64:
      version: 1.15.0
      resolved: https://conda.anaconda.org/conda-forge/noarch/pip-1.15.0-had123.tar.bz2
      sha256: 123123123123123sjadalkjdlkajsk
  abc:
    osx-64:
      version: 0.16.0
      resolved: https://conda.anaconda.org/conda-forge/osx-64/abc-0.16.0-had123.tar.bz2
      sha256: 123jk1lk23j1kl2j3kj12k3jlj1lk2
install_order:
  linux-64:
    - xyz
    - pip
  osx-64:
    - xyz
    - abc
Noarch packages will still be repeated per platform since there may be a platform specific variant of something that is usually noarch.
By grouping the packages together we make it easier to review updates to lockfiles as when you relock you can ensure that all the versions you expect to move, move.
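As a rough illustration of how a consumer might read the grouped layout above, the following Python sketch walks `install_order` for one platform and looks each package up under `packages`. The key names follow the proposal; the function name `install_plan` and the sample URLs are hypothetical.

```python
from typing import List, Tuple


def install_plan(lock: dict, platform: str) -> List[Tuple[str, str, str]]:
    """Return (name, url, sha256) tuples in installation order for one platform.

    Assumes the layout sketched in the proposal: packages are grouped by
    name, then by platform, and install_order lists names per platform.
    """
    plan = []
    for name in lock["install_order"][platform]:
        entry = lock["packages"][name][platform]
        plan.append((name, entry["resolved"], entry["sha256"]))
    return plan


if __name__ == "__main__":
    # Placeholder data; the URLs and hashes are made up for the example.
    lock = {
        "packages": {
            "xyz": {
                "linux-64": {"resolved": "https://example.invalid/linux-64/xyz.tar.bz2", "sha256": "aaa"},
                "osx-64": {"resolved": "https://example.invalid/osx-64/xyz.tar.bz2", "sha256": "bbb"},
            },
            "pip": {
                "linux-64": {"resolved": "https://example.invalid/noarch/pip.tar.bz2", "sha256": "ccc"},
            },
        },
        "install_order": {"linux-64": ["xyz", "pip"], "osx-64": ["xyz"]},
    }
    for step in install_plan(lock, "linux-64"):
        print(step)
```

Because every platform's entry for a package sits next to the others, a diff after relocking shows at a glance whether `xyz` moved on all platforms or only on some.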
I would also like to make it a superset of a regular yaml environment file, by the way.
you could also consider blake3 instead of sha256, I think it's much faster (parallel/multithreaded).
I don't think the hashing speed matters so much here. One nice thing about sha256 is that we can directly pull from an OCI registry with it.
> I would also like to make it a superset of a regular yaml environment file, by the way.
I'm -1 on this. Lockfiles are generated by machines. The sources are generated by humans. When both a human and a machine edit the same file you're asking for trouble.
> I'm -1 on this. Lockfiles are generated by machines. The sources are generated by humans. When both a human and a machine edit the same file you're asking for trouble.
@mariusvniekerk I know it's a bit dangerous, and I've been debating this point with myself for a while.
The fundamental problem that I'm trying to solve is as follows:
I'm working on some project, generate a lockfile for it, but then I forget to document how I generated the lockfile. I move on to something else. Some weeks/months later, I return to the project. I'd like to update the lockfile, but I forgot the exact `conda-lock` command that I ran to generate it. So I have to study the documentation and recreate the command I used to create it.
Put in a different way, a lockfile is supposed to guarantee reproducibility of the environment. I think it would be great if the lockfile could also guarantee its own ~reproducibility~ updatability. (For reproducibility I opened https://github.com/mamba-org/mamba/issues/1214).
Oh, I'm 100% for stuffing as much metadata into the lockfile as possible for reproducibility, but I do not want the ability to accidentally use a lockfile for something it's not for.
Basically every language community around that has lockfiles as a core concept makes the output of the locking process as a separate file with its own dedicated format (cargo, go.mod, yarn, etc)
I'd like to put the environment file into the lockfile's metadata. As soon as this has been done, it becomes extremely tempting to edit that and/or use that copy of the environment file as a new basis for generating the lockfile.
Where do we draw the line? Do we say that we can include a copy of the environment file, but we refuse to acknowledge that copy as machine-readable?
All for dumping the source files into the metadata for the lock. It can even be machine readable. But once you have that be editable by a human bad things will happen.
I'm not sure I understand your -1 then...
Let's say we define our new lockfile format which includes the source `environment.yaml` file. Then on the conda-lock side we implement `conda-lock update environment.lock`.
Unless we somehow provide some deterrent, people will then naturally delete the original `environment.yaml` file and edit the dependencies from `environment.lock`. (It's a natural thing to do, especially to maintain a single source of truth.)
You say that bad things will now happen. What specifically, data corruption? How can we prevent/discourage those bad things?
What happens is that users will just edit the user-editable part of the lock file and not update it. At that point the lock is entirely a lie.
Thanks! Now I understand.
One potential mitigation for this problem could be to include a checksum based on the dependencies from which the lock was generated. Any program which installs a lockfile should verify this checksum. In case it doesn't match, scream "These dependencies are a lie!!!" and refuse to do anything until the lockfile is updated.
This would require the cooperation of any program which can install a lockfile. The programs I'm aware of are Conda, Mamba, Micromamba, and conda-lock. Among the two of you, we have pretty good coverage in here! :rofl:
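The checksum mitigation could look roughly like the Python sketch below: hash the embedded human-editable inputs at lock time, store the digest, and have every installer recompute it before doing anything. The key names (`channels`, `dependencies`, `metadata.input_hash`) and the canonical-JSON hashing scheme are assumptions for illustration; no existing tool is claimed to compute the hash this way.

```python
import hashlib
import json


def dependencies_checksum(channels: list, dependencies: list) -> str:
    """Hash the human-editable inputs in a canonical JSON form (assumed scheme)."""
    payload = json.dumps(
        {"channels": channels, "dependencies": dependencies},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def verify_lockfile(lock: dict) -> None:
    """Refuse to proceed if the embedded specs no longer match the stored hash."""
    expected = lock["metadata"]["input_hash"]
    actual = dependencies_checksum(lock["channels"], lock["dependencies"])
    if actual != expected:
        raise RuntimeError(
            "These dependencies are a lie! Update the lockfile before installing."
        )


if __name__ == "__main__":
    channels = ["conda-forge"]
    deps = ["abc >0.5", "xyz =1.15"]
    lock = {
        "channels": channels,
        "dependencies": deps,
        "metadata": {"input_hash": dependencies_checksum(channels, deps)},
    }
    verify_lockfile(lock)  # passes: inputs match the stored hash
    lock["dependencies"].append("sneaky-edit")
    try:
        verify_lockfile(lock)
    except RuntimeError as err:
        print(err)
```

A check like this only detects edits; it does not stop anyone from also editing the stored hash by hand, so it discourages the mistake rather than preventing tampering.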
I've just started to work on a `cmake-micromamba` extension that will allow CMake users to directly call micromamba to create an environment -- and I realized that a lockfile will be quite useful for this! :)
I found my way here via a hint from @wolfv at PackagingCon, and would also like to see a richer, structured, multi-platform lock file. Here are some more fields that would be useful to include for each package, mostly to support extensions to `conda-lock`.
To support optional subsets of packages that need to be mutually compatible, but that you may not want to install in some contexts (e.g. installing dev dependencies in CI, but only required dependencies in production):
optional: bool = False
category: str = "main"
(This is mostly relevant in the context of requirements parsed out of a `pyproject.toml`)
To support `pip` interoperability:
manager: Literal["conda", "pip"] = "conda"
(In pip mode, the url would point to a wheel or sdist rather than a conda package)
I think the format should be human-readable. To me, one of the key requirements for the lock file format should be that it's diffable. Lock files tend to become difficult to grasp, but resolving conflicts on them should still be possible. Cargo went through a similar process; I think we can learn from that!
That's a really good point, thanks for pointing to the Cargo discussion!
FYI, here's an example of (and model for) what we settled on for `conda-lock` after some back-and-forth with @mariusvniekerk and @wolfv. After skimming the Cargo thread, I think we've addressed most of the points raised there, namely:
Are there any other considerations we missed?
@jvansanten @mariusvniekerk one small nitpick I have would be that maybe instead of `hash` it could be `md5` OR `sha256` (or both) as keys.
Or alternatively it could be `hash: md5-xyz` or `hash: sha256-xyz` or some similar format.
Or
hash:
  sha256: xyz
  md5: abc
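For comparison, a reader could accept both the prefixed-string form and the nested-mapping form with very little code. This is only a sketch of the parsing side; the function name `normalize_hash` and the choice to lowercase algorithm names are assumptions.

```python
from typing import Dict, Union


def normalize_hash(value: Union[str, Dict[str, str]]) -> Dict[str, str]:
    """Normalize either 'sha256-xyz' or {'sha256': 'xyz'} to {algorithm: digest}."""
    if isinstance(value, dict):
        return {algo.lower(): digest for algo, digest in value.items()}
    # Split on the first hyphen only: the part before it names the algorithm.
    algo, sep, digest = value.partition("-")
    if not sep or not algo or not digest:
        raise ValueError(f"expected '<algorithm>-<digest>', got {value!r}")
    return {algo.lower(): digest}


if __name__ == "__main__":
    print(normalize_hash("sha256-xyz"))
    print(normalize_hash({"sha256": "xyz", "md5": "abc"}))
```

The nested form is the only one of the two that can carry several digests for the same package without inventing a separator.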
@jvansanten Yes thanks!
Quick side question: Does `conda-lock` also support minimal updates?
In order to make it truly human-readable, I'm of the opinion that the metadata should include a small amount of text to briefly explain what the lockfile is for, and also how to update it (possibly with a dynamically-generated update command).
What I have in mind is that people might encounter the lockfile who are not devops-savvy. They might not even understand Conda environments. I'd like to be able to reach such people.
For there to be any hope of my work colleagues adopting this, it needs to be extremely easy-to-use.
Maybe this is weird, but I'm also somewhat interested in reproducibility of the lockfile itself. The solution generated in the lockfile depends on the solver used, and also the input to the solver. I think it would be nice to include the relevant version numbers for the particular solver used...
As for the input to the solver, it is roughly the knowledge of available packages at a given time, possibly filtered by some sort of trust policy (which doesn't currently exist). Thus I'm also interested in including the time of solve.
Unfortunately the solver timestamp itself is not so reproducible. I have a few ideas for mitigating this...
@baszalmstra do you mean this from conda-incubator/conda-lock#131?
# To update a single package to the latest version compatible with the version constraints in the source:
# conda-lock lock --lockfile conda-lock.yml --update PACKAGE
Actually, due to my own ignorance I don't understand what exactly this means... Should it install the latest possible version of `PACKAGE` which is compatible with the constraints in the source, while upgrading only those dependencies which are incompatible with the new version? (Alternatively you could for example try to do a fresh install of `PACKAGE` in a new environment and then try to merge the existing environment with the new one.)
Haha, yes that was my question too. How does yarn, npm or cargo solve this?
> Actually, due to my own ignorance I don't understand what exactly this means... Should it install the latest possible version of `PACKAGE` which is compatible with the constraints in the source, while upgrading only those dependencies which are incompatible with the new version?
This is what it does, yes. There's an example in the README, but if you have suggestions for how to explain this behavior more clearly, I'm all ears.
Ok, very interesting! As for the README, I explained some thoughts in #129. In general, I hope that I can turn my ignorance into high-quality documentation. :) But before I can help to write anything sensible, I feel like I need to understand better what's going on.
I actually have a followup question since my previous question was very imprecise. Namely, I phrased it as if `PACKAGE` had no subdependencies. Can you explain briefly what sort of solving algorithm is used? (What I'm trying to get at is that since dependencies can be deeply interconnected, how do you restrict the solver from simply upgrading all packages?)
For instance, would updating a package be equivalent to iterating on the `PACKAGE` versions `ver` from latest to current, and then running `conda | mamba install PACKAGE=ver` until it succeeds? (BTW, is there such a `conda` command for upgrading to the latest installable version of a single package?) I realize now that I don't even know what algorithms conda/mamba use for this... :man_facepalming: (I suppose I never had the opportunity to ask before now.)
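The naive strategy asked about here can be written down as a short Python sketch. It is only a model of the question, not a description of what conda, mamba, or conda-lock actually do; `can_install` is a hypothetical callback standing in for "try a solve with `PACKAGE=ver` pinned and report whether it succeeds".

```python
from typing import Callable, Sequence


def newest_installable(
    candidates: Sequence[str],
    current: str,
    can_install: Callable[[str], bool],
) -> str:
    """Walk versions from newest to the current one; return the first that solves.

    `candidates` must already be sorted newest first. If nothing newer than
    `current` can be installed, the current version is kept.
    """
    for ver in candidates:
        if ver == current:
            break  # reached the locked version: nothing newer worked
        if can_install(ver):
            return ver
    return current


if __name__ == "__main__":
    versions = ["3.0", "2.5", "2.0", "1.0"]
    # Pretend only 2.5 and 1.0 are compatible with the rest of the environment.
    print(newest_installable(versions, "2.0", lambda v: v in {"2.5", "1.0"}))
```

Each probe here is a full solve, so the cost grows with the number of newer versions, and the sketch says nothing about which other packages the solver is allowed to move while probing.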
@jvansanten is the idea of the `manager: ...` key to memorize which package manager installed what package?
One important piece of information to record would be the "requested specs". conda does this implicitly in the `$PREFIX/conda-meta/history` file.
However, I am not sure if we have this knowledge for pip.
Requested specs would be nice so that we can "prune" the environment later on (even for other package managers) -- e.g. remove not-anymore-requested specs (as asked here: https://github.com/mamba-org/mamba/issues/1333)
> For instance, would updating a package be equivalent to iterating on the `PACKAGE` versions `ver` from latest to current, and then running `conda | mamba install PACKAGE=ver` until it succeeds?
@maresb @jvansanten
From this comment in the source code, what I've gleaned is that:
What I still don't understand is what happens when other packages are not compatible with the newly updated package. I see a few things that could be done:
I don't think there's an obvious best way to do this. It's a hard problem that has definitely been grappled with before, and there are some pretty convoluted solutions. For example, Ruby's package manager Bundler updates to the latest if the spec isn't compatible OR there are no transitive dependencies.
I see `--update` as the key feature of conda-lock. If the updating algorithm was good, easy enough to understand, and clearly stated in the documentation, it would be a huge win.
> Requested specs would be nice so that we can "prune" the environment later on
My understanding of @maresb's request for including the source `environment.yaml` is for exactly this reason. The lockfile will include all transitive dependencies which the author of the `environment.yaml` might not care about.
The original `environment.yaml` is the requested spec and the lockfile is the fulfilment of that spec at a given point in time. If a user wants to update an environment created from a lockfile they will always carry around all the transitive dependencies, even if they're no longer required by the updated requested dependencies.
If the user was able to update the lockfile from the original requested spec (`environment.yaml`), they could prune no-longer-required transitive dependencies from their environment. Including the source `environment.yaml` in the lockfile metadata ensures the lockfile can be updated/recreated from the original spec.
You could then also allow adding new packages to the requested deps or specifying other constraints when updating the lockfile - e.g.
conda-lock update --file=env.lock 'pandas>=1.4' mynewpackage
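The pruning idea amounts to a reachability check: keep whatever the requested specs still reach through their dependencies, and treat everything else as removable. A minimal Python sketch, assuming the lockfile (or the solver) can supply a mapping from each package name to its direct dependency names; the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def still_needed(requested: Iterable[str], depends: Dict[str, List[str]]) -> Set[str]:
    """Return every package reachable from the requested specs."""
    needed: Set[str] = set()
    stack = list(requested)
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        needed.add(name)
        stack.extend(depends.get(name, []))
    return needed


def prunable(installed: Iterable[str], requested: Iterable[str],
             depends: Dict[str, List[str]]) -> Set[str]:
    """Installed packages that no requested spec depends on any more."""
    return set(installed) - still_needed(requested, depends)


if __name__ == "__main__":
    depends = {"pandas": ["numpy", "python"], "numpy": ["python"], "oldtool": ["libold"]}
    installed = ["pandas", "numpy", "python", "oldtool", "libold"]
    # "oldtool" was dropped from the requested specs, so it and libold can go.
    print(sorted(prunable(installed, ["pandas"], depends)))
```

None of this is possible from an explicit lockfile alone, which is the argument for recording the requested specs alongside the resolved packages.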
What is the status of this feature? https://github.com/mamba-org/mamba/pull/1577 seems to have implemented them but my attempts to use
micromamba install -f conda-lock.yml -n testing2
result in
__
__ ______ ___ ____ _____ ___ / /_ ____ _
/ / / / __ `__ \/ __ `/ __ `__ \/ __ \/ __ `/
/ /_/ / / / / / / /_/ / / / / / / /_/ / /_/ /
/ .___/_/ /_/ /_/\__,_/_/ /_/ /_/_.___/\__,_/
/_/
warning libmamba 'root_prefix' set with default value: /Users/samanthahughes/micromamba
Transaction
Prefix: /Users/samanthahughes/micromamba/envs/testing2
Nothing to do
Transaction starting
Transaction finished
Micromamba 0.25
@shughes-uk I think the file has to end with .lock
> @shughes-uk I think the file has to end with `.lock`
Mamba hits me with
EnvironmentFileExtensionNotValid: '/Users/samanthahughes/programming/cloud/conda-lock.lock' file extension must be one of '.txt', '.yaml' or '.yml'
micromamba hits me with
__
__ ______ ___ ____ _____ ___ / /_ ____ _
/ / / / __ `__ \/ __ `/ __ `__ \/ __ \/ __ `/
/ /_/ / / / / / / /_/ / / / / / / /_/ / /_/ /
/ .___/_/ /_/ /_/\__,_/_/ /_/ /_/_.___/\__,_/
/_/
warning libmamba 'root_prefix' set with default value: /Users/samanthahughes/micromamba
critical libmamba Invalid spec, no package name found:
Here's the lockfile (renamed conda-lock.txt from conda-lock.lock so github would let me upload it)
Should've taken a proper look before -- `...-lock.yml` or `...-lock.yaml` is the magic ending.
I was using `conda-lock.yml` in my first attempt. Tried `.yaml` for good luck but still no joy.
What platform are you on? E.g. I tried your lockfile on an M1 Mac, and nothing got done (because there are no packages in the lockfile for this platform).
Ahhh that would do it. Thank you!! Perhaps a nice fix here would have the lack of a relevant platform section be an error code instead?
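The suggested behaviour, failing loudly when the lockfile has nothing for the current platform instead of reporting "Nothing to do", could be sketched in Python as follows. The layout assumed here (a top-level `package` list whose entries carry a `platform` field) and the exception name are illustrative assumptions, not micromamba's actual implementation.

```python
from typing import List


class MissingPlatformError(Exception):
    """Raised when a lockfile contains no packages for the requested platform."""


def packages_for_platform(lock: dict, platform: str) -> List[dict]:
    """Return the locked packages for one platform, or raise if there are none."""
    selected = [p for p in lock.get("package", []) if p.get("platform") == platform]
    if not selected:
        available = sorted({p.get("platform", "?") for p in lock.get("package", [])})
        raise MissingPlatformError(
            f"lockfile has no packages for {platform}; available: {available}"
        )
    return selected


if __name__ == "__main__":
    lock = {"package": [{"name": "xyz", "platform": "osx-64"}]}
    try:
        packages_for_platform(lock, "osx-arm64")
    except MissingPlatformError as err:
        print(err)
```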
I've been trying to make serious use of the new lock file format lately. I've encountered various issues with Micromamba's implementation, all of which I've reported.
I think I've also discovered a logical oversight in the specification itself, namely with the category field:
category: str = "main"
Problem: This construction requires that each package belongs to a single category, but packages should be able to belong to multiple categories. (For instance, I want `pip` to be both a `main` and `dev` requirement.)
Suggestion: Convert this to
categories: list[str] = ["main"]
Background: As a refresher, we can have multiple environment files, for instance `environment-main.yml` and `environment-dev.yml`. In `environment-dev.yml`, I can add a top-level `category: dev` entry. Now when I run `conda-lock -f environment-main.yml -f environment-dev.yml`, each resulting dev package entry in `conda-lock.yml` will inherit `category: dev`.
Let's suppose I'm developing a containerized app. I have a devcontainer where I install both `main` and `dev` dependencies, and I have a production container where I install only the `main` dependencies. In order to be able to run `pip install` for setting up the production container, I need Pip to be installed. However, if I list `pip` in `environment-dev.yml`, it acquires `category: dev`, and thus it will not be installed in my production container.
As a workaround, I have to remove `pip` from my `environment-dev.yml`. But I think I should be able to leave `pip` in both environment files.
Question: Does my suggestion make sense, or am I somehow thinking about this in the wrong way?
Thanks!
One way to solve for dev-only, prod-only, and both-dev-and-prod would be to have 3 different categories. But that will require 2^n categories, generally speaking. I agree that it would be nice to be able to attach multiple categories.
~We should probably also include a field for the schema version so that we can recognize when the lockfile's consumer needs to be updated.~
I also agree that multiple categories would be handy. All the dependency solvers I'm aware of demand that the total solution is self-consistent, i.e. if you install all packages, you should get no conflicts. The same should be true of any subset, and there's no reason those subsets need to be disjoint, other than the fact that some solvers like poetry treat them that way.
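With a `categories` list per package, selecting what to install becomes a set-intersection test, and overlapping subsets come for free. A minimal Python sketch, assuming each package entry is a mapping with a `name` and an optional `categories` list that defaults to `["main"]` as in the suggestion above; the function name is hypothetical.

```python
from typing import Iterable, List


def select_packages(packages: List[dict], wanted: Iterable[str]) -> List[str]:
    """Return names of packages belonging to at least one wanted category."""
    wanted_set = set(wanted)
    return [
        pkg["name"]
        for pkg in packages
        if wanted_set & set(pkg.get("categories", ["main"]))
    ]


if __name__ == "__main__":
    packages = [
        {"name": "numpy"},  # no categories given: defaults to main
        {"name": "pip", "categories": ["main", "dev"]},
        {"name": "pytest", "categories": ["dev"]},
    ]
    print(select_packages(packages, ["main"]))         # production container
    print(select_packages(packages, ["main", "dev"]))  # devcontainer
```

This avoids the 2^n explosion of combined categories: `pip` is listed once and simply belongs to both subsets.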
It would be great to have a new lockfile format. The current conda lockfile format (explicit env format) has quite a few shortcomings: it's a weird ad-hoc format and only supports MD5 sums (and not even by default, I think; SHA256 would be much better). The command to export an explicit environment in conda is
conda list --explicit [--md5]
Micromamba already improves on this by changing the command to
micromamba env export --explicit [--no-md5]
(i.e. it uses the `env` subcommand and defaults to adding `--md5` hashes).
I am thinking it would be nice to replace this with a proper YAML based format.
I am proposing something like:
The explicit packages would contain a list per (supported) subdir. The list would be the full env resolution (including noarch packages) and in the correct order for installation (as in current lockfiles).