New lock file format

wolfv commented 2 years ago

It would be great to have a new lockfile format. The current conda lockfile format (the explicit env format) has quite a few shortcomings: it's a weird ad-hoc format and it only supports MD5 sums (and not even by default, I think; SHA256 is much better). The command to export an explicit environment in conda is conda list --explicit [--md5]

Micromamba already improves on this by changing the command to micromamba env export --explicit [--no-md5] (i.e. it uses the env subcommand and adds MD5 hashes by default).

# This file may be used to create an environment using:
# $ conda create --name <env> --file <this file>
# platform: osx-arm64
@EXPLICIT
https://conda.anaconda.org/conda-forge/osx-arm64/argp-standalone-1.3-h3422bc3_0.tar.bz2#b744f29f1ef63fcedcb63b45d7ceed4a
https://conda.anaconda.org/conda-forge/osx-arm64/git-lfs-2.13.3-hce30654_0.tar.bz2#d0a6dda324b5d970c296dca838965193
...
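For readers unfamiliar with the explicit format shown above, here is a minimal Python sketch of how such a file could be read. It is not conda's or micromamba's parser. The names `ExplicitEntry` and `parse_explicit` are made up for illustration, and the rule that package lines must follow the `@EXPLICIT` marker is an assumption of this sketch.

```python
from typing import NamedTuple, Optional


class ExplicitEntry(NamedTuple):
    url: str
    md5: Optional[str]  # None when the file was exported without hashes


def parse_explicit(text: str) -> list[ExplicitEntry]:
    """Parse the text of an @EXPLICIT file into (url, md5) entries."""
    entries = []
    seen_marker = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no packages
        if line == "@EXPLICIT":
            seen_marker = True
            continue
        if not seen_marker:
            # Assumption of this sketch: packages only appear after the marker.
            raise ValueError("package line found before @EXPLICIT marker")
        # The optional hash is appended to the URL after a '#'.
        url, _, digest = line.partition("#")
        entries.append(ExplicitEntry(url, digest or None))
    return entries
```

Note that the hash algorithm is implied by position only; nothing in the line says "this is MD5", which is one of the shortcomings a structured format would address.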

I am thinking it would be nice to replace this with a proper YAML-based format.

I am proposing something like:

metadata:
  spec: explicit-1.0
  description: ... # optional

name: myenv
# channels: ?
explicit-packages:
  linux-64:
  - name: xyz
    version: 0.15.0
    resolved: https://conda.anaconda.org/conda-forge/linux-64/xyz-0.15.0-had123.tar.bz2
    sha256: 123123123123123sjadalkjdlkajsk
    signature: ... ? # we need to also have certain metadata to validate signatures, though
  - name: pip
    version: 1.15.0
    resolved: https://conda.anaconda.org/conda-forge/noarch/pip-1.15.0-had123.tar.bz2
    sha256: 123123123123123sjadalkjdlkajsk
  osx-64:
  - name: xyz
    version: 0.15.0
    resolved: https://conda.anaconda.org/conda-forge/osx-64/xyz-0.15.0-had123.tar.bz2
    sha256: 123123123123123sjadalkjdlkajsk
  - name: abc
    version: 0.16.0
    resolved: https://conda.anaconda.org/conda-forge/osx-64/abc-0.16.0-had123.tar.bz2
    sha256: 123jk1lk23j1kl2j3kj12k3jlj1lk2

The explicit packages would contain a list per (supported) subdir. The list would be the full env resolution (including noarch packages) and in the correct order for installation (as in current lockfiles).
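A rough Python sketch of how a consumer might use the proposed structure: pick the ordered list for one subdir, then check each downloaded archive against its `sha256` field. The function names are hypothetical, and the lockfile is assumed to be already loaded into a plain dict (for example by a YAML parser).

```python
import hashlib


def packages_for(lock: dict, subdir: str) -> list[dict]:
    """Return the ordered package list for one subdir, or [] if it is absent."""
    return lock.get("explicit-packages", {}).get(subdir, [])


def sha256_matches(payload: bytes, expected_hex: str) -> bool:
    """Return True if the payload hashes to the digest recorded in the lockfile."""
    return hashlib.sha256(payload).hexdigest() == expected_hex.lower()
```

Because the list is already in installation order, the consumer only iterates over it; no solver is involved at install time.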

mariusvniekerk commented 2 years ago

This WIP PR https://github.com/conda-incubator/conda-lock/pull/106 has a bunch of handy things that could also go into metadata

mariusvniekerk commented 2 years ago

This lockfile spec should have a version and ideally a reference to some standard jsonschema representation of the structure.

version: 1
$schema: https://some/url/for/schema_v1.json

maresb commented 2 years ago

Ooh, this is very exciting!!!

My thoughts regarding the lockfile metadata are that I'd like my lockfiles to be self-documenting. For instance, I'd like them to know how they were created, for example with which command. I'd also like to be able to add "comments" as explanation for colleagues. That way by looking at the lockfile, it'll be obvious what it is, where it came from, and how to update it.

I think it's useful to be able to choose which fields to include or exclude. For instance, some people may find it useful to include the timestamp and the username, but others might find the timestamp annoying with git, or might not want to leak their username.

I have also realized that the feature I'd really like is to be able to stick my "non-explicit" dependencies in the metadata so that I can run a command like conda-lock update environment.lock and have it rerun the solver and upgrade the lockfile in-place.

I was not bold enough to propose a new format, so I have been working with a header consisting of commented yaml. Obviously a proper YAML format like this would be better.

I have things mostly implemented; I'm just working on a system for versioning the metadata generation process so that it can be extended or modified in the future. I wanted to finish it this weekend but I ended up not having enough time.

wolfv commented 2 years ago

Yeah, we could make this format a superset of the existing YAML environment files, so that you could have

name: ...
metadata:
...
channels:
- conda-forge
- bioconda
dependencies:
- abc >0.5
- xyz =1.15
package-lock:
  linux-64:
     - ...

then some micromamba command could automatically update the lockfiles + certain metadata keys.

maresb commented 2 years ago

~I think I already tested at some point whether or not conda would accept an extra top-level item in the environment file (like metadata), and unfortunately it didn't work. Thus we're unfortunately looking at a breaking change. (But who cares about conda anymore? :wink:)~

EDIT: Or was it mamba that didn't work??? Sorry, I take it back. I'm not sure anymore...

wolfv commented 2 years ago

hmm, I don't think micromamba would complain about extra keys. With mamba, we're just using conda code though, so it might happen, IDK! :)

maresb commented 2 years ago

For a normal environment file it seems to produce a warning.

$ conda env create -n testenv --file=env.yaml

EnvironmentSectionNotValid: The following section on '/env.yaml' is invalid and will be ignored:
 - metadata

For an explicit lockfile, I'm getting:

CondaValueError: invalid package specification: metadata: asdf

wolfv commented 2 years ago

Yeah, in explicit lockfiles you need to use a comment # metadata: whatever...

mariusvniekerk commented 2 years ago

@maresb yeah, none of this exists yet, we're designing what we want from a lockfile format for conda.

maresb commented 2 years ago

Here is a summary of what I came up with in my PR:

conda-lock-metadata:
  about: This lockfile was generated by conda-lock to ensure reproducibility.
  comment: |-
    Run the following command to update this project's dependencies.
  command: conda-lock -f environment.yml --metadata=all
  command_with_path: /root/conda/envs/conda-lock/bin/conda-lock -f environment.yml --metadata=all
  conda_lock_version: 0.11.3.dev0+gf2ba8d4.d20210904
  created_by: root
  input_hash: f15a045753a401da73dd7c1693fd031e0ad41c0b4c9ca8545c0a8ab56c21d16c
  platform: win-64
  timestamp: 2021-09-05 23:43:18+02:00
  metadata_version: v1
  dependencies:
    - mamba
    - conda-lock

On the command line, you should specify --metadata=v1,about,platform,command,dependencies or similar to select desired fields and specify their order.
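As an illustration of the field selection described above, here is a small Python sketch. It is not the code from the PR: `select_metadata` is a hypothetical name, and treating a token such as `v1` as the metadata version (rather than as a field name) is a guess about the intended semantics.

```python
import re


def select_metadata(all_fields: dict, spec: str) -> dict:
    """Keep only the fields named in a comma-separated spec, in that order."""
    selected = {}
    for token in (t.strip() for t in spec.split(",")):
        if not token:
            continue
        if re.fullmatch(r"v\d+", token):
            # Guess: a token like "v1" selects the metadata version.
            selected["metadata_version"] = token
        elif token in all_fields:
            selected[token] = all_fields[token]
        else:
            raise KeyError(f"unknown metadata field: {token}")
    return selected
```

Since Python dicts preserve insertion order, the order given on the command line carries through to the serialized header, and fields such as `created_by` or `timestamp` are simply left out when not requested.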

mariusvniekerk commented 2 years ago

Thinking through some of the human-consumable parts of this, would we want something like this instead?

metadata:
  spec: explicit-1.0
  description: ... # optional
  channels:
    - conda-forge

name: myenv
packages:
  # probably will be alphabetically ordered
  xyz:
    linux-64:
      version: 0.15.0
      resolved: https://conda.anaconda.org/conda-forge/linux-64/xyz-0.15.0-had123.tar.bz2
      sha256: 123123123123123sjadalkjdlkajsk
      signature: ... ? # we need to also have certain metadata to validate signatures, though
    osx-64:
      version: 0.15.0
      resolved: https://conda.anaconda.org/conda-forge/osx-64/xyz-0.15.0-had123.tar.bz2
      sha256: 123123123123123sjadalkjdlkajsk
  pip:
    linux-64:
      version: 1.15.0
      resolved: https://conda.anaconda.org/conda-forge/noarch/pip-1.15.0-had123.tar.bz2
      sha256: 123123123123123sjadalkjdlkajsk
  abc:
    osx-64:
      version: 0.16.0
      resolved: https://conda.anaconda.org/conda-forge/osx-64/abc-0.16.0-had123.tar.bz2
      sha256: 123jk1lk23j1kl2j3kj12k3jlj1lk2
install_order:
  linux-64:
    - xyz
    - pip
  osx-64:
    - xyz
    - abc

Noarch packages will still be repeated per platform since there may be a platform specific variant of something that is usually noarch.

By grouping the packages together we make it easier to review updates to lockfiles as when you relock you can ensure that all the versions you expect to move, move.
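To make the package-grouped layout concrete, here is a hedged Python sketch of how an installer could rebuild the per-platform install list from `packages` plus `install_order`. The function name `install_plan` is made up, and the lockfile is assumed to be loaded into a dict already.

```python
def install_plan(lock: dict, platform: str) -> list[dict]:
    """Return package entries for one platform, in the recorded install order."""
    plan = []
    for name in lock["install_order"].get(platform, []):
        entry = lock["packages"][name].get(platform)
        if entry is None:
            raise KeyError(f"{name} is in install_order but has no {platform} entry")
        plan.append({"name": name, **entry})
    return plan
```

The trade-off is visible here: grouping by package name makes diffs easier to review, but the installer needs the separate `install_order` section to recover the ordering that the per-subdir lists encoded implicitly.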

wolfv commented 2 years ago

I would also like to make it a superset of a regular yaml environment file, by the way.

maartenbreddels commented 2 years ago

you could also consider blake3 instead of sha256, I think it's much faster (parallel/multithreaded).

wolfv commented 2 years ago

I don't think the hashing speed matters so much here. One nice thing about sha256 is that we can directly pull from a OCI registry with it.

mariusvniekerk commented 2 years ago

I would also like to make it a superset of a regular yaml environment file, by the way.

I'm -1 on this. Lockfiles are generated by machines. The sources are generated by humans. When both a human and a machine edit the same file you're asking for trouble.

maresb commented 2 years ago

I'm -1 on this. Lockfiles are generated by machines. The sources are generated by humans. When both a human and a machine edit the same file you're asking for trouble.

@mariusvniekerk I know it's a bit dangerous, and I've been debating this point with myself for a while.

I feel convinced that there could be a substantial benefit to having everything in the same file. The benefit is that it's more intuitive to have everything in one place.
On the other hand, I think the confusion of having separate files could be mostly mitigated by explanatory comments and metadata. (That was the motivation behind my PR.)

The fundamental problem that I'm trying to solve is as follows:

I'm working on some project, generate a lockfile for it, but then I forget to document how I generated the lockfile. I move on to something else. Some weeks/months later, I return to the project. I'd like to update the lockfile, but I forgot the exact conda-lock command that I ran to generate it. So I have to study the documentation and recreate the command I used to create it.

Put in a different way, a lockfile is supposed to guarantee reproducibility of the environment. I think it would be great if the lockfile could also guarantee its own ~~reproducibility~~ updatability. (For reproducibility I opened https://github.com/mamba-org/mamba/issues/1214).

mariusvniekerk commented 2 years ago

Oh, I'm 100% for stuffing as much metadata into the lockfile as possible for reproducibility, but I do not want the ability to accidentally use a lockfile for something it's not for.

Basically every language community that has lockfiles as a core concept makes the output of the locking process a separate file with its own dedicated format (cargo, go.mod, yarn, etc.)

maresb commented 2 years ago

I'd like to put the environment file into the lockfile's metadata. As soon as this has been done, it becomes extremely tempting to edit that and/or use that copy of the environment file as a new basis for generating the lockfile.

Where do we draw the line? Do we say that we can include a copy of the environment file, but we refuse to acknowledge that copy as machine-readable?

mariusvniekerk commented 2 years ago

All for dumping the source files into the metadata for the lock. It can even be machine readable. But once you have that be editable by a human bad things will happen.

maresb commented 2 years ago

I'm not sure I understand your -1 then...

Let's say we define our new lockfile format which includes the source environment.yaml file. Then on the conda-lock side we implement conda-lock update environment.lock.

Unless we somehow provide some deterrent, people will then naturally delete the original environment.yaml file and edit the dependencies from environment.lock. (It's a natural thing to do, especially to maintain a single source of truth.)

You say that bad things will now happen. What specifically, data corruption? How can we prevent/discourage those bad things?

mariusvniekerk commented 2 years ago

What happens is that users will just edit the user-editable part of the lock file and not update it. At that point the lock is entirely a lie.

maresb commented 2 years ago

Thanks! Now I understand.

One potential mitigation for this problem could be to include a checksum based on the dependencies from which the lock was generated. Any program which installs a lockfile should verify this checksum. In case it doesn't match, scream "These dependencies are a lie!!!" and refuse to do anything until the lockfile is updated.
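A minimal Python sketch of that mitigation, assuming the superset layout discussed earlier (top-level `channels` and `dependencies`, with the checksum stored under `metadata`). The key name `input_hash` is borrowed from the metadata example above, but the way it is computed here is an assumption of this sketch, not how conda-lock computes it.

```python
import hashlib
import json


def dependency_checksum(channels: list, dependencies: list) -> str:
    """Hash the human-edited inputs in a canonical form."""
    canonical = json.dumps(
        # Channel order is kept (it may encode priority); specs are sorted.
        {"channels": list(channels), "dependencies": sorted(dependencies)},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def check_lock_is_current(lock: dict) -> None:
    """Refuse to proceed if the dependencies were edited without relocking."""
    expected = lock["metadata"]["input_hash"]
    actual = dependency_checksum(lock["channels"], lock["dependencies"])
    if actual != expected:
        raise RuntimeError(
            "These dependencies are a lie! Update the lockfile before installing."
        )
```

An installer would call `check_lock_is_current` before touching the environment, so a hand-edited `dependencies` section is caught at install time rather than silently ignored.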

This would require the cooperation of any program which can install a lockfile. The programs I'm aware of are Conda, Mamba, Micromamba, and conda-lock. Among the two of you, we have pretty good coverage in here! :rofl:

wolfv commented 2 years ago

I've just started to work on a cmake-micromamba extension that will allow CMake users to directly call micromamba to create an environment -- and I realized that a lockfile will be quite useful for this! :)

jvansanten commented 2 years ago

I found my way here via a hint from @wolfv at PackagingCon, and would also like to see a richer, structured, multi-platform lock file. Here are some more fields that would be useful to include for each package, mostly to support extensions to conda-lock.

To support optional subsets of packages that need to be mutually compatible, but that you may not want to install in some contexts (e.g. installing dev dependencies in CI, but only required dependencies in production):

optional: bool = False
category: str = "main"

(This is mostly relevant in the context of requirements parsed out of a pyproject.toml)

To support pip interoperability:

manager: Literal["conda", "pip"] = "conda"

(In pip mode, the url would point to a wheel or sdist rather than a conda package)
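Put together, the proposed per-package fields could look like the following Python sketch. The class name `LockedPackage`, the helper `select_for_install`, and the `name`/`version`/`url`/`platform` fields are illustrative assumptions; only `optional`, `category`, and `manager` with their defaults come from the comment above.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class LockedPackage:
    name: str
    version: str
    url: str  # conda package for manager="conda", wheel or sdist for "pip"
    platform: str
    manager: Literal["conda", "pip"] = "conda"
    optional: bool = False
    category: str = "main"


def select_for_install(packages, categories=frozenset({"main"})):
    """Keep only the packages whose category was requested."""
    return [p for p in packages if p.category in categories]
```

For example, CI could call `select_for_install(packages, {"main", "dev"})` while production uses the default and gets only the `main` packages.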

baszalmstra commented 2 years ago

I think the format should be human-readable. To me, one of the key requirements for the lock file format should be that it's diffable. Lock files tend to become difficult to grasp, but resolving conflicts in them should still be possible. Cargo went through a similar process; I think we can learn from that!

jvansanten commented 2 years ago

That's a really good point, thanks for pointing to the Cargo discussion!

FYI, here's an example of (and model for) what we settled on for conda-lock after some back-and-forth with @mariusvniekerk and @wolfv. After skimming the Cargo thread, I think we've addressed most of the points raised there, namely:

It's now YAML, so fairly readable
Package entries are sorted by (manager, name, platform), so updates to e.g. a single conda package will be a contiguous diff, even if it spans multiple platforms
Package hashes are in the package entry
The metadata section depends only on the dependency specification, so is stable under lock refreshes
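The sort order mentioned in the list above can be illustrated with a short Python sketch. The entry layout (dicts with `manager`, `name`, and `platform` keys) is an assumption for illustration, not a copy of the conda-lock model.

```python
def sort_entries(entries: list[dict]) -> list[dict]:
    """Order entries by (manager, name, platform) for stable, contiguous diffs."""
    return sorted(entries, key=lambda e: (e["manager"], e["name"], e["platform"]))
```

With this key, all platform variants of one conda package sit next to each other, so a version bump of that package shows up as a single block in a diff.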

Are there any other considerations we missed?

wolfv commented 2 years ago

@jvansanten @mariusvniekerk one small nitpick I have would be that maybe instead of hash it could be md5 OR sha256 (or both) as keys.

Or alternatively it could be hash: md5-xyz or hash: sha256-xyz or some similar format.

Or

hash:
  sha256: xyz
  md5: abc
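A consumer could accept both the prefixed-string form and the mapping form by normalizing them to one shape. The following Python sketch is only an illustration of that idea; `normalize_hash` and `SUPPORTED_ALGORITHMS` are hypothetical names.

```python
SUPPORTED_ALGORITHMS = ("md5", "sha256")


def normalize_hash(value) -> dict:
    """Turn 'sha256-xyz' or {'sha256': 'xyz', 'md5': 'abc'} into a dict."""
    if isinstance(value, dict):
        hashes = dict(value)
    elif isinstance(value, str):
        algorithm, sep, digest = value.partition("-")
        if not sep or not digest:
            raise ValueError(f"expected '<algorithm>-<digest>', got {value!r}")
        hashes = {algorithm: digest}
    else:
        raise TypeError(f"unsupported hash value: {value!r}")
    unknown = set(hashes) - set(SUPPORTED_ALGORITHMS)
    if unknown:
        raise ValueError(f"unsupported hash algorithms: {sorted(unknown)}")
    return hashes
```

The mapping form has the advantage that a package can carry both digests at once, which the single prefixed string cannot express.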

baszalmstra commented 2 years ago

@jvansanten Yes thanks!

Quick side question: Does conda-lock also support minimal updates?

maresb commented 2 years ago

In order to make it truly human-readable, I'm of the opinion that the metadata should include a small amount of text to briefly explain what the lockfile is for, and also how to update it (possibly with a dynamically-generated update command).

What I have in mind is that people might encounter the lockfile who are not devops-savvy. They might not even understand Conda environments. I'd like to be able to reach such people.

For there to be any hope of my work colleagues adopting this, it needs to be extremely easy-to-use.

maresb commented 2 years ago

Maybe this is weird, but I'm also somewhat interested in reproducibility of the lockfile itself. The solution generated in the lockfile depends on the solver used, and also the input to the solver. I think it would be nice to include the relevant version numbers for the particular solver used...

As for the input to the solver, it is roughly the knowledge of available packages at a given time, possibly filtered by some sort of trust policy (which doesn't currently exist). Thus I'm also interested in including the time of solve.

Unfortunately the solver timestamp itself is not so reproducible. I have a few ideas for mitigating this...

On an update, only update the timestamp if the dependencies actually change. This makes the update operation idempotent over short time intervals. The disadvantage is that there could be a merge conflict if two people generate equivalent lockfiles in parallel.
Include the upload timestamp for all included packages. Then the "solver timestamp" could be taken to be the max timestamp over all included packages. That way there is no user-generated timestamp, and one could roughly (there are complications and exceptions) reproduce the relevant available packages with a simple time filter.
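The second idea can be sketched in a few lines of Python. The per-package `timestamp` field is hypothetical (it is the "upload timestamp" proposed above), and the time filter ignores the complications and exceptions mentioned.

```python
from datetime import datetime, timezone


def solve_timestamp(packages: list[dict]) -> datetime:
    """Derive the 'solver timestamp' as the newest upload time in the lock."""
    return max(p["timestamp"] for p in packages)


def visible_at(candidates: list[dict], cutoff: datetime) -> list[dict]:
    """Roughly reproduce the package index as it looked at the cutoff."""
    return [c for c in candidates if c["timestamp"] <= cutoff]
```

Because the derived timestamp depends only on the locked packages, two people who produce the same solution get the same value, which avoids the merge conflict described in the first idea.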

maresb commented 2 years ago

@baszalmstra do you mean this from conda-incubator/conda-lock#131?

# To update a single package to the latest version compatible with the version constraints in the source:
#     conda-lock lock --lockfile conda-lock.yml --update PACKAGE

Actually, due to my own ignorance I don't understand what exactly this means... Should it install the latest possible version of PACKAGE which is compatible with the constraints in the source, while upgrading only those dependencies which are incompatible with the new version? (Alternatively you could for example try to do a fresh install of PACKAGE in a new environment and then try to merge the existing environment with the new one.)

baszalmstra commented 2 years ago

Haha, yes, that was my question too. How do yarn, npm or cargo solve this?

jvansanten commented 2 years ago

Actually, due to my own ignorance I don't understand what exactly this means... Should it install the latest possible version of PACKAGE which is compatible with the constraints in the source, while upgrading only those dependencies which are incompatible with the new version?

This is what it does, yes. There's an example in the README, but if you have suggestions for how to explain this behavior more clearly, I'm all ears.

maresb commented 2 years ago

Ok, very interesting! As for the README, I explained some thoughts in #129. In general, I hope that I can turn my ignorance into high-quality documentation. :) But before I can help to write anything sensible, I feel like I need to understand better what's going on.

I actually have a followup question since my previous question was very imprecise. Namely, I phrased it as if PACKAGE had no subdependencies. Can you explain briefly what sort of solving algorithm is used? (What I'm trying to get at is that since dependencies can be deeply interconnected, how do you restrict the solver from simply upgrading all packages?)

For instance, would updating a package be equivalent to iterating on the PACKAGE versions ver from latest to current, and then running conda | mamba install PACKAGE=ver until it succeeds? (BTW, is there such a conda command for upgrading to the latest installable version of a single package?) I realize now that I don't even know what algorithms conda/mamba use for this... :man_facepalming: (I suppose I never had the opportunity to ask before now.)
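To pin down the brute-force strategy being asked about, here is a Python sketch. It is not a description of what conda, mamba, or conda-lock actually do. The `can_install` callback is hypothetical and stands in for "try to install PACKAGE=ver into the otherwise unchanged environment and report whether it succeeds".

```python
from typing import Callable, Sequence


def newest_installable(
    versions_newest_first: Sequence[str],
    current: str,
    can_install: Callable[[str], bool],
) -> str:
    """Walk versions from latest down to current; return the first that installs."""
    for version in versions_newest_first:
        if version == current:
            break  # nothing newer worked, keep what we have
        if can_install(version):
            return version
    return current
```

Whether other packages are allowed to move when `can_install` is evaluated is exactly the open part of the question; this sketch leaves that decision to the callback.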

wolfv commented 2 years ago

@jvansanten is the idea of the manager: ... key to memorize which package manager installed what package?

An important piece of information to record could be the "requested specs". conda does this implicitly in the $PREFIX/conda-meta/history file. However, I am not sure if we have this knowledge for pip.

Requested specs would be nice so that we can "prune" the environment later on (even for other package managers) -- e.g. remove not-anymore-requested specs (as asked here: https://github.com/mamba-org/mamba/issues/1333)

zmbc commented 2 years ago

For instance, would updating a package be equivalent to iterating on the PACKAGE versions ver from latest to current, and then running conda | mamba install PACKAGE=ver until it succeeds?

@maresb @jvansanten

From this comment in the source code, what I've gleaned is that:

Updating a package always updates that package to the latest version compatible with your spec.
Dependencies of that package, if still compatible, are not changed.
Packages that depend on the updated package, if still compatible, are not changed.

What I still don't understand is what happens when other packages are not compatible with the newly updated package. I see a few things that could be done:

Update them to the latest compatible version. If even more packages need to be updated now, so be it.
Update them to the most recent version that does not require updating any other packages (or, if there is no such version, the most recent that does not require updating two packages, and so on).
Update them the minimum amount required by the new spec. Or maybe the minimum major.minor version, then apply algorithm 2 to the patch versions.

I don't think there's an obvious best way to do this. It's a hard problem that has definitely been grappled with before, and there are some pretty convoluted solutions. For example, Ruby's package manager Bundler updates to the latest if the spec isn't compatible OR there are no transitive dependencies.

I see --update as the key feature of conda-lock. If the updating algorithm was good, easy enough to understand, and clearly stated in the documentation, it would be a huge win.

dhirschfeld commented 2 years ago

Requested specs would be nice so that we can "prune" the environment later on

My understanding of @maresb's request for including the source environment.yaml is for exactly this reason. The lockfile will include all transitive dependencies which the author of the environment.yaml might not care about.

The original environment.yaml is the requested spec and the lockfile is the fulfilment of that spec at a given point in time. If a user wants to update an environment created from a lockfile they will always carry around all the transitive dependencies, even if they're no longer required by the updated requested dependencies.

If the user was able to update the lockfile from the original requested spec (environment.yaml), then they could prune no-longer-required transitive dependencies from their environment. Including the source environment.yaml in the lockfile metadata ensures that the lockfile can be updated/recreated from the original spec.
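The pruning argument can be made concrete with a small Python sketch: once the requested specs are known, anything not reachable from them through the dependency graph is a candidate for removal. The function names and the shape of `depends` (package name mapped to the names it depends on) are assumptions for illustration.

```python
def reachable(requested: list[str], depends: dict[str, list[str]]) -> set[str]:
    """Return the requested packages plus all their transitive dependencies."""
    keep: set[str] = set()
    stack = list(requested)
    while stack:
        name = stack.pop()
        if name in keep:
            continue
        keep.add(name)
        stack.extend(depends.get(name, ()))
    return keep


def prunable(installed: list[str], requested: list[str],
             depends: dict[str, list[str]]) -> set[str]:
    """Return installed packages that no requested spec needs any more."""
    return set(installed) - reachable(requested, depends)
```

Without the requested specs, `requested` would have to be the whole installed set, and nothing could ever be pruned.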

dhirschfeld commented 2 years ago

You could then also allow adding new packages to the requested deps or specifying other constraints when updating the lockfile - e.g.

conda-lock update --file=env.lock 'pandas>=1.4' mynewpackage

shughes-uk commented 2 years ago

What is the status of this feature? https://github.com/mamba-org/mamba/pull/1577 seems to have implemented them but my attempts to use

micromamba install -f conda-lock.yml -n testing2

result in


                                           __
          __  ______ ___  ____ _____ ___  / /_  ____ _
         / / / / __ `__ \/ __ `/ __ `__ \/ __ \/ __ `/
        / /_/ / / / / / / /_/ / / / / / / /_/ / /_/ /
       / .___/_/ /_/ /_/\__,_/_/ /_/ /_/_.___/\__,_/
      /_/

warning  libmamba 'root_prefix' set with default value: /Users/samanthahughes/micromamba
Transaction

  Prefix: /Users/samanthahughes/micromamba/envs/testing2

  Nothing to do

Transaction starting
Transaction finished

Micromamba 0.25

wolfv commented 2 years ago

@shughes-uk I think the file has to end with .lock

shughes-uk commented 2 years ago

@shughes-uk I think the file has to end with .lock

Mamba hits me with

EnvironmentFileExtensionNotValid: '/Users/samanthahughes/programming/cloud/conda-lock.lock' file extension must be one of '.txt', '.yaml' or '.yml'

micromamba hits me with


                                           __
          __  ______ ___  ____ _____ ___  / /_  ____ _
         / / / / __ `__ \/ __ `/ __ `__ \/ __ \/ __ `/
        / /_/ / / / / / / /_/ / / / / / / /_/ / /_/ /
       / .___/_/ /_/ /_/\__,_/_/ /_/ /_/_.___/\__,_/
      /_/

warning  libmamba 'root_prefix' set with default value: /Users/samanthahughes/micromamba
critical libmamba Invalid spec, no package name found:

shughes-uk commented 2 years ago

Here's the lockfile (renamed conda-lock.txt from conda-lock.lock so github would let me upload it)

conda-lock.txt

wolfv commented 2 years ago

Should've taken a proper look before -- ...-lock.yml or ...-lock.yaml is the magic ending.
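Expressed as code, the naming rule from this comment would look roughly like the Python sketch below. It is based only on the comment above and is not taken from micromamba's source; `looks_like_lockfile` is a made-up name.

```python
def looks_like_lockfile(filename: str) -> bool:
    """Return True for names ending in -lock.yml or -lock.yaml."""
    return filename.endswith(("-lock.yml", "-lock.yaml"))
```

Under this rule `conda-lock.yml` qualifies, while `conda-lock.lock` and a plain `env.yaml` do not.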

shughes-uk commented 2 years ago

I was using conda-lock.yml in my first attempt. Tried .yaml for good luck but still no joy.

wolfv commented 2 years ago

What platform are you on? E.g. I tried your lockfile on an M1 Mac, and nothing got done (because there are no packages in the lockfile for this platform).

shughes-uk commented 2 years ago

Ahhh, that would do it. Thank you!! Perhaps a nice fix here would be to make the lack of a relevant platform section an error instead?
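The suggested behavior could look like the following Python sketch. It is not micromamba's implementation. It assumes the lockfile has been loaded into a dict with a flat `package` list whose entries carry a `platform` key, and `MissingPlatformError` is a hypothetical exception name.

```python
class MissingPlatformError(Exception):
    """Raised when a lockfile has no packages for the requested platform."""


def packages_for_platform(lock: dict, platform: str) -> list[dict]:
    """Select the entries for one platform, failing loudly if there are none."""
    entries = lock.get("package", [])
    selected = [p for p in entries if p.get("platform") == platform]
    if not selected:
        available = sorted({p["platform"] for p in entries if "platform" in p})
        raise MissingPlatformError(
            f"lockfile has no packages for {platform}; available: {available}"
        )
    return selected
```

Raising instead of returning an empty list turns the silent "Nothing to do" into a message that names the platforms the lockfile does cover.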

maresb commented 2 years ago

I've been trying to make serious use of the new lock file format lately. I've encountered various issues with Micromamba's implementation, all of which I've reported.

I think I've also discovered a logical oversight in the specification itself, namely with the category field:

category: str = "main"

Problem: This construction requires that each package belongs to a single category, but packages should be able to belong to multiple categories. (For instance, I want pip to be both a main and dev requirement.)

Suggestion: Convert this to

categories: list[str] = ["main"]

Background: As a refresher, we can have multiple environment files, for instance environment-main.yml and environment-dev.yml. In environment-dev.yml, I can add a top-level category: dev entry. Now when I run conda-lock -f environment-main.yml -f environment-dev.yml, each resulting dev package entry in conda-lock.yml will inherit category: dev.

Let's suppose I'm developing a containerized app. I have a devcontainer where I install both main and dev dependencies, and I have a production container where I install only the main dependencies. In order to be able to run pip install for setting up the production container, I need Pip to be installed. However, if I list pip in environment-dev.yml, it acquires category: dev, and thus it will not be installed in my production container.

As a workaround, I have to remove pip from my environment-dev.yml. But I think I should be able to leave pip in both environment files.
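With the suggested `categories` list, the production/dev scenario above could be handled as in this Python sketch. The entry layout and the function name `should_install` are assumptions; the default of `["main"]` follows the suggestion above.

```python
def should_install(pkg: dict, install_categories: set[str]) -> bool:
    """Install a package if it belongs to at least one requested category."""
    categories = set(pkg.get("categories", ["main"]))
    return bool(categories & install_categories)
```

A `pip` entry with `categories: [main, dev]` then passes `should_install(pkg, {"main"})` in the production container, while a dev-only package does not.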

Question: Does my suggestion make sense, or am I somehow thinking about this in the wrong way?

Thanks!

jonashaag commented 2 years ago

One way to solve for dev-only, prod-only, and both-dev-and-prod would be to have 3 different categories. But that will require 2^n categories, generally speaking. I agree that it would be nice to be able to attach multiple categories.

maresb commented 2 years ago

~We should probably also include a field for the schema version so that we can recognize when the lockfile's consumer needs to be updated.~

jvansanten commented 2 years ago

I also agree that multiple categories would be handy. All the dependency solvers I'm aware of demand that the total solution is self-consistent, i.e. if you install all packages, you should get no conflicts. The same should be true of any subset, and there's no reason those subsets need to be disjoint, other than the fact that some solvers like poetry treat them that way.

mamba-org / mamba

New lock file format #1209