Create frozen Conda environments for modules

pinin4fjords commented 1 year ago

Description of feature

Problem

Conda environments are not reproducible over time. The sometimes large dependency trees mean you get a different software stack next week to the one you have today. This is bad for reproducible science.

The often-used workaround for this has been Docker images, which have the effect of freezing dependency trees. But if you find yourself rebuilding Docker images (e.g. to patch security issues), you lose those frozen dependencies. Some (e.g. Paolo, I think) would say that we should really be using Docker as a software delivery mechanism only.

A better way is to record the state of the environment when modules are created, and again whenever their conda dependencies are updated, producing a frozen dependencies file that can be used to create environments when the workflows are run.

Available solutions

pythonspeed has an excellent (if not quite up to date) summary of this.

Essentially there are two ways to go.

`conda env export`

Create the environments, immediately record their state.

Advantages: no extra software required
Disadvantages: difficult for developers to do on a single machine, since separate environments would be required for e.g. macOS and Linux. Maybe it could be done with different machines in CI?

`conda-lock`

See https://github.com/conda/conda-lock.

Advantages:
- Can make multi-platform lock files
- Bypasses the conda solver at install time (you're basically just storing a list of URIs to the package archives). That could speed things up significantly.
Disadvantages:
- Requires more software
- Users would need to install conda-lock to re-create environments at run time.
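
As a rough illustration of the multi-platform point, here is a minimal Python sketch that only builds (and does not run) a conda-lock command line for one module. The platform list, the lock file name and the example module path are assumptions for illustration, not anything agreed in this issue; the `--file`, `--platform` and `--lockfile` options are the ones I understand the `conda-lock lock` subcommand to accept.

```python
from pathlib import Path

# Assumed platform list; the issue only names macOS and Linux as examples.
DEFAULT_PLATFORMS = ("linux-64", "osx-64")


def conda_lock_command(env_file, platforms=DEFAULT_PLATFORMS, lockfile=None):
    """Build (but do not run) a conda-lock invocation for one module."""
    env_file = Path(env_file)
    # Assumed convention: keep the lock file next to the module's environment.yml.
    lockfile = Path(lockfile) if lockfile else env_file.with_name("conda-lock.yml")
    cmd = ["conda-lock", "lock", "--file", str(env_file), "--lockfile", str(lockfile)]
    for platform in platforms:
        cmd += ["--platform", platform]
    return cmd


if __name__ == "__main__":
    # Hypothetical module path, printed as a shell-style command line.
    print(" ".join(conda_lock_command("modules/example/environment.yml")))
```

Returning the argument list rather than executing it keeps the sketch testable on machines without conda-lock installed; a real tools command would hand the list to `subprocess.run`.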

How I imagine the tools commands working

I don't know how we might persuade Nextflow itself to create the environments from lock files at run time. So imagine a different sequence:

- nf-core modules conda-lock: runs conda-lock and creates lock files for all required architectures.
- nf-core init-locked_envs: creates environments from the lock files of all the modules of a workflow that have them.

Then, when the workflow is run, the module environments are all recognised as being in place, and off we go. This could work incrementally, such that environments were still created on the fly for modules lacking lock files.
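
The incremental fallback could be sketched roughly like this in Python. The per-platform file name `conda-<platform>.lock` and the function name are assumptions made for illustration; nothing here is an existing nf-core or Nextflow API.

```python
from pathlib import Path


def environment_source(module_dir, platform):
    """Pick what to build a module's environment from.

    Returns ("locked", path) when a lock file for this platform exists,
    otherwise ("solve", path to environment.yml).
    """
    module_dir = Path(module_dir)
    # Assumed naming scheme: one lock file per platform inside the module directory.
    lock_file = module_dir / f"conda-{platform}.lock"
    if lock_file.is_file():
        return "locked", lock_file
    # No lock file yet: fall back to an on-the-fly solve from environment.yml.
    return "solve", module_dir / "environment.yml"
```

Modules without a lock file keep today's behaviour, so lock files could be rolled out one module at a time.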

Potential problems

- Rebuilding lock files when conda packages are bumped.
- CI to ensure the above.
- There may be some overlap with all the new funky Wave stuff.
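
For the CI check, one possible approach (an assumption on my part, not something conda-lock or nf-core tools is confirmed to do this way) is to record a digest of the environment.yml whenever the lock file is generated, and fail CI when the current file no longer matches. A minimal Python sketch, with hypothetical function names:

```python
import hashlib


def env_digest(env_text):
    """SHA-256 of environment.yml contents, ignoring line endings and trailing spaces."""
    normalised = "\n".join(line.rstrip() for line in env_text.splitlines())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def lock_is_stale(env_text, recorded_digest):
    """True when no digest was recorded or environment.yml changed since locking."""
    return recorded_digest is None or env_digest(env_text) != recorded_digest
```

A CI job would call `lock_is_stale` for each module and ask for the lock file to be regenerated when it returns `True`. Where the digest is stored (a sidecar file, or metadata inside the lock file) is left open here.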

pinin4fjords commented 1 year ago

See also Paolo's post in #bioconda https://nfcore.slack.com/archives/CM46YC6BZ/p1677007405615889

pinin4fjords commented 1 year ago

See also discussion

edmundmiller commented 7 months ago

I believe wave supports conda-lock files now!

My issue would be with readability on the environment.yml. I kinda just want to see what exactly we want and not the 100 dependencies.

pinin4fjords commented 7 months ago

@Emiller88 maybe we need an environment-lock.yml in addition to the environment.yml? I know, another file, but it would serve the different use cases of complete reproducibility vs flexible environment solve.

Would get messy with different architectures though...

edmundmiller commented 7 months ago

Maybe a .conda directory to keep it cleaner?

I think it's a trade-off at the end of the day.

If you want to be sure about reproducibility, you use the container images.

If you want to roll the dice, use conda. It'll get you pretty close 95% of the time.

pinin4fjords commented 7 months ago

See where you're coming from, don't completely agree.

I should be able to inspect the package complement of a frozen software env without poking about in a Docker image, and in an ideal world I'd like to be able to tweak an env to add something simple without rebuilding the whole thing (though since the new thing may have its own deps, I appreciate that's not a given).

edmundmiller commented 2 weeks ago

I think this was in a time before tests/ and everything else in a modules directory. I think having both an environment.yml and an environment.lock.yml isn't ridiculous at this point.

My issue is if they'll get updated and maintained.

I think we can automate this now.

Bump the environment.yml -> Create a lock file -> Pass the lock file to wave

ewels commented 2 weeks ago

> maybe we need an environment-lock.yml in addition to the environment.yml

Same as package.json and package-lock.json for npm. This is what I'd expect for conda lockfiles tbh.

Automation as @edmundmiller says FTW 👍🏻
