nf-core / tools

Python package with helper tools for the nf-core community.
https://nf-co.re
MIT License

Create frozen Conda environments for modules #2193

Open pinin4fjords opened 1 year ago

pinin4fjords commented 1 year ago

Description of feature

Problem

Conda environments are not reproducible over time. Because dependency trees are often large, solving the same environment next week can give you a different software stack from the one you have today. This is bad for reproducible science.

The common workaround has been to use Docker images, which have the effect of freezing dependency trees. But if you find yourself rebuilding Docker images (e.g. to patch security issues), you lose those frozen dependencies. Some (e.g. Paolo, I think) would say that we should really be using Docker as a software delivery mechanism only.

A better way of doing this is to record the state of the environment whenever a module is created or its conda dependencies are updated, producing a frozen dependencies file that can be used to create environments when the workflows are run.

Available solutions

pythonspeed has an excellent (if not quite up to date) summary of this.

Essentially there are two ways to go.

conda env export

Create the environments, immediately record their state.
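A minimal sketch of this option, assuming conda is on the PATH and the environment already exists. The function names are hypothetical and are not part of nf-core/tools; only the `conda env export` call itself is a real command.

```python
import subprocess
from pathlib import Path


def export_command(env_name: str, out_file: Path) -> list[str]:
    """Build a `conda env export` call that writes the solved environment to a file."""
    return ["conda", "env", "export", "--name", env_name, "--file", str(out_file)]


def freeze_env(env_name: str, out_file: Path) -> None:
    """Run the export (hypothetical helper; requires conda and an existing env)."""
    subprocess.run(export_command(env_name, out_file), check=True)
```

Note that the exported file pins exact versions and builds for the platform it was exported on, so a file like this would be needed per architecture.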

conda-lock

See https://github.com/conda/conda-lock.
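As a rough sketch of this option, the following builds a `conda-lock lock` call for several platforms. The helper name and the example file names are assumptions for illustration; it relies on conda-lock being installed when the command is actually run.

```python
import subprocess
from pathlib import Path


def lock_command(env_file: Path, platforms: list[str], lockfile: Path) -> list[str]:
    """Build a `conda-lock lock` call covering every requested platform."""
    cmd = ["conda-lock", "lock", "--file", str(env_file), "--lockfile", str(lockfile)]
    for platform in platforms:
        # One --platform flag per target architecture, e.g. linux-64, osx-arm64
        cmd += ["--platform", platform]
    return cmd


def write_lockfile(env_file: Path, platforms: list[str], lockfile: Path) -> None:
    """Run conda-lock (hypothetical helper; requires conda-lock on the PATH)."""
    subprocess.run(lock_command(env_file, platforms, lockfile), check=True)
```

An environment could then be created from the result with `conda-lock install --name <env> <lockfile>`.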

How I imagine the tools commands working

I don't know how we might persuade Nextflow itself to create the environments from lock files at run time. So imagine a different sequence:

nf-core modules conda-lock - Runs conda-lock, creates lockfiles for all architectures required

nf-core init-locked_envs - Creates environments from the lockfiles for all the modules of a workflow that have them.

Then, when the workflow is run, the module environments are all recognised as being in place, and off we go. This could work incrementally, such that environments were still created on the fly for modules lacking lock files.
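To illustrate the incremental idea, here is a sketch that splits a workflow's modules into those with a lock file and those without. The function name and the lock file name (`conda-lock.yml`, conda-lock's default output name) are assumptions, not an agreed convention.

```python
from pathlib import Path

# Assumed lock file name; any per-module naming convention would work the same way.
LOCK_NAME = "conda-lock.yml"


def split_modules(modules_root: Path) -> tuple[list[Path], list[Path]]:
    """Return (locked, unlocked) module directories under modules_root."""
    locked: list[Path] = []
    unlocked: list[Path] = []
    # Every directory containing an environment.yml is treated as a module
    for env_file in sorted(modules_root.rglob("environment.yml")):
        module_dir = env_file.parent
        if (module_dir / LOCK_NAME).is_file():
            locked.append(module_dir)
        else:
            unlocked.append(module_dir)
    return locked, unlocked
```

Environments would be pre-built only for the first list; modules in the second list would fall back to being solved on the fly as they are today.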

Potential problems

pinin4fjords commented 1 year ago

See also Paolo's post in #bioconda https://nfcore.slack.com/archives/CM46YC6BZ/p1677007405615889

pinin4fjords commented 1 year ago

See also discussion

edmundmiller commented 7 months ago

I believe wave supports conda-lock files now!

My issue would be with readability on the environment.yml. I kinda just want to see what exactly we want and not the 100 dependencies.

pinin4fjords commented 7 months ago

@Emiller88 maybe we need an environment-lock.yml in addition to the environment.yml? I know, another file, but it would serve the different use cases of complete reproducibility vs flexible environment solve.

Would get messy with different architectures though...

edmundmiller commented 7 months ago

Maybe a .conda directory to keep it cleaner?

I think it's a trade-off at the end of the day.

If you want to be sure about reproducibility, you use the container images.

If you want to roll the dice, use conda. It'll get you pretty close 95% of the time.

pinin4fjords commented 7 months ago

See where you're coming from, don't completely agree.

I should be able to inspect the package complement of a frozen software env without poking about in a Docker image, and in an ideal world I'd like to be able to tweak an env to add something simple without rebuilding the whole thing (though since the new thing may have its own deps, I appreciate that's not a given).

edmundmiller commented 2 weeks ago

I think this was in a time before tests/ and everything else in a modules directory. I think having both an environment.yml and an environment.lock.yml isn't ridiculous at this point.

My issue is if they'll get updated and maintained.

I think we can automate this now.

Bump the environment.yml -> Create a lock file -> Pass the lock file to wave
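One way the maintenance worry could be checked automatically, as a sketch only: given the files changed in a commit or pull request, flag modules whose environment.yml was bumped without the lock file next to it being regenerated. The function name is hypothetical, and the lock file name follows the environment.lock.yml suggestion above. The Wave step is not covered here.

```python
from pathlib import PurePosixPath


def modules_needing_relock(changed_files: list[str],
                           lock_name: str = "environment.lock.yml") -> list[str]:
    """Return module directories whose environment.yml changed without its lock file."""
    changed = {PurePosixPath(p) for p in changed_files}
    stale = []
    for path in sorted(changed):
        # A bumped environment.yml is stale if its sibling lock file is not in the same change
        if path.name == "environment.yml" and path.with_name(lock_name) not in changed:
            stale.append(str(path.parent))
    return stale
```

A CI job could fail, or regenerate the lock file itself, whenever this list is non-empty.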

ewels commented 2 weeks ago

maybe we need an environment-lock.yml in addition to the environment.yml

Same as package.json and package-lock.json for npm. This is what I'd expect for conda lockfiles tbh.

Automation as @edmundmiller says FTW 👍🏻