ewels opened 5 months ago
It is still unclear to me that images generated on demand will not harm reproducibility.
Was this ever evaluated?
This will also increase the used CPU hours, right? Also, it feels a bit like Seqera will have a monopoly on containers. I am aware that Seqera did an awesome job developing Nextflow open source. But moving all container logic to Seqera (and I don't know the details here, maybe I am uninformed) leaves a weird taste.
What’s the purpose of enforcing this? Biocontainers are automatically generated for every Bioconda package, and get regenerated automatically upon a Bioconda software version bump. I’m not seeing the reason to then manually create a Seqera container.
Hi both - thanks for your comments. You're right that this issue precedes some community discussion that we still need to have. That started with the recent two bytesize talks and resulting conversations on Slack, but we should still open it up to wider input.
To address your concerns:
It is still unclear to me that images generated on demand will not harm reproducibility.
Wave generates images on demand, but Seqera Containers is a registry that sits behind Wave. The intention here is that the images are generated on demand by the developer when a package is updated - but then they are cached in the Seqera Containers registry. The image URIs will then be hardcoded into pipelines and the exact same container images will always be fetched by all users - just the same as they are today. We're also going to introduce conda-lock files (see https://github.com/nf-core/modules/issues/5835) so reproducibility should be even better than it is today.
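To make the hardcoding concrete: under this scheme a module pins the cached image by its full, immutable URI in the `container` directive. This is a hypothetical sketch (process name, tool and URI are all made up, not taken from any real module):

```groovy
// Hypothetical nf-core module sketch - the image URI below is illustrative only
process EXAMPLE_TOOL {
    // Hardcoded, immutable URI served from the Seqera Containers registry cache
    container 'community.wave.seqera.io/library/example_tool:1.2.3--0123456789abcdef'

    input:
    path reads

    script:
    """
    example_tool --input $reads
    """
}
```

Because the URI includes a content hash, every user who runs the pipeline pulls exactly the same cached image, just as with today's hardcoded BioContainers URIs.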
This will also increase the used CPU hours, right?
No - Wave / Seqera Containers handles the build server side. As mentioned above, the generated images are stored in a registry and simply downloaded. So just as today, native images will be downloaded. No increase in CPU hours.
Also, it feels a bit like Seqera will have a monopoly on containers. But moving all container logic to Seqera leaves a weird taste.
This one is more subjective. We will not make it a requirement to use Seqera Containers, just as we don't make it a requirement to use BioContainers today, so for me it feels about the same. We will keep the vast majority of build logic (eg. conda env files, conda lock files) on the nf-core side and will be free to reverse the decision at any point should we wish.
Biocontainers are automatically generated for every Bioconda package, and get regenerated automatically upon a Bioconda software version bump. I’m not seeing the reason to then manually create a Seqera container.
One of the main reasons for adopting Seqera Containers is that it'll have even more automation and less manual work than the current setup. The process will roughly be:
- `environment.yml` files created or edited in a PR
Note that this process will also work for multi-package containers, which is not currently the case with BioContainers (mulled images). So it should represent a significantly easier workflow.
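For reference, the kind of multi-package environment file this flow starts from might look like the following (the channels are the standard nf-core ones; the package pins are hypothetical examples):

```yaml
# Hypothetical environment.yml - package versions are illustrative only
channels:
  - conda-forge
  - bioconda
dependencies:
  - bioconda::samtools=1.20
  - bioconda::bcftools=1.20
```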
Note that although Seqera Containers has a web interface (https://seqera.io/containers/) it also works programmatically via CLI, API and Nextflow (eg. `nextflow inspect`). Check out the recent bytesize with YouTube recording to see all this in action.
BioContainers has been brilliant for nf-core, but there are several reasons to move away:
Wave and Seqera Containers have been built specifically for our community, based on our combined experience and needs. So hopefully we can mitigate / avoid these pitfalls.
I hope these responses help clear things up! Shout if you have any questions or concerns, and I'd recommend checking out the podcast and bytesize videos in the top comment as they go through how much of this works.
To provide a practical example of these points:
- The BioContainers base image is outdated, with old Docker Image Format v1 and manifest version 2
- We have limited / no control over image generation to solve any of the above issues
When opening a PR to chipseq I found the "docker image format v1" error, see here. To fix the error, I tried to bump to a newer version of the tool (`phantompeakqualtools`), but it turns out that the image's last available version was built in March 2021. Anyway, I tried to update the image on the module and found the same error again, see here.
In this case, the most straightforward fix would be to use the Wave version of the package, since this image will be compliant with the new Docker specifications. Otherwise, we will have to wait for Bioconda to update the images (I am not sure whether there is an established timeline for this), or do a dirty hack such as creating a mulled image with `phantompeakqualtools` and a random small package to trigger a new build.
As shown here, using Wave images fixes the issue above.
I just want to clarify a difference in case anyone missed it:
1. Using Wave to build our container images instead of the mulled-tools and BioContainers infrastructure.
2. Hosting our container images on Seqera Containers, which is a container registry.
For 1, I think it's not controversial. Wave is open source; we're not relying on Seqera. We're grateful they host the service, but we could host it ourselves if we needed to. Just like using Platform for megatests.
For 2, using Seqera Containers as a registry can seem controversial, but it's really not. Right now, we're relying on BioContainers to host our Singularity and Docker images (on quay.io). We've had issues with pushback from BioContainers on updates, and we've had uptime issues with quay.io.
If people felt more comfortable, we could point Wave at any time to use Docker Hub, quay.io, GitHub Containers, or host our own ECR (AWS's image registry). But from a practicality standpoint, that's not nf-core's main skill; we don't have the resources to waste on hosting our own registry.
The main takeaway: we need more flexibility in where our containers are hosted and in how they are built, which Wave gives us. We'd also like the better uptime that comes from using Seqera Containers as a registry.
This will make it easier for end users to move their containers to a private registry if they want to back them up:
```groovy
wave.build.repository      = 'quay.io/my/lab/repo'
wave.build.cacheRepository = 'quay.io/my/lab/cache-repo'
```
https://www.nextflow.io/docs/latest/wave.html#push-to-a-private-repository
As per the conversation on Slack: will this change be compatible with the AWS ECR "pull through cache"?
https://docs.aws.amazon.com/AmazonECR/latest/userguide/pull-through-cache.html
@ewels
The intention here is that the images are generated on demand by the developer when a package is updated - but then they are cached in the Seqera Containers registry. The image URIs will then be hardcoded into pipelines and the exact same container images will always be fetched by all users - just the same as they are today.
so, the containers held on the Seqera Containers Registry will be a drop-in replacement for whatever the current container is, for each module?
Wave containers seem to generally always have some lifecycle associated with them; that is not going to be the case for these containers right? If we need to come back in a year or two and re-pull some container from Seqera Container Registry, it will still be there?
also, on a side note, where are the build files and logs going to be stored for these containers? Are there Dockerfiles available somewhere? I guess that applies to the current biocontainers too
so, the containers held on the Seqera Containers Registry will be a drop-in replacement for whatever the current container is, for each module?
Correct.
Wave containers seem to generally always have some lifecycle associated with them; that is not going to be the case for these containers right? If we need to come back in a year or two and re-pull some container from Seqera Container Registry, it will still be there?
Exactly. That was the exact motivation for the project. Due to the registry cache, they will be there forever* (we're saying a minimum of 5 years from when they're built, but at present we have no intention of ever deleting any).
also, on a side note, where are the build files and logs going to be stored for these containers? Are there Dockerfiles available somewhere? I guess that applies to the current biocontainers too
Current BioContainers don't have Dockerfiles; they're built dynamically on CI. Seqera Containers do have Dockerfiles + conda files, which are stored with the build log. I'm not 100% sure that we're guaranteeing to store those for the same duration as the images, but I think that we are. I can check if it's a concern.
It might be nice for us to build some system to store those + security scan results / SBOM files somewhere in nf-core as a duplicate / backup. I'd certainly like to make them visible from the nf-core website module page as a minimum anyway.
In the same vein, I'm slightly uncomfortable with people already adding Wave containers to nf-core modules. I know that the Wave registries can only be pushed to from the Wave build system (right?) so there's no way someone can tamper with a container and ship a trojaned samtools. But for instance there is `cutadapt_muscle_vsearch_wget_pruned` in a recent PR. How can I verify that what's installed in the container is exactly what the person entered in `environment.yml`, which says there should be the pip "crabs" package too?
Great question - I'm just putting together a blog post covering much of this stuff, I will bulk up the part about this as it's important.
If the container URL starts with `community.wave.seqera.io` then it's specifically part of the "Seqera Containers" project. This is backed by Wave but it is a narrower subset that's restricted: it can only be built from a conda `environment.yml` file, and it can only be pushed to by Wave, never manually. Everything is fully automated from the `environment.yml` file, so there will be a full online audit log for the whole process, visible to everyone.

For example, take `community.wave.seqera.io/library/cutadapt_muscle_vsearch_wget_pruned:04f6c0370c0226c5` - the part at the end is the hash, and by appending a `_1` we can get the build ID. From that you can find the build details webpage (if you know the URL structure).
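To illustrate that URL structure (assuming the `name:hash` tag layout shown above, with `_n` appended to the hash to form the build ID), here is a small parsing sketch - not an official tool, just string handling:

```python
def parse_seqera_container_uri(uri: str, retry: int = 1) -> dict:
    """Split a community.wave.seqera.io URI into registry, image name,
    tag hash, and the derived build ID (hash plus retry suffix)."""
    registry_and_name, _, tag = uri.rpartition(":")
    registry, _, name = registry_and_name.partition("/library/")
    return {
        "registry": registry,
        "name": name,
        "hash": tag,
        "build_id": f"{tag}_{retry}",  # assumes the first build succeeded
    }

uri = "community.wave.seqera.io/library/cutadapt_muscle_vsearch_wget_pruned:04f6c0370c0226c5"
info = parse_seqera_container_uri(uri)
# info["hash"] == "04f6c0370c0226c5", info["build_id"] == "04f6c0370c0226c5_1"
```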
Hope that makes sense! Shout if you spot anything missing / have suggestions for improvements.
Also I'd like to note: although `pip` will be easier to use directly now, I think that nf-core modules should still endeavour to use Bioconda where possible (it's a guideline after all). I think we should still contribute to the wider bioinformatics community where possible, and I think Bioconda gives a greater degree of review oversight and package quality.
I discussed this on the nf-core Slack for the container image you mentioned, and have already been discussing updating Bioconda with the Crabs tool authors.
@ewels
The full container URL in the PR you mention is `community.wave.seqera.io/library/cutadapt_muscle_vsearch_wget_pruned:04f6c0370c0226c5` - the part at the end is the hash, and by appending a `_1` we can get the build ID. From that you can find the build details webpage (if you know the URL structure).
I noticed this, however I think I have had a few instances when the build ID has a different number appended to the end besides `_1` - is there any clarification on this? Is it really always gonna be `_1`?
The plan looks great, @ewels 👍🏼
@stevekm trust you to spot that, I was hoping it'd fly under the radar 😂
Yes, you're totally right. The `_1` is the retry number. So most of the time the build works and it's `_1`, and every subsequent request will return that. But if the build fails for whatever reason, it may be re-attempted automatically or the next time that someone requests it. If it then succeeds, then this will be the build ID for the image in the registry and it could be `_2` (or `_n`).
We discussed this in the Wave dev team on Monday and suggested that we add a new API endpoint that can return the build ID for a given container - basically looking for all matching IDs and then returning the one with the highest `_n`, I think. But I'm not 100% sure if this will happen yet, and if so, when. So I was attempting to skirt around the issue until clarified 😅
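As a purely illustrative sketch of that lookup logic (not the real Wave API), picking the highest `_n` among matching build IDs:

```python
def latest_build_id(build_ids: list[str]) -> str:
    """Given build IDs like 'abc123_1', 'abc123_2' for the same image hash,
    return the one with the highest retry suffix (the successful build)."""
    return max(build_ids, key=lambda bid: int(bid.rsplit("_", 1)[1]))

# e.g. a failed first attempt followed by a successful retry:
ids = ["04f6c0370c0226c5_1", "04f6c0370c0226c5_2"]
# latest_build_id(ids) -> "04f6c0370c0226c5_2"
```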
NB: jobs can fail because of conda deps that change for subsequent requests. Or also simpler stuff like the build cluster scaling and cutting off the build mid-way.
We also recently had a ton of failed builds due to a bad pinned version of Perl, but I think someone should have already put in a PR for that.
First blog post about this now out: https://nf-co.re/blog/2024/seqera-containers-part-1
The second part will be more technical; working on that now.
Roll out:
Seqera Containers is a new service to provide Docker + Singularity containers from any Conda / PyPI packages. Images are generated on demand and can include multiple packages.
Links for background:
We should strip out the BioContainers quay.io Docker images + Galaxy server Singularity images and replace them with image URIs from Seqera Containers.
We should use this change as an opportunity to rethink the optimal code structure for defining image names. This is currently under discussion. Group consensus can be posted here once achieved for broader community approval.
Milestone to track broad progress on this update: https://github.com/nf-core/modules/milestone/6