ewels opened 5 months ago
It is still unclear to me that images generated on demand will not harm reproducibility.
Was this ever evaluated?
This will also increase the used CPU hours, right? Also, it feels a bit like Seqera will have a monopoly on containers. I am aware that Seqera did an awesome job developing Nextflow open source. But moving all container logic to Seqera (and I don't know the details here, maybe I am uninformed) leaves a weird taste.
What’s the purpose of enforcing this? Biocontainers are automatically generated for every Bioconda package, and get regenerated automatically upon a Bioconda software version bump. I’m not seeing the reason to then manually create a Seqera container.
Hi both - thanks for your comments. You're right that this issue precedes some community discussion that we still need to have. That started with the recent two bytesize talks and resulting conversations on Slack, but we should still open it up to wider input.
To address your concerns:
It is still unclear to me that images generated on demand will not harm reproducibility.
Wave generates images on demand, but Seqera Containers is a registry that sits behind Wave. The intention here is that the images are generated on demand by the developer when a package is updated - but then they are cached in the Seqera Containers registry. The image URIs will then be hardcoded into pipelines and the exact same container images will always be fetched by all users - just the same as they are today. We're also going to introduce conda-lock files (see https://github.com/nf-core/modules/issues/5835) so reproducibility should be even better than it is today.
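To make the hardcoding concrete: under this scheme a module pins the cached image by its full, immutable URI in the `container` directive. This is a hypothetical sketch (process name, tool and URI are all made up, not taken from any real module):

```groovy
// Hypothetical nf-core module sketch - the image URI below is illustrative only
process EXAMPLE_TOOL {
    // Hardcoded, immutable URI served from the Seqera Containers registry cache
    container 'community.wave.seqera.io/library/example_tool:1.2.3--0123456789abcdef'

    input:
    path reads

    script:
    """
    example_tool --input $reads
    """
}
```

Because the URI includes a content hash, every user who runs the pipeline pulls exactly the same cached image, just as with today's hardcoded BioContainers URIs.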
This will also increase the used CPU hours, right?
No - Wave / Seqera Containers handles the build server side. As mentioned above, the generated images are stored in a registry and simply downloaded. So just as today, native images will be downloaded. No increase in CPU hours.
Also, it feels a bit like Seqera will have a monopoly on containers. But moving all container logic to Seqera leaves a weird taste.
This one is more subjective. We will not make it a requirement to use Seqera Containers, just as we don't make it a requirement to use BioContainers today, so for me it feels about the same. We will keep the vast majority of build logic (eg. conda env files, conda lock files) on the nf-core side and will be free to reverse the decision at any point should we wish.
Biocontainers are automatically generated for every Bioconda package, and get regenerated automatically upon a Bioconda software version bump. I’m not seeing the reason to then manually create a Seqera container.
One of the main reasons for adopting Seqera Containers is that it'll have even more automation and less manual work than the current setup. The process will roughly be:
- `environment.yml` files created or edited in a PR
Note that this process will also work for multi-package containers, which is not currently the case with BioContainers (mulled images). So it should represent a significantly easier workflow.
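For reference, the kind of multi-package environment file this flow starts from might look like the following (the channels are the standard nf-core ones; the package pins are hypothetical examples):

```yaml
# Hypothetical environment.yml - package versions are illustrative only
channels:
  - conda-forge
  - bioconda
dependencies:
  - bioconda::samtools=1.20
  - bioconda::bcftools=1.20
```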
Note that although Seqera Containers has a web interface (https://seqera.io/containers/) it also works programmatically via CLI, API and Nextflow (eg. `nextflow inspect`). Check out the recent bytesize with YouTube recording to see all this in action.
BioContainers has been brilliant for nf-core, but there are several reasons to move away:
Wave and Seqera Containers have been built specifically for our community, based on our combined experience and needs. So hopefully we can mitigate / avoid these pitfalls.
I hope these responses help clear things up! Shout if you have any questions or concerns, and I'd recommend checking out the podcast and bytesize videos in the top comment as they go through how much of this works.
To provide a practical example of these points:
- The BioContainers base image is outdated, with old Docker Image Format v1 and manifest version 2
- We have limited / no control over image generation to solve any of the above issues
When opening a PR to chipseq I found the "docker image format v1" error, see here. To fix the error, I tried to bump to a newer version of the tool (`phantompeakqualtools`), but it turns out that the image's last available version was built in March 2021. Anyway, I tried to update the image on the module and found the same error again, see here.
In this case, the most straightforward fix would be to use the Wave version of the package, since this image will be compliant with the new Docker specifications. Otherwise, we will have to wait for Bioconda to update the images (I am not sure whether there is an established timeline for this), or do a dirty hack such as creating a mulled image with `phantompeakqualtools` and a random small package to trigger a new build.
As shown here, using Wave images fixes the issue above.
I just want to clarify a difference in case anyone missed it:
1. Using Wave to build our container images instead of the mulled-tools and BioContainers infrastructure.
2. Hosting our container images on Seqera Containers, which is a container registry.
For 1, I think it's not controversial. Wave is open source; we're not relying on Seqera. We're grateful they host the service, but we could host it ourselves if we needed to. Just like using Platform for megatests.
For 2, using Seqera Containers as a registry can seem controversial, but it's really not. Right now, we're relying on BioContainers to host our Singularity and Docker images (on quay.io). We've had issues with pushback from BioContainers on updates, and we've had uptime issues with quay.io.
If people felt more comfortable, we could point Wave at any time to use Docker Hub, quay.io, GitHub Containers, or host our own ECR (AWS's image registry). But from a practicality standpoint, that's not nf-core's main skill; we don't have the resources to waste on hosting our own registry.
The main takeaway: we need more flexibility in where our containers are hosted and in how they are built, which Wave gives us. We'd also like the better uptime that comes from using Seqera Containers as a registry.
This will make it easier for end users to move their containers to a private registry if they want to back them up:
```groovy
wave.build.repository      = 'quay.io/my/lab/repo'
wave.build.cacheRepository = 'quay.io/my/lab/cache-repo'
```
https://www.nextflow.io/docs/latest/wave.html#push-to-a-private-repository
As per the conversation on Slack: will this change be compatible with the AWS ECR "pull through cache"?
https://docs.aws.amazon.com/AmazonECR/latest/userguide/pull-through-cache.html
@ewels
The intention here is that the images are generated on demand by the developer when a package is updated - but then they are cached in the Seqera Containers registry. The image URIs will then be hardcoded into pipelines and the exact same container images will always be fetched by all users - just the same as they are today.
so, the containers held on the Seqera Containers Registry will be a drop-in replacement for whatever the current container is, for each module?
Wave containers seem to generally always have some lifecycle associated with them; that is not going to be the case for these containers right? If we need to come back in a year or two and re-pull some container from Seqera Container Registry, it will still be there?
also, on a side note, where are the build files and logs going to be stored for these containers? Are there Dockerfiles available somewhere? I guess that applies to the current biocontainers too
so, the containers held on the Seqera Containers Registry will be a drop-in replacement for whatever the current container is, for each module?
Correct.
Wave containers seem to generally always have some lifecycle associated with them; that is not going to be the case for these containers right? If we need to come back in a year or two and re-pull some container from Seqera Container Registry, it will still be there?
Exactly. That was the exact motivation for the project. Due to the registry cache, they will be there forever* (we're saying a minimum of 5 years from when they're built, but at present we have no intention of ever deleting any).
also, on a side note, where are the build files and logs going to be stored for these containers? Are there Dockerfiles available somewhere? I guess that applies to the current biocontainers too
Current BioContainers don't have Dockerfiles; they're built dynamically on CI. Seqera Containers do have Dockerfiles + conda files, which are stored with the build log. I'm not 100% sure that we're guaranteeing to store those for the same duration as the images, but I think that we are. I can check if it's a concern.
It might be nice for us to build some system to store those + security scan results / SBOM files somewhere in nf-core as a duplicate / backup. I'd certainly like to make them visible from the nf-core website module page as a minimum anyway.
In the same vein, I'm slightly uncomfortable with people already adding Wave containers to nf-core modules. I know that the Wave registries can only be pushed to from the Wave build system (right?) so there's no way someone can tamper with a container and ship a trojaned samtools. But for instance there is `cutadapt_muscle_vsearch_wget_pruned` in a recent PR. How can I verify that what's installed in the container is exactly what the person entered in `environment.yml`, which says there should be the pip "crabs" package too?
Great question - I'm just putting together a blog post covering much of this stuff, I will bulk up the part about this as it's important.
If the container URL starts with `community.wave.seqera.io` then it's specifically part of the "Seqera Containers" project. This is backed by Wave but it is a narrower subset that's restricted: it can only be built from a conda `environment.yml` file, and it can only be pushed to by Wave, never manually. Everything is fully automated from the `environment.yml` file, so there will be a full online audit log for the whole process, visible to everyone.

For example, take `community.wave.seqera.io/library/cutadapt_muscle_vsearch_wget_pruned:04f6c0370c0226c5` - the part at the end is the hash, and by appending a `_1` we can get the build ID. From that you can find the build details webpage (if you know the URL structure).
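To illustrate that URL structure (assuming the `name:hash` tag layout shown above, with `_n` appended to the hash to form the build ID), here is a small parsing sketch - not an official tool, just string handling:

```python
def parse_seqera_container_uri(uri: str, retry: int = 1) -> dict:
    """Split a community.wave.seqera.io URI into registry, image name,
    tag hash, and the derived build ID (hash plus retry suffix)."""
    registry_and_name, _, tag = uri.rpartition(":")
    registry, _, name = registry_and_name.partition("/library/")
    return {
        "registry": registry,
        "name": name,
        "hash": tag,
        "build_id": f"{tag}_{retry}",  # assumes the first build succeeded
    }

uri = "community.wave.seqera.io/library/cutadapt_muscle_vsearch_wget_pruned:04f6c0370c0226c5"
info = parse_seqera_container_uri(uri)
# info["hash"] == "04f6c0370c0226c5", info["build_id"] == "04f6c0370c0226c5_1"
```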
Hope that makes sense! Shout if you spot anything missing / have suggestions for improvements.
Also I'd like to note: although `pip` will be easier to use directly now, I think that nf-core modules should still endeavour to use Bioconda where possible (it's a guideline after all). I think we should still contribute to the wider bioinformatics community where possible, and I think Bioconda gives a greater degree of review oversight and package quality.
I discussed this on the nf-core Slack for the container image you mentioned, and have already been discussing updating Bioconda with the Crabs tool authors.
@ewels
The full container URL in the PR you mention is `community.wave.seqera.io/library/cutadapt_muscle_vsearch_wget_pruned:04f6c0370c0226c5` - the part at the end is the hash, and by appending a `_1` we can get the build ID. From that you can find the build details webpage (if you know the URL structure).
I noticed this, however I think I have had a few instances when the build ID has a different number appended to the end besides `_1` - is there any clarification on this? Is it really always gonna be `_1`?
The plan looks great, @ewels 👍🏼
@stevekm trust you to spot that, I was hoping it'd fly under the radar 😂
Yes, you're totally right. The `_1` is the retry number. So most of the time the build works and it's `_1`, and every subsequent request will return that. But if the build fails for whatever reason, it may be re-attempted automatically or the next time that someone requests it. If it then succeeds, then this will be the build ID for the image in the registry and it could be `_2` (or `_n`).
We discussed this in the Wave dev team on Monday and suggested that we add a new API endpoint that can return the build ID for a given container - basically looking for all matching IDs and then returning the one with the highest `_n`, I think. But I'm not 100% sure if this will happen yet, and if so, when. So I was attempting to skirt around the issue until clarified 😅
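As a purely illustrative sketch of that lookup logic (not the real Wave API), picking the highest `_n` among matching build IDs:

```python
def latest_build_id(build_ids: list[str]) -> str:
    """Given build IDs like 'abc123_1', 'abc123_2' for the same image hash,
    return the one with the highest retry suffix (the successful build)."""
    return max(build_ids, key=lambda bid: int(bid.rsplit("_", 1)[1]))

# e.g. a failed first attempt followed by a successful retry:
ids = ["04f6c0370c0226c5_1", "04f6c0370c0226c5_2"]
# latest_build_id(ids) -> "04f6c0370c0226c5_2"
```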
NB: jobs can fail because of conda deps that change for subsequent requests. Or also simpler stuff like the build cluster scaling and cutting off the build mid-way.
We also recently had a ton of failed builds due to a bad pinned version of Perl, but I think someone should have already put in a PR for that.
First blog post about this now out: https://nf-co.re/blog/2024/seqera-containers-part-1
The second part will be more technical; working on that now.
Roll out:
Seqera Containers is a new service to provide Docker + Singularity containers from any Conda / PyPI packages. Images are generated on demand and can include multiple packages.
Links for background:
We should strip out the BioContainers quay.io Docker images + Galaxy server Singularity images and replace them with image URIs from Seqera Containers.
We should use this change as an opportunity to rethink the optimal code structure for defining image names. This is currently under discussion. Group consensus can be posted here once achieved for broader community approval.
Milestone to track broad progress on this update: https://github.com/nf-core/modules/milestone/6