MatthiasZepper opened this issue 1 month ago
Great write up, thanks @MatthiasZepper!
Two minor notes:
- It's impossible to omit tags with Seqera containers. There is no default `latest`, so it'll just fail without one.
- It should be possible to have https downloads for singularity SIF files instead of `oras://` if we want, meaning we can preserve the current ability to download without needing singularity to be installed. I need to check the status on this.
- Also note that there are two things happening here: the first comments you linked to are for a PR that hasn't been merged. We have not yet adopted Seqera Containers, so it's not unexpected that nf-core/tools has not yet been updated. However, there is organic uptake of Seqera Containers by the community, putting them into old-style container declarations, which breaks stuff as you say. This is enough to make this a high-priority task, even without the upcoming planned wider adoption. So we should get on with this ASAP.
> It's impossible to omit tags with Seqera containers. There is no default `latest`, so it'll just fail without one.

That is not my problem; I actually think that the SHAs are really useful. The problem is that they now differ! Previously, a container declaration looked like this:
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://depot.galaxyproject.org/singularity/umi_tools:1.1.5--py39hf95cd2a_0' :
'biocontainers/umi_tools:1.1.5--py39hf95cd2a_0' }"
Because the `umi_tools:1.1.5--py39hf95cd2a_0` part was identical, it was easy to write a function `prioritize_direct_download()` that would kick out all Docker image definitions if a matching direct `https://` download was found. That was all that was needed:
```python
def prioritize_direct_download(self, container_list):
    d = {}
    for c in container_list:
        # Key on the part after the last slash, e.g. 'umi_tools:1.1.5--py39hf95cd2a_0'.
        # Overwrite an existing entry only if it is empty or does not start with
        # 'http', so direct https:// downloads always win over Docker URIs.
        if re.match(r"^$|(?!^http)", d.get(k := re.sub(".*/(.*)", "\\1", c), "")):
            log.debug(f"{c} matches and will be saved as {k}")
            d[k] = c
    return sorted(list(d.values()))
```
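For illustration, here is how that logic behaves as a standalone function (a minimal sketch; `self` and the logger dropped) when fed the old-style pair from above:

```python
import re

def prioritize_direct_download(container_list):
    # Standalone copy of the method above, for illustration only.
    d = {}
    for c in container_list:
        # Keep the https:// entry whenever both variants share the same key.
        if re.match(r"^$|(?!^http)", d.get(k := re.sub(".*/(.*)", "\\1", c), "")):
            d[k] = c
    return sorted(d.values())

containers = [
    "biocontainers/umi_tools:1.1.5--py39hf95cd2a_0",
    "https://depot.galaxyproject.org/singularity/umi_tools:1.1.5--py39hf95cd2a_0",
]
print(prioritize_direct_download(containers))
# ['https://depot.galaxyproject.org/singularity/umi_tools:1.1.5--py39hf95cd2a_0']
```

The result is the same regardless of input order, because an `https://` entry, once stored, is never overwritten.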
With the new Seqera Community Containers, the native Singularity image and the Docker image have different SHAs, because they are technically two different images:
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'oras://community.wave.seqera.io/library/samtools_pip_umi-tools:d8a86bb132628b96':
'community.wave.seqera.io/library/samtools_pip_umi-tools:9cfcc6ebc1d12072' }"
But that makes it a lot harder to determine which of the records should be given priority. Particularly when downloading multiple revisions of a pipeline at the same time, there are commonly multiple versions of the same tool in the raw list of container images that somehow need to be consolidated. This problem is anything but trivial, and maybe downloading both is the only way forward. (Alternatively, making an API request to Seqera Containers to obtain the alternative URI would of course be better.)
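To make the ambiguity concrete, here is a minimal sketch (a hypothetical helper, not part of nf-core/tools) that can only group Seqera URIs by image name, since the tags differ between the two builds:

```python
import re
from collections import defaultdict

def group_seqera_containers(container_list):
    # Group Seqera Containers URIs by image name, ignoring the tag, since the
    # native Singularity build and the Docker build of the same package set
    # carry different SHAs. Hypothetical helper, for illustration only.
    groups = defaultdict(list)
    for c in container_list:
        name = re.sub(r"^oras://", "", c).rsplit(":", 1)[0]
        groups[name].append(c)
    return dict(groups)

containers = [
    "oras://community.wave.seqera.io/library/samtools_pip_umi-tools:d8a86bb132628b96",
    "community.wave.seqera.io/library/samtools_pip_umi-tools:9cfcc6ebc1d12072",
]
# Both URIs end up under 'community.wave.seqera.io/library/samtools_pip_umi-tools'.
print(group_seqera_containers(containers))
```

If the same image name turns up with more than two tags (e.g. from multiple pipeline revisions), the grouping alone cannot pair them up, which is exactly why downloading all of them may be the only safe option without an API lookup.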
> It should be possible to have https downloads for singularity SIF files instead of `oras://` if we want, meaning we can preserve the current ability to download without needing singularity to be installed. I need to check the status on this.
I have not yet considered this aspect, actually. It is true that it is nice not to need Singularity when downloading, but de facto we can assume that it is available in most cases. For macOS, I wrote a virtual machine, so that works as well. I have no/few objections to pulling the images instead of using `https://` downloads; that is relatively simple to implement, if the priority and image SHA resolution is somehow solved.
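For reference, the two retrieval paths differ roughly like this (a sketch; the `singularity pull --name` invocation mirrors what nf-core/tools already uses for `docker://` conversions, and `oras://` support additionally requires a reasonably recent Singularity/Apptainer installation):

```python
import subprocess
import urllib.request

def fetch_image(uri: str, out_path: str) -> None:
    # Sketch: https:// images need no container engine, oras:// ones do.
    if uri.startswith("https://"):
        # Plain download of the pre-built SIF file.
        urllib.request.urlretrieve(uri, out_path)
    else:
        # 'singularity pull' understands the oras:// protocol natively.
        subprocess.run(["singularity", "pull", "--name", out_path, uri], check=True)
```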
> Also note that there are two things happening here: the first comments you linked to are for a PR that hasn't been merged. We have not yet adopted Seqera Containers, so it's not unexpected that nf-core/tools has not yet been updated. However, there is organic uptake of Seqera Containers by the community, putting them into old-style container declarations, which breaks stuff as you say.
Yep, but it is the very same people discussing the official proposal and approving the PRs for the "organic" updates to modules. Unfortunately, the offline capability of a pipeline is essentially an untestable feature, because I can't take a GitHub runner offline, even though I tried to catch some download issues with a test at the pipeline level. Once a pipeline is released, the code is unfortunately out in the wild, and therefore I wish that offline usability was not an afterthought of module or pipeline updates.
Writing up the blog post / migration guide for Seqera Containers adoption now. It should massively simplify the task for `nf-core download`. However, we will still need to handle old pipelines / container edge cases. I'm hoping that an updated `nextflow inspect` that can return all processes will be able to handle this, though. Ideally, this will mean that most of the complex logic that we have within `nf-core download` won't be needed.

Do you see any issues with this, @MatthiasZepper?
> Writing up the blog post / migration guide for Seqera Containers adoption now. It should massively simplify the task for `nf-core download`.
I haven't read your blog post yet, so it's hard to tell if I see issues with it - where do I find the draft? But I already commented on your POC drafts that I really like the idea of having a central YAML/JSON that is generated upon pipeline release and that contains all container URIs. That will indeed help tons, since one does not have to sift through all modules and configs to find container declarations.
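Something along these lines, presumably; a guess at the manifest shape, since the actual schema from the proposal may well differ:

```python
import json

# Hypothetical per-release container manifest; the field names are made up,
# the URIs are the Seqera Containers examples from above.
manifest = json.loads("""
{
  "processes": [
    {
      "name": "UMITOOLS_EXTRACT",
      "docker": "community.wave.seqera.io/library/samtools_pip_umi-tools:9cfcc6ebc1d12072",
      "singularity": "oras://community.wave.seqera.io/library/samtools_pip_umi-tools:d8a86bb132628b96"
    }
  ]
}
""")

# With both URIs listed side by side, nf-core download would no longer need
# to reconcile mismatched tags itself.
print([p["singularity"] for p in manifest["processes"]])
```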
> However, we will still need to handle old pipelines / container edge cases.
I am starting to lean towards telling people to use an older version of tools for that purpose and dropping the support entirely. If need be, one could probably retain the old code and switch to it if the requested pipeline revision does not yet have said YAML.
> I'm hoping that an updated `nextflow inspect` that can return all processes will be able to handle this though. Ideally, this will mean that most of the complex logic that we have within `nf-core download` won't be needed.
Yes, that would be much appreciated. I did not know that `inspect` is being developed in that direction, but yes: its adoption into download was mostly hampered by the issue of having to specify upfront how a particular pipeline will be run.

If it were also able to return the container URIs for a different platform architecture than the one it currently runs on, that would be the cherry on the cake (e.g. I execute it on a Mac with ARM64 architecture, but I would like to get the container URIs for an amd64 Linux pipeline execution). Since it considers the configs, I think that should be possible already... I should try that out.
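For context, the kind of invocation I have in mind (a sketch assuming the JSON report format of recent Nextflow versions; the exact options and output shape may differ):

```python
import json
import subprocess

# Run `nextflow inspect` on a pipeline and collect the resolved container per
# process. Assumes '-format json' and a report of the form {"processes": [...]}.
result = subprocess.run(
    ["nextflow", "inspect", "-format", "json", "-profile", "singularity", "main.nf"],
    capture_output=True, text=True, check=True,
)
report = json.loads(result.stdout)
containers = {p["name"]: p["container"] for p in report["processes"]}
print(containers)
```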
Description of the bug
While I got the impression that the adoption of Seqera Community Containers would be tightly linked to the new dynamically created container paths upon pipeline release, this does not hold true.
We now already have 70 modules (mainly local, but also some in the modules directory) which have no support by `nf-core (pipelines) download` and may not run offline. This shows mainly in two ways:

- The `oras://` protocol is simply ignored. Instead, the Docker version is converted, which takes much longer. Also, the converted image will not be found if the containerEngine is set to `'singularity'`, because the SHAs of the native Singularity image and the converted Docker image differ.
- Superfluous symlinks like `depot.galaxyproject.org-community.wave.seqera.io-library-python_numpy_pip_checkqc_interop-b5301d9801b8e66b.img` are created. Evidently, there is no use for this, and if a user copies the files with `cp -L`, a lot of unnecessary data duplication happens.

I would have preferred a synchronized and coordinated update, but since this stuff is now out in the wild, I reckon that `nf-core (pipelines) download` should be adapted such that:

- `gather_registries()` hardcodes `community.wave.seqera.io` similar to `depot.galaxyproject.org`, so it is not required to explicitly configure this registry, e.g. with the `--library`/`-l` argument.
- `singularity_image_filenames()` handles the filenames correctly. Mind the library subfolder, such that the resulting image name should be `community.wave.seqera.io-library-python_pip_samshee-84a770c9853c725d.img` for `oras://community.wave.seqera.io/library/python_pip_samshee:84a770c9853c725d` (see the sketch below this list).
- `prioritize_direct_download()` is updated (or a dedicated function is written for this purpose) to also prioritize `oras://` over the Docker images, and the corresponding test is adapted. Mind that the SHA sums differ and this needs to be accounted for: `oras://community.wave.seqera.io/library/python_pip_samshee:84a770c9853c725d` and `community.wave.seqera.io/library/python_pip_samshee:e8a5c47ec32efa42` are equivalent. Just ignoring the SHA sums, however, could prove risky, as there is then no way to ensure that the correct version of the tool is used. Mind that it is not uncommon for a pipeline to use tools in different versions within the same release. Therefore, downloading both is possibly the only way forward without an API request to Seqera Community Containers.
- `get_singularity_images()` sorts the `oras://` links to `containers_pull`. Ensure that the output path is correct and that the pulling works. Lastly, fix the test.
- `symlink_singularity_images()` handles the images correctly and trims the hard-coded registries prior to building the symlinks. Update the test.

Thx
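As a sketch of the filename mapping requested in the second point of the list above (a hypothetical helper, not the actual `singularity_image_filenames()` implementation):

```python
import re

def seqera_oras_filename(uri: str) -> str:
    # Derive the on-disk image name for an oras:// Seqera Containers URI,
    # keeping the registry and the 'library' subfolder in the name.
    stripped = re.sub(r"^oras://", "", uri)
    return stripped.replace("/", "-").replace(":", "-") + ".img"

assert (
    seqera_oras_filename("oras://community.wave.seqera.io/library/python_pip_samshee:84a770c9853c725d")
    == "community.wave.seqera.io-library-python_pip_samshee-84a770c9853c725d.img"
)
```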
Command used and terminal output
No response
System information
No response