seqeralabs / wave

On-demand containers provisioning service
https://seqera.io/wave/
GNU Affero General Public License v3.0

Offline support for Wave in Nextflow #323

Open ewels opened 1 year ago

ewels commented 1 year ago

Wave in Nextflow is beautifully simple - no need to define container URIs, just the conda package names and we get everything for free. However, for wide adoption (or at least, adoption in @nf-core), we need to support offline usage of pipelines.

For offline work, the process is typically as follows:

This hinges on Nextflow checking the local container cache (e.g. NXF_SINGULARITY_CACHE) for images before attempting to download them. Singularity container filenames are predictable, so it's easy for us to wrap download functionality into tooling like nf-core download and make sure the images are available.
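To make the "predictable filename" point concrete, here is a minimal sketch of how a tool could map a container URI to a file in the local cache and check for it before downloading. The naming rule used here (drop the scheme, replace `/` and `:` with `-`, append `.img`) is an assumption for illustration and may not match Nextflow's actual behaviour in every case; `cache_filename` and `find_cached_image` are hypothetical helper names.

```python
from pathlib import Path
from typing import Optional


def cache_filename(container_uri: str) -> str:
    """Approximate the cached image filename for a container URI.

    ASSUMPTION: scheme stripped, '/' and ':' flattened to '-', '.img' appended.
    """
    # Drop a scheme such as docker:// or https:// if one is present
    _, sep, rest = container_uri.partition("://")
    name = rest if sep else container_uri
    # Flatten path and tag separators so the result is a single filename
    return name.replace("/", "-").replace(":", "-") + ".img"


def find_cached_image(container_uri: str, cache_dir: str) -> Optional[Path]:
    """Return the cached image path if it exists locally, else None."""
    candidate = Path(cache_dir) / cache_filename(container_uri)
    return candidate if candidate.is_file() else None
```

The key property is that the filename depends only on the URI, so the lookup needs no network access. That is exactly what is lost when the URI itself has to be fetched from Wave.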

However, this assumption breaks with Wave. Currently, Nextflow needs to reach out to the Wave service (online) to find out the container URI and resulting local cache filename. So without an internet connection, it doesn't know where to check locally.

As I see it, we have two options:

edmundmiller commented 1 year ago

I think nextflow inspect does that:

$ nextflow inspect main.nf -profile local

{
    "processes": [
        {
            "name": "r2_CELL_CYCLE_SCORING_AND_PCA",
            "container": "wave.seqera.io/wt/4fc019059a1f/wave/build:create_objects--c32b27bc3124db00"
        },
        ...

So we just hook nextflow inspect into nf-core download. When they're running nf-core download, they should have an internet connection, right? Worst case, we export the containers on release and commit the JSON updates to the repos!

ewels commented 1 year ago

Yeah exactly, that's essentially my option 2 - fetch the container URIs at the point of download (or release) and have an associated config file that specifies the container URIs.
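As a rough sketch of what "an associated config file" could look like, the snippet below turns JSON shaped like the nextflow inspect output shown earlier in this thread into a Nextflow config with one `withName` selector per process. The function name `inspect_to_config` is hypothetical, and the sketch assumes every entry has `name` and `container` keys as in the sample above.

```python
import json


def inspect_to_config(inspect_json: str) -> str:
    """Render a Nextflow config that pins each process to its resolved container.

    ASSUMPTION: input has the shape {"processes": [{"name": ..., "container": ...}]}.
    """
    data = json.loads(inspect_json)
    lines = ["process {"]
    for proc in data.get("processes", []):
        # One selector per process, so the run needs no Wave lookup
        lines.append(
            f"    withName: '{proc['name']}' {{ container = '{proc['container']}' }}"
        )
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = json.dumps({
        "processes": [
            {
                "name": "r2_CELL_CYCLE_SCORING_AND_PCA",
                "container": "wave.seqera.io/wt/4fc019059a1f/wave/build:create_objects--c32b27bc3124db00",
            }
        ]
    })
    print(inspect_to_config(sample))
```

It is worth checking whether your Nextflow version's inspect command can emit config directly (the documentation describes a `-format` option), in which case a helper like this would be unnecessary.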

It basically means that offline users won't be using Wave at all, it's just a regular Nextflow run with containers as usual, but maybe this is the best solution.. My main issue with it is that it forces people to use nf-core download.

pditommaso commented 1 year ago

I'm inclined towards option 2 too. The nextflow inspect command was designed with this possibility in mind.

edmundmiller commented 1 year ago

It basically means that offline users won't be using Wave at all, it's just a regular Nextflow run with containers as usual, but maybe this is the best solution.. My main issue with it is that it forces people to use nf-core download.

Would users need to use Wave at all, besides checking whether an image has been created? I was having an issue where it was returning the image name before the image had even been built (i.e. quay.io/nf-core/modules/bowtie:bowtie-1.3.0_samtools-1.16.1--82705d624eee2198). So it should be able to go out and look for that image (I'm guessing right now it's authenticating with the repo through Tower Platform).

But if we could tweak the behavior slightly (it might already be this):

  1. Check if the image repo is public.
  2. If the repo is private, auth through platform, and then try to download.

edmundmiller commented 5 months ago

What if we ran nextflow inspect in CI in the pipelines on release, and generated a containers.json from it?

Not every commit would be reproducible offline, but the releases would be downloadable with nf-core download.

I think that's a good compromise. It would vastly simplify the container downloading logic in nf-core download.
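A release-time CI step along these lines could be sketched as follows: take the parsed nextflow inspect output, collect the unique container URIs, and write them to containers.json. The file layout (a single `containers` list) and the helper names are assumptions for illustration, not an agreed nf-core format.

```python
import json
from typing import Dict, List


def collect_containers(inspect_output: Dict) -> List[str]:
    """Return the sorted, de-duplicated container URIs from inspect output."""
    return sorted(
        {
            proc["container"]
            for proc in inspect_output.get("processes", [])
            if proc.get("container")  # skip processes without a container
        }
    )


def write_containers_json(inspect_output: Dict, path: str) -> None:
    """Write the container list to `path` (hypothetical containers.json layout)."""
    with open(path, "w") as fh:
        json.dump({"containers": collect_containers(inspect_output)}, fh, indent=2)
        fh.write("\n")
```

Sorting and de-duplicating keeps the committed file stable between releases, so diffs only show containers that actually changed.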

edmundmiller commented 5 months ago

https://github.com/seqeralabs/nf-aggregate/pull/43 Basically this 😆