nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Add `nextflow modules` command to install remote modules #4112

Open edmundmiller opened 1 year ago

edmundmiller commented 1 year ago

New feature

include { BOWTIE_ALIGN } from "github:nf-core/modules/modules/bowtie/align@somehashabcd"

Go:

import "github.com/google/uuid"

Snakemake:

github("owner/repo", path="workflow/Snakefile", tag="v1.0.0")

Deno:

import chalk from "npm:chalk@5";

Groovy (Grapes):

@Grab(group = 'com.puravida-software.groogle', module = 'groogle-sheet', version = '3.0.0')
@GrabExclude(group = 'org.codehaus.groovy', module = '*')

Example

Usage scenario

Baked-in sharing and version control of modules, instead of relying on tools like nf-core to manage updating and sharing them.

Suggest implementation

That's above my pay grade. I'd imagine it would work very similarly to plugins: check before the pipeline starts that all the modules are there, then store them in .nextflow/modules or somewhere.

I guess that wouldn't be too hard.

  1. Find all the remote file requests
  2. Check for them in .nextflow/modules
  3. If the modules are not there, download them
  4. Nextflow replaces the lines behind the scenes linking to .nextflow/modules/nf-core/modules/modules/bowtie/align/main.nf
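The four steps above can be sketched roughly as follows. This is only an illustration in Python, not how Nextflow works: the `github:<owner>/<repo>/<path>@<revision>` syntax is taken from the example include line, the cache layout follows the path in step 4, and the function names (`find_remote_includes`, `missing_modules`, `rewrite_includes`) are hypothetical. The download in step 3 is left out.

```python
import re
from pathlib import Path

# Hypothetical spec syntax, taken from the example include line:
#   include { NAME } from "github:<owner>/<repo>/<path>@<revision>"
INCLUDE_RE = re.compile(
    r'''(include\s*\{[^}]*\}\s*from\s*)(["'])github:([^"'@]+)@([^"']+)\2'''
)


def local_path(spec_path: str, cache: Path) -> Path:
    """Map '<owner>/<repo>/<path>' to '<cache>/<owner>/<repo>/<path>/main.nf'."""
    return cache / spec_path / "main.nf"


def find_remote_includes(script: str) -> list[tuple[str, str]]:
    """Step 1: list (spec_path, revision) for every remote include."""
    return [(m.group(3), m.group(4)) for m in INCLUDE_RE.finditer(script)]


def missing_modules(script: str, cache: Path) -> list[tuple[str, str]]:
    """Step 2: keep only the modules that are not in the cache yet."""
    return [
        (path, rev)
        for path, rev in find_remote_includes(script)
        if not local_path(path, cache).exists()
    ]


def rewrite_includes(script: str, cache: Path) -> str:
    """Step 4: point each remote include at its cached copy."""
    def repl(m: re.Match) -> str:
        target = local_path(m.group(3), cache).as_posix()
        return f"{m.group(1)}{m.group(2)}{target}{m.group(2)}"

    return INCLUDE_RE.sub(repl, script)
```

Note that this layout ignores the revision, exactly as the path in step 4 does, so two revisions of the same module would collide in the cache.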

Same thing as plugins and downloading pipelines from repos, I think.

adamrtalbot commented 1 year ago

Given how successful these have been in nf-core, I think this is worth considering.

I'd be tempted to store the cache under ${projectDir}/.nextflow/modules so that it could be checked into version control in one repo, similar to how nf-core does it. This would mean users could just pull one repo to run a pipeline and wouldn't be reliant on additional calls out to the internet (e.g. if the remote disappeared). The downside is that they would have 38 duplicates of SAMTOOLS_SORT, and it's more elegant to have a central storage place. Perhaps an additional command for fixing the module by copying it into the project directory could be included.

edmundmiller commented 1 year ago

Supporting patching like nf-core would be nice as well.

bentsherman commented 1 year ago

I've thought about creating a module registry (similar to the plugin registry) that would encode this exact information -- repo, path, commit hash -- and would allow for a broader ecosystem of modules that don't have to meet nf-core standards. I imagined that the nextflow CLI would have a modules command that would basically work like nf-core modules, allowing you to pull modules into your pipeline.

Being able to pull the modules at runtime would be convenient, but the challenge is maintaining a local cache. You could probably cover 95% of modules just by downloading nf-core/modules, but how do you manage different revisions of the same repo being pulled? Using bare git repos (#2870) should make this idea somewhat easier.
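One possible answer to the revision question, purely as a sketch, is to key the cache on both the repo and the revision so that two revisions never share a directory. The layout, the default root and the names `cache_dir` and `module_file` below are assumptions for illustration, and the revision is assumed to be a commit hash (no slashes).

```python
from pathlib import Path

# Assumed default cache root; not an existing Nextflow setting.
DEFAULT_ROOT = Path(".nextflow/modules")


def cache_dir(repo: str, revision: str, root: Path = DEFAULT_ROOT) -> Path:
    """Directory holding one revision of one repo, e.g. 'nf-core/modules' at 'abc123'."""
    return root / repo / revision


def module_file(repo: str, revision: str, path: str,
                root: Path = DEFAULT_ROOT) -> Path:
    """Location of a module script inside the revision-specific directory."""
    return cache_dir(repo, revision, root) / path / "main.nf"
```

The cost of this layout is duplication: every pinned revision keeps its own copy of the files, which is the kind of overhead that bare git repos might avoid.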

However, due to my experience with npm, I'm hesitant to create yet another package manager with recursive dependencies and its associated horrors. The nf-core modules approach forces the developer to think through their entire module hierarchy up front. It's more tedious, but I wonder whether the convenience of loading modules at runtime will outweigh the additional complexity that it would create.

A simple middle-ground could be to just add the nextflow modules command, which works just like nf-core modules, but allows you to specify a repo+path+hash instead of an nf-core/module. You still have to maintain the module in your pipeline code, but you aren't limited to nf-core anymore. No need for a module registry, and maybe from there it would be easier to assess the extra usefulness of loading modules at runtime.

pditommaso commented 1 year ago

Agree with Ben. Also, adding import from Git repos would create horrible problems with non-public repositories (how to authenticate them?).

Another package manager that was proposed for Nextflow was WFPM, though I'm not sure about the current status of the project.

adamrtalbot commented 1 year ago

> A simple middle-ground could be to just add the nextflow modules command, which works just like nf-core modules, but allows you to specify a repo+path+hash instead of an nf-core/module. You still have to maintain the module in your pipeline code, but you aren't limited to nf-core anymore. No need for a module registry, and maybe from there it would be easier to assess the extra usefulness of loading modules at runtime.

This is my preferred route. It's the most lightweight solution that is still useful, and it sidesteps some of the problems (authentication, running offline, modifying the imported module). We know the nf-core system works reasonably well, but it would be good to extend it beyond the strict nf-core structure.

To make it more user friendly, you could perform the import based on the Nextflow files. From Edmund's first line, the following would be equivalent:

nextflow modules import BOWTIE_ALIGN 'github:nf-core/modules/modules/bowtie/align@somehashabcd'

Or...

nextflow modules import
# Where main.nf includes:
include { BOWTIE_ALIGN } from "github:nf-core/modules/modules/bowtie/align@somehashabcd"

edmundmiller commented 1 year ago

> The nf-core modules approach forces the developer to think through their entire module hierarchy up front.

Completely agree. I think this is just moving the modules from nf-core.yml directly into the main.nf itself.

Subworkflows depending on other modules is another story...

> (how to authenticate them?)

This may be naive, but I was imagining it would hook into providers.

bentsherman commented 1 year ago

I discovered today that this feature request has a long history. Notable issues include #1463 and https://github.com/nf-core/modules/issues/8 .

Several approaches have been considered, including git submodules (and subtree?), the WFPM, and piggy-backing on an existing package manager like npm or conda. I was interested to see that many people were on board to use conda, but the nf-core modules approach won out over everything else because it just worked and was simpler.

I still think the best solution is to just incorporate nf-core modules into Nextflow; I just wanted to add the history of this discussion for posterity.

As for this shorthand:

include { BOWTIE_ALIGN } from "github:nf-core/modules/modules/bowtie/align@somehashabcd"

I remain reluctant to implement something like this. I know that we can do it (including the authentication piece), but the question is whether we should.

Importing remote modules directly would remove the need to have a separate modules.json file to keep track of which modules were installed. But, it would open up several worm-filled cans.

You could have different versions of the same tool installed under different subworkflows.

Offline usage, and anything that requires inspecting the module dependency tree, would be more complicated. We would need another Nextflow command to parse the pipeline script and recursively fetch the remote modules, just to figure out which modules are required. I think I prefer having all of the module dependencies listed in one place, even if it requires another file.
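To make the concern concrete, here is a rough sketch of what such a command would have to do: walk the include statements recursively and record every revision requested for each module. It is an illustration only. The `github:...@rev` syntax is the hypothetical shorthand from above, and `fetch` is a stand-in callable that returns the text of a module script (in practice that would mean a network call per module).

```python
import re
from collections import defaultdict
from typing import Callable

# Matches the hypothetical remote include shorthand discussed in this thread.
REMOTE_RE = re.compile(r'''from\s*["']github:([^"'@]+)@([^"']+)["']''')


def collect_modules(entry_script: str,
                    fetch: Callable[[str, str], str]) -> dict[str, set[str]]:
    """Return {module_path: {revisions}} for everything reachable from entry_script."""
    seen: set[tuple[str, str]] = set()
    revisions: dict[str, set[str]] = defaultdict(set)
    stack = [entry_script]
    while stack:
        script = stack.pop()
        for path, rev in REMOTE_RE.findall(script):
            revisions[path].add(rev)
            if (path, rev) not in seen:
                seen.add((path, rev))
                # Each fetched script may itself include further remote modules.
                stack.append(fetch(path, rev))
    return dict(revisions)


def conflicts(revisions: dict[str, set[str]]) -> dict[str, set[str]]:
    """Modules requested at more than one revision."""
    return {path: revs for path, revs in revisions.items() if len(revs) > 1}
```

Even in this toy form, the full module list is only known after every script has been fetched and parsed, whereas a single manifest file gives the same answer without touching the network.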

pditommaso commented 1 year ago

I agree with Ben's analysis. Remote modules are not supported by design.

bentsherman commented 1 year ago

However, we are still discussing the idea of a nextflow modules command to allow importing remote modules during development.

ewels commented 1 year ago

Guys, you know that nf-core modules works with non-nf-core repos / modules, right? There's an -R flag to specify the base repo. We rewrote a bunch of it not long ago to use a git clone approach in the back end to fetch the repos, so it should now work with any git provider too (at first it only worked with GitHub.com). We have a CI test that imports from https://gitlab.com/nf-core/modules-test for example.

This could do with better documentation (as does everything always) and probably a better interface. I'm hoping to introduce an optional TUI for all nf-core commands before long, which should help a lot.

ewels commented 1 year ago

FWIW, I really like the approach that we settled on. Nextflow pipelines are not equivalent to other coding languages and have different requirements (e.g. patching, as mentioned). I'm not convinced that keeping modules code outside of a repo would bring much in the way of practical advantages.

ewels commented 1 year ago

Final thought: I'm not averse to moving nf-core modules functionality into a new nextflow modules command / commands. Indeed this fits with my general aim of incubating features within nf-core for later adoption within the wider community.

If this happens, I think we can probably keep the nf-core command with whatever eye-candy it has, and just remove the back-end functional code to replace it with an external call to Nextflow.

bentsherman commented 1 year ago

Indeed, I didn't know that about nf-core modules until I read through the old github issues. So we would literally just be porting it to Groovy and adding it as a Nextflow command.

I also see nf-core as a good incubator for possible Nextflow features, and this one seems about ready to hatch. It shouldn't be too hard to implement as Nextflow already has good infra for interacting with git repos, although I'm curious to look into the guts of nf-core modules patch...

ewels commented 1 year ago

What about nf-core subworkflows? The two share a lot of code / concepts. I think that they need to move together.

I must confess to being a little terrified at the prospect of moving them 😆 They are deeply integrated into the nf-core codebase, and quite finicky. Moving them will also mean rewriting a bunch of the nf-core code. I think we should think quite hard about the advantages that moving them would bring before rushing into anything 👀 We also need to be completely happy with the current setup of using a modules.json file to track active modules, as well as how subworkflows work. This has still been under fairly active development recently.

bentsherman commented 1 year ago

Nextflow does not have as strict a definition as nf-core regarding modules. To Nextflow, a module is just a script that may contain any number of processes, workflows, and functions. Whereas nf-core seems to enforce one process per module, workflows as separate, and AFAIK no concept of custom functions in modules.

Ideally, the nextflow modules command shouldn't much care about what's in a module. Just give it a remote git path to a module script and it will install that script into your repo. You can then include whichever individual components you want from that module.

I prefer Nextflow's definition of a module, although I see the utility of keeping subworkflows separate. Maybe we could support this convention by allowing the user to change the target path in their repo. By default it would be ./modules, but they could set it to e.g. subworkflows when they install a subworkflow.
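As a minimal sketch of the configurable target path, assuming the remote repo has already been checked out locally: the function below copies one module script into the pipeline repo, defaulting to `modules` but accepting another target such as `subworkflows`. The name `install_module` and its parameters are hypothetical, and the file content is a placeholder rather than a real module.

```python
import shutil
import tempfile
from pathlib import Path


def install_module(checkout: Path, module_path: str,
                   pipeline_dir: Path, target: str = "modules") -> Path:
    """Copy one module script from a local checkout into <pipeline_dir>/<target>/."""
    src = checkout / module_path
    dest = pipeline_dir / target / src.name
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)
    return dest


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    checkout = Path(tmp) / "checkout"
    pipeline = Path(tmp) / "pipeline"
    script = checkout / "tools" / "align.nf"
    script.parent.mkdir(parents=True)
    script.write_text("// placeholder module script\n")

    default_dest = install_module(checkout, "tools/align.nf", pipeline)
    custom_dest = install_module(checkout, "tools/align.nf", pipeline,
                                 target="subworkflows")

    default_rel = default_dest.relative_to(pipeline).as_posix()
    custom_rel = custom_dest.relative_to(pipeline).as_posix()
    copied_text = default_dest.read_text()
```

Because the sketch flattens the script into the target directory by file name, two modules with the same file name would overwrite each other; a real command would need a naming scheme to avoid that.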

pditommaso commented 1 year ago

I don't think nextflow should care about managing modules. nf-core tooling is already doing well, and there are also plenty of other tools. Let's focus on core features.

edmundmiller commented 1 year ago

> I don't think nextflow should care about managing modules. nf-core tooling is already doing well, and there are also plenty of other tools. Let's focus on core features.

I think that's a good point.

Maybe this is going in the direction of nf-validation, where it's more about taking nf-core functionality and turning it into a general-use plugin that hooks into Nextflow instead of external tools.

This probably should have been in the discussions, so I apologize; my reflex is just to make an issue! 😬

adamrtalbot commented 1 year ago

> Maybe this is going in the direction of nf-validation, where it's more about taking nf-core functionality and turning it into a general-use plugin that hooks into Nextflow instead of external tools.

This sounds like the best solution so far.

ewels commented 1 year ago

Are we talking about pulling into an ephemeral cache at run time (end user), or copying into the pipeline source code directory (developer)? I think both have been mentioned in this thread.

I think a plugin makes sense if doing this at run time. But I'm not 100% clear on what the advantage of doing this is tbh..?

adamrtalbot commented 1 year ago

> or copying into the pipeline source code directory (developer)?

my preferred method.

The main advantage of course, is for everyone who doesn't like nf-core 😆

ewels commented 1 year ago

Has anyone tried nf-core modules with non-nf-core modules? I think it should be fairly agnostic already. We collect the following info in modules.json:

The rest is manual. So I think that the only convention it should* require is - a git repo, with a file called main.nf in a directory somewhere.

* Untested

adamrtalbot commented 1 year ago

I can't see anything about importing from a different repo?

nf-core modules install --help

                                          ,--./,-.
          ___     __   __   __   ___     /,-._.--~\
    |\ | |__  __ /  ` /  \ |__) |__         }  {
    | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                          `._,._,'

    nf-core/tools version 2.9 - https://nf-co.re

 Usage: nf-core modules install [OPTIONS] <tool> or <tool/subtool>                                  

 Install DSL2 modules within a pipeline.                                                            
 Fetches and installs module files from a remote repo e.g. nf-core/modules.                         

╭─ Options ────────────────────────────────────────────────────────────────────────────────────────╮
│ --dir     -d  PATH          Pipeline directory. [default: current working directory]             │
│ --prompt  -p                Prompt for the version of the module                                 │
│ --force   -f                Force reinstallation of module if it already exists                  │
│ --sha     -s  <commit sha>  Install module at commit SHA                                         │
│ --help    -h                Show this message and exit.                                          │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯

adamrtalbot commented 1 year ago

OK worked it out and it doesn't work unless you follow the nf-core structure:

> nf-core modules --git-remote https://github.com/genepi/nf-gwas list remote

                                          ,--./,-.
          ___     __   __   __   ___     /,-._.--~\
    |\ | |__  __ /  ` /  \ |__) |__         }  {
    | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                          `._,._,'

    nf-core/tools version 2.9 - https://nf-co.re

CRITICAL 'org_path' key not present in .nf-core.yml 
> nf-core modules --git-remote https://github.com/epi2me-labs/wf-basecalling list remote

                                          ,--./,-.
          ___     __   __   __   ___     /,-._.--~\
    |\ | |__  __ /  ` /  \ |__) |__         }  {
    | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                          `._,._,'

    nf-core/tools version 2.9 - https://nf-co.re

CRITICAL 'org_path' key not present in .nf-core.yml

ewels commented 1 year ago

Please make an issue 😊

ewels commented 6 months ago

Reposting interesting link from @jordeu about how Go has the best of both worlds with "vendoring": https://mahmoudaljadan.medium.com/go-modules-and-vendors-simplify-dependency-management-in-your-golang-project-a29689eb26b1

pditommaso commented 6 months ago