[proposal] Add local module named configuration template

pditommaso commented 1 year ago

Context

The use of DSL2 introduces the use of module components that allow the inclusion and re-use a process more than one time with different names.

A common requirement in this scenario is that a process configuration needs to be customised, for example, to provide a specific publishDir path or command arguments.

Currently, this need is addressed by creating a monolithic configuration file in which the process configuration is provided by using the target process name. For example here

https://github.com/nf-core/rnaseq/blob/master/conf/modules.config#L174-L192

In some case, there's a conditional logic that needs to be duplicated in the workflow script. For example here:

https://github.com/nf-core/rnaseq/blob/master/conf/modules.config#L209-L224

This approach has several disadvantages:

Monolithic configuration file
The process configuration is out of the context from the workflow code
Very fragile; it's enough a typo in the config file to break the process configuration
The use of monolithic breaks the workflow modularization
Very hard to maintain.

Solution

A possible solution could be to introduce a module-level configuration file that allows the definition of named configuration template that can applied to one or more processes in the same module.

For example the following configuration file could be added in a workflow module:

config.FOO.ext.args = "--x one --y two" 
config.FOO.publishDir = [path: "/some/path"]

config.BAR.ext.args = "--x three --y four" 
config.BAR.publishDir = [path: "/other/path/${params.suffix}"]

Note: in the example above, FOO and BAR are arbitrary identifiers the user chooses. They have no relationship with the process name where they are meant to be used.

In the workflow module, they could be used using the following syntax:

include { PROC1 } from './module/path1'
include { PROC2 } from './module/path2'

workflow NAME {
  PROC1(config.FOO)
  PROC2(config.BAR) 
}

Note: explorative syntax, it may change

In the above example, the process PROC1 is executed having the config setting defined by the template with name FOO is applied. Same with with process PROC2 with config BAR.

Conditional configuration could be defined using usual if and expression statement e.g.

workflow NAME {
  def cfg = params.something ? config.FOO : config.BAR
  PROC1(cfg)
}

This approach has several advantages:

keep the config settings in the nextflow config file, aside the module where it's expected to be used
promote a declarative approach for configuration files
decouple the configuration from the target process name and therefore, increase reusability
Avoid the verbosity of withNameselector mechanism

Caveats

Named config templates should only accessible in the module where they are defined
A conventional config file should used e.g. module.config or main.config
The withName selector should be able to override settings defined by a config template (to not break existing config e.g. groundswell)

pditommaso commented 1 year ago

See also: #4205

Additional notes from hackathon:

need to evaluate process selectors vs new proposed syntax for module config
nf-core modules will only come with a test.config to be loaded by the test harness like a normal config, not used in real pipelines
pipeline developers will install a module and add a nextflow.config / module.config to each module with pipeline-specific config
process modules will not have module config, only subworkflows and above, because they are invoking the processes
module config can be overwritten by outer subworkflows, then pipeline-level config, then user config, then CLI options

I don't like that the proposed syntax (which I'll call the "programmatic config") requires you to create a new name and reference it in the pipeline script, but I see why you did it, so that the config can be set programmatically in the pipeline code.

The question is whether this programmatic config is needed. All of the if statements in the modules.config are used only to replicate the workflow logic to target a specific process, but they are eliminated if the config is defined in the module scope, even with conventional process selectors.

My impression from talking to developers at the hackathon is that they like process selectors but they just need a way to use them in the context of a module, to ensure that the selector won't be applied to completely different processes which they didn't intend.

Both process selectors and the programmatic config should work equally well, but the former is more familiar while the latter imposes some extra steps with dubious value.

For example, suppose workflow A calls subworkflow B which calls process C, and A wants to apply some config to C (taking precedence over config from B). By nf-core convention, each component will reside in its own module. The programmatic config would look like this:

// A/module.config
CONFIG_C.ext.args = '--foo'

// A/main.nf
workflow A {
  B(CONFIG_C)
}

// B/module.config
CONFIG_C.ext.args = '--bar'
CONFIG_C.publishDir = params.outdir

// B/main.nf
workflow B {
  take:
  config_c_extra
  main:
  C.config = CONFIG_C + config_c_extra
  C()
}

// C/main.nf
process C {
}

This strikes me as more complicated than it needs to be. The user is basically re-implementing the logic of process selectors in their pipeline code. While this programmability enables things that are not possible with (declarative) process selectors, I don't think uses are asking for those things.

As a baseline, I will first implement the module config with process selectors. If it enables the config to be fully declarative then that should be enough. Perhaps then we can also try the programmatic config and see what users prefer.

bentsherman commented 1 year ago

Rewriting the above example with process selectors:

// A/module.config
withName:'A:B:C' {
  ext.args = '--foo'
}

// A/main.nf
workflow A {
  B()
}

// B/module.config
withName:'B:C' {
  ext.args = '--bar'
  publishDir = params.outdir
}

// B/main.nf
workflow B {
  C()
}

// C/main.nf
process C {
}

Much simpler!

pditommaso commented 1 year ago

Ok, we talked. Let's go ahead for the first iteration using standard process selector syntax applied to process *included* in the local module.

bentsherman commented 12 months ago

Okay, the PR and rnaseq POC are working quite well. I was able to remove all of the if statements without triggering any config selector warnings, even for processes that aren't always used. This is because the module config is not applied to the process until the process is actually invoked, so there will never be any hanging configs. On the other hand, if you misspell a process selector then it should still warn you.

One minor detail is that the module config file name must always be the same, currently it is module.config. This is because Nextflow loads the module config by convention, the module script does not reference the config file directly.

As an alternative, we could do something like includeConfig for module scripts, which would allow the module script to explicitly import a config file by name:

// main.nf
includeConfig 'module.config'

process FOO {
  // ...
}

This way, users could name the config file whatever they want. On the other hand, it wouldn't make sense to support both approaches, so if we went this way then users would just have to specify the same include line of code many times. So I think I would rather not do this.

ewels commented 9 months ago

Related: made an issue about reducing config duplication of publishDir: https://github.com/nextflow-io/nextflow/issues/4661

bentsherman commented 6 months ago

Closing, see https://github.com/nextflow-io/nextflow/pull/4510#issuecomment-2071409697

nextflow-io / nextflow