nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0
2.76k stars 629 forks source link

[proposal] Add local module named configuration template #4422

Closed pditommaso closed 6 months ago

pditommaso commented 1 year ago

Context

The use of DSL2 introduces the use of module components that allow the inclusion and re-use a process more than one time with different names.

A common requirement in this scenario is that a process configuration needs to be customised, for example, to provide a specific publishDir path or command arguments.

Currently, this need is addressed by creating a monolithic configuration file in which the process configuration is provided by using the target process name. For example here

https://github.com/nf-core/rnaseq/blob/master/conf/modules.config#L174-L192

In some case, there's a conditional logic that needs to be duplicated in the workflow script. For example here:

https://github.com/nf-core/rnaseq/blob/master/conf/modules.config#L209-L224

This approach has several disadvantages:

Solution

A possible solution could be to introduce a module-level configuration file that allows the definition of named configuration template that can applied to one or more processes in the same module.

For example the following configuration file could be added in a workflow module:

config.FOO.ext.args = "--x one --y two" 
config.FOO.publishDir = [path: "/some/path"]

config.BAR.ext.args = "--x three --y four" 
config.BAR.publishDir = [path: "/other/path/${params.suffix}"]

Note: in the example above, FOO and BAR are arbitrary identifiers the user chooses. They have no relationship with the process name where they are meant to be used.

In the workflow module, they could be used using the following syntax:

include { PROC1 } from './module/path1'
include { PROC2 } from './module/path2'

workflow NAME {
  PROC1(config.FOO)
  PROC2(config.BAR) 
}

Note: explorative syntax, it may change

In the above example, the process PROC1 is executed having the config setting defined by the template with name FOO is applied. Same with with process PROC2 with config BAR.

Conditional configuration could be defined using usual if and expression statement e.g.

workflow NAME {
  def cfg = params.something ? config.FOO : config.BAR
  PROC1(cfg)
}

This approach has several advantages:

Caveats

pditommaso commented 1 year ago

See also #4322

pditommaso commented 1 year ago

Tagging @bentsherman @drpatelh

pditommaso commented 1 year ago

Possible implementation:

  1. Load the module.config when parsing the module including
  2. The module config should the parsed with ConfigParser, adding the resolved params in the evaluation context
  3. Add the module config object in the ScriptMeta
  4. Apply the config if the setting are no otherwise specified here
bentsherman commented 1 year ago

See also: #4205

Additional notes from hackathon:

I don't like that the proposed syntax (which I'll call the "programmatic config") requires you to create a new name and reference it in the pipeline script, but I see why you did it, so that the config can be set programmatically in the pipeline code.

The question is whether this programmatic config is needed. All of the if statements in the modules.config are used only to replicate the workflow logic to target a specific process, but they are eliminated if the config is defined in the module scope, even with conventional process selectors.

My impression from talking to developers at the hackathon is that they like process selectors but they just need a way to use them in the context of a module, to ensure that the selector won't be applied to completely different processes which they didn't intend.

Both process selectors and the programmatic config should work equally well, but the former is more familiar while the latter imposes some extra steps with dubious value.

For example, suppose workflow A calls subworkflow B which calls process C, and A wants to apply some config to C (taking precedence over config from B). By nf-core convention, each component will reside in its own module. The programmatic config would look like this:

// A/module.config
CONFIG_C.ext.args = '--foo'

// A/main.nf
workflow A {
  B(CONFIG_C)
}

// B/module.config
CONFIG_C.ext.args = '--bar'
CONFIG_C.publishDir = params.outdir

// B/main.nf
workflow B {
  take:
  config_c_extra
  main:
  C.config = CONFIG_C + config_c_extra
  C()
}

// C/main.nf
process C {
}

This strikes me as more complicated than it needs to be. The user is basically re-implementing the logic of process selectors in their pipeline code. While this programmability enables things that are not possible with (declarative) process selectors, I don't think uses are asking for those things.

As a baseline, I will first implement the module config with process selectors. If it enables the config to be fully declarative then that should be enough. Perhaps then we can also try the programmatic config and see what users prefer.

bentsherman commented 1 year ago

Rewriting the above example with process selectors:

// A/module.config
withName:'A:B:C' {
  ext.args = '--foo'
}

// A/main.nf
workflow A {
  B()
}

// B/module.config
withName:'B:C' {
  ext.args = '--bar'
  publishDir = params.outdir
}

// B/main.nf
workflow B {
  C()
}

// C/main.nf
process C {
}

Much simpler!

pditommaso commented 1 year ago

Ok, we talked. Let's go ahead for the first iteration using standard process selector syntax applied to process *included* in the local module.

bentsherman commented 12 months ago

Okay, the PR and rnaseq POC are working quite well. I was able to remove all of the if statements without triggering any config selector warnings, even for processes that aren't always used. This is because the module config is not applied to the process until the process is actually invoked, so there will never be any hanging configs. On the other hand, if you misspell a process selector then it should still warn you.

One minor detail is that the module config file name must always be the same, currently it is module.config. This is because Nextflow loads the module config by convention, the module script does not reference the config file directly.

As an alternative, we could do something like includeConfig for module scripts, which would allow the module script to explicitly import a config file by name:

// main.nf
includeConfig 'module.config'

process FOO {
  // ...
}

This way, users could name the config file whatever they want. On the other hand, it wouldn't make sense to support both approaches, so if we went this way then users would just have to specify the same include line of code many times. So I think I would rather not do this.

ewels commented 9 months ago

Related: made an issue about reducing config duplication of publishDir: https://github.com/nextflow-io/nextflow/issues/4661

bentsherman commented 6 months ago

Closing, see https://github.com/nextflow-io/nextflow/pull/4510#issuecomment-2071409697