Closed pditommaso closed 6 months ago
See also #4322
Tagging @bentsherman @drpatelh
Possible implementation:
module.config
when parsing the module includingparams
in the evaluation context See also: #4205
Additional notes from hackathon:
test.config
to be loaded by the test harness like a normal config, not used in real pipelinesnextflow.config
/ module.config
to each module with pipeline-specific configI don't like that the proposed syntax (which I'll call the "programmatic config") requires you to create a new name and reference it in the pipeline script, but I see why you did it, so that the config can be set programmatically in the pipeline code.
The question is whether this programmatic config is needed. All of the if statements in the modules.config
are used only to replicate the workflow logic to target a specific process, but they are eliminated if the config is defined in the module scope, even with conventional process selectors.
My impression from talking to developers at the hackathon is that they like process selectors but they just need a way to use them in the context of a module, to ensure that the selector won't be applied to completely different processes which they didn't intend.
Both process selectors and the programmatic config should work equally well, but the former is more familiar while the latter imposes some extra steps with dubious value.
For example, suppose workflow A calls subworkflow B which calls process C, and A wants to apply some config to C (taking precedence over config from B). By nf-core convention, each component will reside in its own module. The programmatic config would look like this:
// A/module.config
CONFIG_C.ext.args = '--foo'
// A/main.nf
workflow A {
B(CONFIG_C)
}
// B/module.config
CONFIG_C.ext.args = '--bar'
CONFIG_C.publishDir = params.outdir
// B/main.nf
workflow B {
take:
config_c_extra
main:
C.config = CONFIG_C + config_c_extra
C()
}
// C/main.nf
process C {
}
This strikes me as more complicated than it needs to be. The user is basically re-implementing the logic of process selectors in their pipeline code. While this programmability enables things that are not possible with (declarative) process selectors, I don't think uses are asking for those things.
As a baseline, I will first implement the module config with process selectors. If it enables the config to be fully declarative then that should be enough. Perhaps then we can also try the programmatic config and see what users prefer.
Rewriting the above example with process selectors:
// A/module.config
withName:'A:B:C' {
ext.args = '--foo'
}
// A/main.nf
workflow A {
B()
}
// B/module.config
withName:'B:C' {
ext.args = '--bar'
publishDir = params.outdir
}
// B/main.nf
workflow B {
C()
}
// C/main.nf
process C {
}
Much simpler!
Ok, we talked. Let's go ahead for the first iteration using standard process selector syntax applied to process *included* in the local module.
Okay, the PR and rnaseq POC are working quite well. I was able to remove all of the if statements without triggering any config selector warnings, even for processes that aren't always used. This is because the module config is not applied to the process until the process is actually invoked, so there will never be any hanging configs. On the other hand, if you misspell a process selector then it should still warn you.
One minor detail is that the module config file name must always be the same, currently it is module.config
. This is because Nextflow loads the module config by convention, the module script does not reference the config file directly.
As an alternative, we could do something like includeConfig
for module scripts, which would allow the module script to explicitly import a config file by name:
// main.nf
includeConfig 'module.config'
process FOO {
// ...
}
This way, users could name the config file whatever they want. On the other hand, it wouldn't make sense to support both approaches, so if we went this way then users would just have to specify the same include line of code many times. So I think I would rather not do this.
Related: made an issue about reducing config duplication of publishDir
: https://github.com/nextflow-io/nextflow/issues/4661
Context
The use of DSL2 introduces the use of module components that allow the inclusion and re-use a process more than one time with different names.
A common requirement in this scenario is that a process configuration needs to be customised, for example, to provide a specific
publishDir
path or command arguments.Currently, this need is addressed by creating a monolithic configuration file in which the process configuration is provided by using the target process name. For example here
https://github.com/nf-core/rnaseq/blob/master/conf/modules.config#L174-L192
In some case, there's a conditional logic that needs to be duplicated in the workflow script. For example here:
https://github.com/nf-core/rnaseq/blob/master/conf/modules.config#L209-L224
This approach has several disadvantages:
Solution
A possible solution could be to introduce a module-level configuration file that allows the definition of named configuration template that can applied to one or more processes in the same module.
For example the following configuration file could be added in a workflow module:
Note: in the example above,
FOO
andBAR
are arbitrary identifiers the user chooses. They have no relationship with the process name where they are meant to be used.In the workflow module, they could be used using the following syntax:
Note: explorative syntax, it may change
In the above example, the process
PROC1
is executed having the config setting defined by the template with nameFOO
is applied. Same with with processPROC2
with configBAR
.Conditional configuration could be defined using usual
if
and expression statement e.g.This approach has several advantages:
withName
selector mechanismCaveats
module.config
ormain.config
withName
selector should be able to override settings defined by a config template (to not break existing config e.g. groundswell)