pditommaso opened this issue 2 years ago
I think the problem comes from the current implementation: we are overriding the standard ConfigSlurper and trying to do too many things at the same time, instead of first parsing the configuration and then applying it.
For example, we implement the profiles feature by trying to merge the different configurations on the fly, but maybe it would be as "easy" as using the environments feature already provided by ConfigSlurper (https://wjw465150.github.io/blog/Groovy/my_data/%E5%AE%98%E6%96%B9%E4%BE%8B%E5%AD%90/Groovy%20-%20ConfigSlurper.htm), one environment for as many profiles as the user wants, and then merging them all afterwards.
Regarding malicious code, we could evaluate the configuration in a secured context, with minimal imports and no access to the rest of the application.
Likewise, another side of the problem is that parsing the file requires compiling the corresponding Groovy code. Powerful, but in some contexts that cannot be done.
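For illustration only, a minimal sketch of both ideas, assuming plain Groovy `ConfigSlurper` environments stand in for profiles and a `SecureASTCustomizer` restricts what a config script may contain; the config text and profile names below are invented:

```groovy
import org.codehaus.groovy.control.CompilerConfiguration
import org.codehaus.groovy.control.customizers.SecureASTCustomizer

// Invented config text: each profile becomes a ConfigSlurper "environment"
def text = '''
workDir = 's3://my-bucket/work'
environments {
    aws   { process.executor = 'awsbatch' }
    local { process.executor = 'local' }
}
'''

// Restrict what the compiled config script is allowed to do
def secure = new SecureASTCustomizer()
secure.methodDefinitionAllowed = false   // no custom functions inside config files
// (import whitelists and other restrictions could be tightened here as well)

def cc = new CompilerConfiguration()
cc.addCompilationCustomizers(secure)

// Select the 'aws' profile and parse with the restricted compiler configuration
def slurper = new ConfigSlurper('aws')
slurper.classLoader = new GroovyClassLoader(this.class.classLoader, cc)

def config = slurper.parse(text)
assert config.process.executor == 'awsbatch'
assert config.workDir == 's3://my-bucket/work'
```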
What do you think about wrapping some of the functionality from the @nf-core JSON Schema into the new config file format? I'm mostly thinking of variable types, but `enum`, `pattern` and `required` would also be nice. It would be great to move as much of the custom nf-core validation code into core Nextflow as possible 😊 (we've previously been aiming at a plugin, but this would be even better).
Some of the longer-form things like descriptions and help text may be better suited to staying in the JSON Schema. Keeping that file also allows us to customise and play with future extensions easily, so I'm not suggesting we kill it completely (though I'm not completely against that either).
Also, an early heads-up: we make heavy use of `includeConfig` within @nf-core configs to pull in remote config files (e.g. the shared nf-core/configs), and it works brilliantly. So if it's possible to keep the ability to include arbitrary config URLs, that would be ideal 👍🏻
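For context, remote includes are plain config statements today, along these lines (the exact URL is illustrative rather than the precise path nf-core uses):

```groovy
// nextflow.config: a local include plus a remote include from a shared config repository
includeConfig 'conf/base.config'
includeConfig 'https://raw.githubusercontent.com/nf-core/configs/master/nfcore_custom.config'
```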
Static typing and JSON Schema start to smell like overkill. IMO it should keep the good parts of the current system, i.e. variable interpolation, profiles and `includeConfig`, without the bad parts, i.e. closures, custom functions, conditional statements, etc.
Yeah I thought you might say that 😅 But if it's optional then it needn't be much? Hypothetical example based on the current syntax:
params {
    foo = 'bar', required: true
    baz = 'qux', type: 'int'
}
For many people this would be sufficient to completely remove `nextflow_schema.json` and all dependencies on the nf-core toolchain, whilst giving much more power for catching user error.
The current model has similarities to Spring Boot configuration properties. Hierarchical config, profiles and variable interpolation are must-haves. https://docs.spring.io/spring-boot/docs/current/reference/html/features.html#features.external-config
That model is great for Java apps. The validation logic though is separate and could be complex to recreate in a pipeline.
We want to require our pipelines to use nf-core's models, but they have to be customized for our internal needs. If some of those validation concepts or other features could be brought in natively or via plugins, that would be awesome.
Another interesting format could be TOML: https://toml.io/en/
I'm playing with a Java parser (https://github.com/mwanji/toml4j) and it's very interesting because it can map configuration files onto Java classes instead of simple maps (see the sketch after the example below).
An (invented) configuration could be:
includeConfig = [ 'b.toml', 'http://localhost/remote.toml' ]
[aws]
accessKey = '123'
[docker]
enabled = false
envWhitelist = ['a', 'b', "c"]
memory = 210
[profiles]
[profiles.aws]
[profiles.aws.process]
executor = 'sge'
queue = 'long'
memory = '10GB'
[profiles.cloud.process]
executor = 'sge'
queue = 'long'
memory = '1GB'
data = [ ["delta", "phi"], [3.14] ]
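As an illustration of that class mapping (not something Nextflow does today): assuming the example above were saved as `nextflow.toml` (an invented filename) and given an invented `DockerOpts` holder class, toml4j could bind the `[docker]` table roughly like this:

```groovy
@Grab('com.moandjiezana.toml:toml4j:0.7.2')
import com.moandjiezana.toml.Toml

// Invented holder class for the [docker] table above
class DockerOpts {
    boolean enabled
    List<String> envWhitelist
    Long memory
}

def toml = new Toml().read(new File('nextflow.toml'))

// Bind a whole table onto a typed object instead of a plain map
DockerOpts docker = toml.getTable('docker').to(DockerOpts)
assert docker.envWhitelist == ['a', 'b', 'c']
assert docker.memory == 210L

// Individual keys can still be read dynamically with dotted paths
assert toml.getString('aws.accessKey') == '123'
assert toml.getList('includeConfig').size() == 2
```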
Maybe I missed this in the thread already, but is there a reason that we can't use YAML? I feel like that's the format that most people are used to..
Equivalent YAML of the TOML above could be:
!include b.yml
!include "http://localhost/remote.toml"
aws:
  accessKey: 123
docker:
  enabled: false
  envWhitelist: ['a', 'b', "c"]
  memory: 210
profiles:
  aws:
    process:
      executor: sge
      queue: long
      memory: 10GB
  cloud:
    process:
      executor: sge
      queue: long
      memory: 1GB
      data:
        - ["delta", "phi"]
        - [3.14]
The `!include` bit is a bit non-standard, but similar things are used by a lot of other projects (e.g. pyyaml-include). YAML has the advantage that many Nextflow users are already using it, with the `-params-file` option. The only difference here would be to move it up a level, so that the YAML doesn't just go to `params` but also covers the other config scopes.
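A rough sketch of how such a non-standard `!include` tag could be wired up on the JVM with SnakeYAML; the tag is simply replaced by the parsed content of the referenced file or URL, the file names are illustrative, and top-level merge semantics are glossed over:

```groovy
@Grab('org.yaml.snakeyaml:snakeyaml:2.2')
import org.yaml.snakeyaml.LoaderOptions
import org.yaml.snakeyaml.Yaml
import org.yaml.snakeyaml.constructor.AbstractConstruct
import org.yaml.snakeyaml.constructor.SafeConstructor
import org.yaml.snakeyaml.nodes.Node
import org.yaml.snakeyaml.nodes.ScalarNode
import org.yaml.snakeyaml.nodes.Tag

// Replace any node tagged !include with the parsed content of the referenced file or URL
class ConstructInclude extends AbstractConstruct {
    Object construct(Node node) {
        String location = ((ScalarNode) node).value
        String text = location.startsWith('http') ? new URL(location).text
                                                  : new File(location).text
        return new Yaml().load(text)
    }
}

class IncludeConstructor extends SafeConstructor {
    IncludeConstructor(LoaderOptions opts) {
        super(opts)
        yamlConstructors.put(new Tag('!include'), new ConstructInclude())
    }
}

def yaml = new Yaml(new IncludeConstructor(new LoaderOptions()))
def config = yaml.load('''
base: !include b.yml
docker:
  enabled: false
''') as Map
```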
From my side, I'm only investigating other formats that could be interesting to consider.
Of course, YAML is probably the best-known format (but it also has a lot of problems), and we have libraries that allow us to parse YAML files.
The biggest problem is that YAML does not have variable interpolation.
HIML - hierarchical YAML config with variable interpolation: https://github.com/adobe/himl#interpolation Unfortunately, it is in Python and I couldn't find a Java flavor.
Yeah, I was thinking YAML anchors and aliases but that's full values and not partial strings.
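To make the gap concrete, a rough Groovy sketch (not an existing Nextflow feature) of resolving `${...}` placeholders after the YAML has already been parsed into nested maps, e.g. with SnakeYAML; the config values are invented:

```groovy
// A YAML document already parsed into nested maps/lists (hard-coded here for brevity)
def config = [
    outdir : '/data/results',
    publish: [path: '${outdir}/qc']
]

// Look up a dotted key such as 'publish.path' in the nested map
def lookup = { String key ->
    key.tokenize('.').inject(config) { node, part -> node?.getAt(part) }
}

// Recursively replace ${...} placeholders in string values
def interpolate
interpolate = { node ->
    if (node instanceof Map)
        return node.collectEntries { k, v -> [k, interpolate(v)] }
    if (node instanceof List)
        return node.collect { interpolate(it) }
    if (node instanceof String)
        return node.replaceAll('\\$\\{([^}]+)\\}') { match, key -> lookup(key) as String }
    return node
}

assert interpolate(config).publish.path == '/data/results/qc'
```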
I was just looking at a few `nextflow.config` files from nf-core pipelines and thinking it wasn't so bad, but the really scary ones are in `conf/modules.config`. For example, in nf-core/rnaseq's `conf/modules.config` we have stuff like this:
def rseqc_modules = params.rseqc_modules ? params.rseqc_modules.split(',').collect{ it.trim().toLowerCase() } : []

process {
    publishDir = [
        path: { "${params.outdir}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" },
        mode: params.publish_dir_mode,
        saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
    ]

    // ...

    withName: 'SALMON_INDEX' {
        ext.args = params.gencode ? '--gencode' : ''
        publishDir = [
            path: { "${params.outdir}/genome/index" },
            mode: params.publish_dir_mode,
            saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
            enabled: params.save_reference
        ]
    }
}

if (!(params.skip_fastqc || params.skip_qc)) {
    process {
        withName: '.*:FASTQC_UMITOOLS_TRIMGALORE:FASTQC' {
            ext.args = '--quiet'
        }
    }
}
I can't imagine all of this surviving a new format, but the less that we have to rewrite the better..
Yep, some of that syntax can be improved due to recent additions to NF (e.g. https://github.com/nextflow-io/nextflow/issues/2700) but it is really a side-effect of truly modularising within and between pipelines.
I think it goes without saying that it would be nice to test any prototype config format with an nf-core pipeline because I'm sure we will be able to unravel some edge cases!
YAML gets hairy when it comes to anything non-trivial -- check out the issues with GitLab CI support. TOML is a nicer version of the same sort of thing.
I am interested in this line you have in the original post...
By this do you mean having a formal grammar for the Nextflow DSL, e.g. using BNF notation? That would be really useful.
Some of the problems described here (like those with the nf-core configs) can be addressed in other ways, such as module config (#4510) and the `resourceLimits` directive (#2911). So I think in this issue we can focus on parsing and variable evaluation.
I have developed a custom parser for Nextflow config files which defines a formal grammar that is much more strict:
It would allow us to have much better control over parsing and evaluation compared to the current system. It could also be used to develop IDE tooling for config files, such as linting and code completion.
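Purely illustrative (not the actual grammar of that parser): a stricter config syntax might accept nothing beyond includes, assignments, named scopes and selectors, while rejecting arbitrary Groovy statements:

```groovy
// Accepted: includes, simple assignments, named scopes and process selectors
includeConfig 'conf/base.config'

params.outdir = 'results'

process {
    withName: 'FASTQC' {
        cpus = 2
    }
}

profiles {
    docker {
        docker.enabled = true
    }
}

// Rejected under a stricter grammar: arbitrary Groovy statements such as
//   def threads = Runtime.runtime.availableProcessors()
//   if (params.big) { process.memory = '64 GB' }
```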
The key question around whether to move to a new format or improve the current one is how to handle variable interpolation:
Would it be unreasonable to support both Nextflow config and YAML/TOML? Then we could say "if you need to use dynamic expressions, use Nextflow config, otherwise YAML is fine". We could still support config inclusion and dynamic strings in YAML, but we wouldn't have to come up with some kludgy solution to support Groovy closures in YAML. And with the custom parser, we could bring the developer experience for Nextflow configs up to par with YAML.
You're dangerously close to making GroovyMake with Groovy code embedded in YAML 🤣 .
The choice is a conundrum. On the one hand, ditching the current system in favour of something like TOML that could be read by external systems represents a loss of functionality (as an aside, if you allow Groovy expressions as strings in those files, external tools would need to know how to interpret Groovy). On the other hand, it is very powerful for external systems to be able to read the configuration files and perform tasks such as parameter validation.
The central issue here is that the current config files are overloaded in their duty. They can contain both data and code; what's needed, I feel, is a better separation of the two. The nf-core idea of parameter schemas was a step in this direction; however, they are not tightly integrated into Nextflow and don't replace the native config files, which leads to all manner of tedious issues.
So perhaps you want a system whereby best practice is to reserve the current Nextflow config files for dynamic configuration, while data resides in a document freely parseable by any external tool. You could likely get a long way towards this goal by writing a joint parser for Nextflow config files and nf-core schemas that populates the former with data from the latter.
I don't think the parameter schema should be part of the config file. It should be in the top-level workflow definition as inputs, or in a separate file like the nf-core parameter schema. Config files should be able to reference params and even set default values, but they should not be the source of truth for params.
So now I'm wondering, if you take away that example, what other use case is there for loading a Nextflow config file in an external tool? If there aren't any others, then maybe external tooling isn't important and we can safely stick to the current config format.
The problem with keeping the parameter schema separate from the pipeline defaults in the config is that the two need to be kept in sync. It's irritating to have to add the same thing in two places as a pipeline developer.
There are two things that need to be loaded by external tools - the schema (to build launch tools / render documentation) and the config (to pre-populate forms and validate inputs before running Nextflow).
I can't tell who's violently agreeing with who.
I think I'm violently agreeing with @ewels: use something generic like the nf-core schema for loading and understanding static data that may ultimately flow from the user. Use something like the current Nextflow config to provide flexibility to developers and power users for augmenting the runtime environment. This would mean narrowing the scope of what the current Nextflow config can do (e.g. it won't contain a `params { }` stanza).
@ewels when you say that external tools need to load the schema and the config, by "config" do you mean just the params block?
I'm thinking that config files should still be allowed to set param defaults for backwards compatibility (and because it is useful to do in profiles), but users should be encouraged to specify default values in the schema file so that there is one source of truth, as you say. As Chris said, users should avoid having a `params` block in their config file.
I think we are all in violent agreement
Yes, I think so 👍🏻 We already encourage people to use `-params-yaml`, so I guess the missing piece would be to get Nextflow to parse and use `nextflow_schema.json`? (or equivalent). Then the params could be removed from `nextflow.config` and the pipeline developer would only need to care about the schema file.
...and ~remove~ override Nextflow's dodgy type casting in favour of what's declared in the schema
This is getting away from the config file, but what do you think about putting the params schema in the pipeline code? The top-level anonymous workflow could have an "input" section that defines inputs with type, default value, etc. This way, the language server could ensure that the params are used correctly in the pipeline code. Nextflow could export this definition to YAML/JSON for use with external tools.
This is naturally not a single source of truth: it's just a way of generating a secondary file which could become out of sync with the primary document, just like the current situation.
Let's continue the discussion of the params schema here: #4669
In any case, I think we agree that the params definition should be separate from the config file. That tells me that it would be fine to keep the current config format and implement a custom parser to enforce a simpler syntax without the Groovy quirks.
Rationale
The configuration file is a key component of the Nextflow framework since it allows workflow developers to decouple the pipeline logic from the execution parameters and infrastructure deployment settings.
The current configuration mechanism is extremely flexible and powerful. It uses a customised Groovy ConfigSlurper class, extended to handle multiple profiles and config file inclusions.
A Nextflow configuration file is ultimately a compiled Java class on steroids.
However, this has serious drawbacks:
Goals
Non-goals
Action
This issue should explore possible alternative implementations and define a migration strategy.
Currently the best alternatives are: