nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0
2.71k stars 622 forks source link

Evolution of Nextflow configuration file #2723

Open pditommaso opened 2 years ago

pditommaso commented 2 years ago

Rationale

The configuration file is a key component of the Nextflow framework since it allows workflow developers to decouple the pipeline logic from the execution parameters and infrastructure deployment settings.

The current configuration mechanism is extremely flexible and powerful. It uses a customised GroovySlurper class with extended to handle multiple profiles and config file inclusions.

A Nextflow configuration file is ultimately a compiled Java class on steroids.

However, this has serious drawbacks:

  1. The evaluation of config file, requires Java class compilation and interpretation, which can be time-consuming compared to text file parsing.
  2. The evaluation of the config requires ultimately the execution of JVM code, which could be explored by malicious users and pose a non-trivial security burden to be handled in a managed environment.
  3. The current configuration allows the use of arbitrary imperative programming statements within a config file, which is a bad practice, making the config logic poorly readable.
  4. The parser implementation is becoming including hard to be maintained, and patches easily introduce unexpected side-effects, resulting in increased stability issues.

Goals

Non-goals

Action

This issue should explore possible alternative implementation and define a migration strategy.

Currently the best alternatives are:

jorgeaguileraseqera commented 2 years ago

I think the problem comes from the current implementation because we are overwriting the current ConfigSlurper trying to do more things at the same time instead of first parsing the configuration and after applying them

For example, we implement the profile feature trying to merge the different configurations on the fly but maybe is to "easy" as use the environment feature implemented in ConfigSlurper (https://wjw465150.github.io/blog/Groovy/my_data/%E5%AE%98%E6%96%B9%E4%BE%8B%E5%AD%90/Groovy%20-%20ConfigSlurper.htm) as many profiles the user wants and after merge all of them

About malicious code we can execute the evaluation of the configuration into a secured context with minimal imports and no access to the rest of the application

pditommaso commented 2 years ago

Likely, another side of the problem is that to parse the file you need to compile the corresponding groovy code. Powerful but in some context cannot be done.

ewels commented 2 years ago

What do you think about wrapping some of the functionality from the @nf-core JSON Schema into the new config file format? I'm mostly thinking variable type, but enum, pattern and required would also be nice. It would be nice to move as much of the custom nf-core validation code into core-Nextflow as possible 😊 (we've previously been aiming at a plugin, but this would be even better)

Some of the longer-form things like description and help text may be better suited to staying in the JSON Schema.. Keeping that file also allows us to customise and play with future extensions easily, so I'm not suggesting we kill it completely (though I'm not completely against that either).

ewels commented 2 years ago

Also, an early heads-up: we make heavy use of includeConfig within @nf-core configs to pull remote config files (eg. shared nf-core/configs) and it works brilliantly. So if it's possible to keep this functionality of including arbitrary config URLs then that would be ideal 👍🏻

pditommaso commented 2 years ago

Static typing and JSON schema start to smell overkilling. IMO it should be the good parts of the current system i.e. variables interpolation, profiles, includeConfig without the bad parts ie closures, custom functions, conditional statements, etc

ewels commented 2 years ago

Yeah I thought you might say that 😅 But if it's optional then it needn't be much? Hypothetical example based on the current syntax:

params {
    foo = 'bar', required: true
    baz = 'qux', type: 'int'
}

For many people this would be sufficient to completely remove nextflow_schema.json and all dependencies on the nf-core toolchain, whilst giving much more power for catching user error.

mbwhelan commented 2 years ago

The current model has similarities to Spring boot configuration properties. Hierarchical config, profiles, variable interpretation are must haves. https://docs.spring.io/spring-boot/docs/current/reference/html/features.html#features.external-config

That model is great for Java apps. The validation logic though is separate and could be complex to recreate in a pipeline.

We want to enforce our pipelines to use nf-cores models, but it has to be customized for our internal needs. If some of those validation concepts or other items could be brought in natively or via plugins that would be awesome.

jorgeaguileraseqera commented 2 years ago

Another interesting format can be toml https://toml.io/en/

I'm playing with a java parser (https://github.com/mwanji/toml4j) and it's very interesting because it can map configuration files against Java classes instead simple maps

an (invented) configuration could be:

includeConfig = [  'b.toml', 'http://localhost/remote.toml' ]
[aws]
        accessKey = '123'
[docker]
    enabled = false
    envWhitelist = ['a', 'b', "c"]
    memory = 210

[profiles]
    [profiles.aws]
        [profiles.aws.process]
            executor = 'sge'
            queue = 'long'
            memory = '10GB'
        [profiles.cloud.process]
            executor = 'sge'
            queue = 'long'
            memory = '1GB'
            data = [ ["delta", "phi"], [3.14] ]
ewels commented 2 years ago

Maybe I missed this in the thread already, but is there a reason that we can't use YAML? I feel like that's the format that most people are used to..

Equivalent YAML of the TOML above could be:

!include b.yml
!include "http://localhost/remote.toml"
aws:
    accessKey: 123
docker:
    enabled: false
    envWhitelist: ['a', 'b', "c"]
    memory: 210

profiles:
    aws:
        process:
            executor: sge
            queue: long
            memory: 10GB
    cloud:
        process:
            executor: sge
            queue: long
            memory: 1GB
            data:
                - ["delta", "phi"]
                - [3.14]

The !include bit is a bit non-standard, but similar things are used by a lot of other projects (eg. pyyaml-include).

YAML has the advantage that many Nextflow users are already using it, with the -params-file option. The only difference here would be to move it up a level, so that the YAML doesn't just go to params but also across other config scopes.

jorgeaguileraseqera commented 2 years ago

From my side, I'm only investigating other formats that can be interesting to consider

of course, YAML is probably the most famous format (but also has a lot of problems) and we have libraries that allow us to parse yml files

pditommaso commented 2 years ago

Biggest problem yaml does not have variables interpolation

MrMarkW commented 2 years ago

HIML - hierarchical yaml config with variable interpolation: https://github.com/adobe/himl#interpolation Unfortunately, it is in python and I couldn't find a java flavor.

ewels commented 2 years ago

Yeah, I was thinking YAML anchors and aliases but that's full values and not partial strings.

I was just looking at a few nextflow.config files from nf-core pipelines and thinking it wasn't so bad, but the really scary ones are in conf/modules.config. For example, in nf-core/rnaseq conf/modules.config we have stuff like this:

def rseqc_modules = params.rseqc_modules ? params.rseqc_modules.split(',').collect{ it.trim().toLowerCase() } : []

process {
    publishDir = [
        path: { "${params.outdir}/${task.process.tokenize(':')[-1].tokenize('_')[0].toLowerCase()}" },
        mode: params.publish_dir_mode,
        saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
    ]

   // ...

    withName: 'SALMON_INDEX' {
        ext.args   = params.gencode ? '--gencode' : ''
        publishDir = [
            path: { "${params.outdir}/genome/index" },
            mode: params.publish_dir_mode,
            saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
            enabled: params.save_reference
        ]
    }
}

if (!(params.skip_fastqc || params.skip_qc)) {
    process {
        withName: '.*:FASTQC_UMITOOLS_TRIMGALORE:FASTQC' {
            ext.args   = '--quiet'
        }
    }
}

I can't imagine all of this surviving a new format, but the less that we have to rewrite the better..

drpatelh commented 2 years ago

Yep, some of that syntax can be improved due to recent additions to NF (e.g. https://github.com/nextflow-io/nextflow/issues/2700) but it is really a side-effect of truly modularising within and between pipelines.

I think it goes without saying that it would be nice to test any prototype config format with an nf-core pipeline because I'm sure we will be able to unravel some edge cases!

cjw85 commented 2 years ago

yaml, gets hairy when when it comes to anything non-trivial -- check out issues in gitlab CI support. toml is a nicer version of the same sort of thing.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

emyr666 commented 10 months ago

Am interested in this line you have in the original post...

By this do you mean to have a formal grammar for the NextFlow DSL e.g. using BNF notation ? This would be really useful.

bentsherman commented 8 months ago

Some of the problems described here (like with the nf-core configs) can be addressed in other ways like module config (#4510) and resourceLimits directive (#2911). So I think in this issue we can focus on parsing and variable evaluation.

I have developed a custom parser for Nextflow config files which defines a formal grammar that is much more strict:

It would allow us to have much better control over the parsing and evaluation compared to the current system. Could also be used to develop IDE tooling for config files like linting and code completion.

The key question around whether to move to a new format or improve the current one is how to handle variable interpolation:

Would it be unreasonable to support both Nextflow config and YAML/TOML? Then we could say "if you need to use dynamic expressions, use Nextflow config, otherwise YAML is fine". We could still support config inclusion and dynamic strings in YAML, but we wouldn't have to come up with some kludgy solution to support Groovy closures in YAML. And with the custom parser, we could bring the developer experience for Nextflow configs up to par with YAML.

cjw85 commented 8 months ago

You're dangerously close to making GroovyMake with Groovy code embedded in YAML 🤣 .

Your choice is a conundrum, on the one hand it represents a loss of functionality to ditch the current system in favour of something like TOML that could be read by external systems. (An aside, if you allow groovy expressions as strings in those the external tools would need to know how to interpret groovy). On the other, its very powerful for external systems to be able to read the configuration files and be able to perform tasks such as parameter validation.

The central issue here is that the current config files are overloaded in their duty. The can contain both data and code; what's needed I feel is a better separation of those. The nf-core idea of parameter schemas was a step in this direction, however they are not tightly integrated into Nextflow and don't replace the native config files; which leads to all manner of tedious issues.

So perhaps you want a system whereby best practice is to reserve the current nexflow config files for dynamic configuration, and data resides in a document freely parseable by any external tool. You could likely get a long way to this goal by writing a joint parser for nextflow config files and nf-core schemas that populates the former with data from the later.

bentsherman commented 8 months ago

I don't think the parameter schema should be part of the config file. It should be in the top-level workflow definition as inputs, or in a separate file like the nf-core parameter schema. Config files should be able to reference params and even set default values, but they should not be the source of truth for params.

So now I'm wondering, if you take away that example, what other use case is there for loading a Nextflow config file in an external tool? If there aren't any others, then maybe external tooling isn't important and we can safely stick to the current config format.

ewels commented 8 months ago

The problem with keeping the parameter schema separate from the config pipeline defaults is that the two need to be kept in sync. It's irritating to need to add the same thing in two places as a pipeline developer.

There are two things that need to be loaded by external tools - the schema (to build launch tools / render documentation) and the config (to pre-populate forms and validate inputs before running Nextflow).

cjw85 commented 8 months ago

I can't tell who's violently agreeing with who.

I think I'm violently agreeing with @ewels: use something generic like the nf-core schema for loading and understanding static data that may ultimately flow from the user. Use something like the currently nextflow config to provide flexibility to developers and power users for augmenting the runtime environment. This would mean narrowing the scope of what the current nextflow config can do (e.g. it won't contain a params { } stanza`).

bentsherman commented 8 months ago

@ewels when you say that external tools need to load the schema and the config, by "config" do you mean just the params block?

I'm thinking that config files should still be allowed to set param defaults for backwards compatibility (and because it is useful to do in profiles), but users should be encouraged to specify default values in the schema file so that there is one source of truth as you say. As Chris said, users should avoid having a params block in their config file.

I think we are all in violent agreement

ewels commented 8 months ago

Yes, I think so 👍🏻 We already encourage people to use -params-yaml, so I guess the missing piece would be to get Nextflow to parse and use nextflow_schema.json? (or equivalent). Then the params could be removed from nextflow.config and the pipeline developer would only need to care about the schema file.

cjw85 commented 8 months ago

...and ~remove~ override nextflow's dodgy type casting in favour of whats declared in the schema

bentsherman commented 8 months ago

This is getting away from the config file, but what do you think about putting the params schema in the pipeline code? The top-level anonymous workflow could have an "input" section that defines inputs with type, default value, etc. This way, the language server could ensure that the params are used correctly in the pipeline code. Nextflow could export this definition to YAML/JSON for use with external tools

cjw85 commented 8 months ago

This is naturally not a single source of truth: its just a way of generating a secondary file which could become out of sync with the primary document, just like the current situation.

bentsherman commented 8 months ago

Let's continue the discussion of the params schema here: #4669

In any case, I think we agree that the params definition should be separate from the config file. That tells me that it would be fine to keep the current config format and implement a custom parser to enforce a simpler syntax without the Groovy quirks.